I’m working through an example of an encoder–decoder seq2seq model in TensorFlow with a Luong-style (dot-product) attention layer via tf.keras.layers.Attention. I followed a tutorial/book example that uses a TextVectorization layer with output_sequence_length=max_length. When I keep max_length, everything works fine. However, if I remove it because I want to handle variable-length input sequences, the model throws an INVALID_ARGUMENT: required broadcastable shapes error during training.
This makes me wonder why my char-level RNN model didn’t require a fixed max_length, but my attention-based seq2seq suddenly does. The error stack trace points to a shape mismatch in sparse_categorical_crossentropy/weighted_loss/Mul.
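To make concrete what I mean by variable-length output: without output_sequence_length, TextVectorization pads each call only to the longest sequence in that call, so the time dimension changes from batch to batch. A small standalone check (made-up sentences, vocabulary size chosen arbitrarily):
import tensorflow as tf

demo_vec = tf.keras.layers.TextVectorization(max_tokens=100)
demo_vec.adapt(["me gustan los perros", "te gustan los gatos"])

# Each call is padded only to the longest sequence in that call:
print(demo_vec(tf.constant(["me gustan los perros"])).shape)              # (1, 4)
print(demo_vec(tf.constant(["me gustan los perros y los gatos"])).shape)  # (1, 7)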
import tensorflow as tf
# Sample data
sentences_en = ["I love dogs", "You love cats", "We like soccer"]
sentences_es = ["me gustan los perros", "te gustan los gatos", "nos gusta el futbol"]
# TextVectorization without max_length
vocab_size = 1000
text_vec_layer_en = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size
    # NOTE: Removed output_sequence_length to allow variable-length inputs
)
text_vec_layer_es = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size
    # NOTE: Also removed output_sequence_length here
)
text_vec_layer_en.adapt(sentences_en)
text_vec_layer_es.adapt(["startofseq " + s + " endofseq" for s in sentences_es])
# Model Inputs
encoder_inputs = tf.keras.layers.Input(shape=(), dtype=tf.string)
decoder_inputs = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_input_ids = text_vec_layer_en(encoder_inputs)
decoder_input_ids = text_vec_layer_es(decoder_inputs)
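# Without output_sequence_length, these id tensors are padded only to the
# longest sentence in the current batch, so their time dimension varies from
# batch to batch (and the encoder and decoder lengths need not match).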
embed_size = 128
embedding_en = tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True)
embedding_es = tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True)
encoder_embeds = embedding_en(encoder_input_ids)
decoder_embeds = embedding_es(decoder_input_ids)
# Bidirectional Encoder
encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True, return_state=True)
)
encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder(encoder_embeds)
encoder_state_h = tf.concat([forward_h, backward_h], axis=-1)
encoder_state_c = tf.concat([forward_c, backward_c], axis=-1)
encoder_state = [encoder_state_h, encoder_state_c]
# Decoder
decoder = tf.keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeds, initial_state=encoder_state)
# Attention
attention_layer = tf.keras.layers.Attention()
attention_outputs = attention_layer([decoder_outputs, encoder_outputs])
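# Expected shapes at this point (T_enc / T_dec are the per-batch padded lengths):
#   encoder_outputs:   (batch, T_enc, 512)  # 256 units x 2 directions
#   decoder_outputs:   (batch, T_dec, 512)
#   attention_outputs: (batch, T_dec, 512)  # keeps the query's time dimension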
# Final Dense Output
output_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')
y_proba = output_layer(attention_outputs)
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[y_proba])
model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam', metrics=['accuracy'])
# Attempted training
# (Simplifying example: ignoring actual train split, just trying to run a dummy training step)
x_en = tf.constant(sentences_en)
x_es = tf.constant(["startofseq " + s + " endofseq" for s in sentences_es])
y_dummy = text_vec_layer_es(x_es)  # targets, padded once to the longest sentence in the whole set
model.fit([x_en, x_es], y_dummy, epochs=1)
Error:
INVALID_ARGUMENT: required broadcastable shapes
[[node gradient_tape/sparse_categorical_crossentropy/weighted_loss/Mul]]
...
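One thing I checked: a plain forward pass on a variable-length batch seems to run fine (below, a single made-up pair), which suggests the mismatch happens in the loss computation rather than inside the attention layer itself:
probe = model([tf.constant(["I love dogs"]),
               tf.constant(["startofseq me gustan los perros endofseq"])])
print(probe.shape)  # (1, T_dec, vocab_size), with T_dec set by this particular batch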
For comparison, here's the char-level (sequence-to-sequence) RNN model, which doesn't need a max_length in its TextVectorization layer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])
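Roughly how I built the char-RNN training data (shifted windows over the same tensor of token ids), which I think is why its input and target shapes always line up regardless of length; a simplified sketch with made-up ids:
def to_inputs_and_targets(window):
    return window[:, :-1], window[:, 1:]   # targets are the inputs shifted by one step

ids = tf.constant([[5, 2, 9, 7, 1, 3]])    # made-up token ids
x_chars, y_chars = to_inputs_and_targets(ids)
print(x_chars.shape, y_chars.shape)        # (1, 5) (1, 5) -- always the same time dimension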
Why do char-level RNNs or simple seq2seq models sometimes work without specifying max_length, but this attention-based model does not? Is it primarily because attention has to be computed over all input and output time steps, so max_length needs to be known ahead of time?
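For reference, the variant that does train for me is simply putting the fixed output length back on both vectorization layers (50 is just the value from the book example, nothing special about it):
max_length = 50
text_vec_layer_en = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_length)
text_vec_layer_es = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, output_sequence_length=max_length)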