I have NLP tensors with shapes train: (22k, 170, 300), val: (2k, 170, 300), test: (25k, 170, 300), where the last dimension (300) holds FastText embeddings, and a single Tesla GPU with 32 GB of memory. I'm doing model selection on an untrained RNN; the batch/buffer size is 64 and there are 5 layers:
for config in param_grid:
    model = self.create_model(config)
    train = model.forward(embedded_training_data)
    val = model.forward(embedded_val_data)
    test = model.forward(embedded_test_data)
Each model is a Keras Sequential whose layers are all keras.layers.Bidirectional with return_sequences=True, so every layer's output is 3D (batch, timesteps, features). The forward method below processes the data buffer by buffer through compute_states:
def compute_states(self, x):
    x_train_states = []
    for i, layer in enumerate(self.layers):
        outputs, r, b = layer(x)           # full sequence plus the two final states
        x_train_states.append(outputs)     # append to the list
        x = outputs                        # update input for next layer
    return tf.concat(x_train_states, axis=2) if x_train_states else None
@tf.function
def forward(self, data):
    total_samples = tf.shape(data)[0]
    buffer_size = tf.constant(self.buffer_size, dtype=tf.int32)
    num_batches = tf.cast(tf.math.ceil(total_samples / buffer_size), tf.int32)
    states_array = tf.TensorArray(dtype=tf.float32, size=num_batches)
    for i in tf.range(num_batches):
        start_idx = i * buffer_size
        end_idx = tf.minimum((i + 1) * buffer_size, total_samples)
        batch = data[start_idx:end_idx]
        states = self.compute_states(batch)
        states_array = states_array.write(i, states)
    states = states_array.concat()
    return states
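
For completeness, create_model builds the stack roughly along these lines (a minimal sketch: the SimpleRNN cell and config["units"] are placeholders for what the grid search actually varies; the parts I rely on are that every layer is Bidirectional with return_sequences=True and return_state=True, which is why layer(x) unpacks into three values, and that the layers end up in self.layers for compute_states to iterate):

import tensorflow as tf
from tensorflow import keras

def create_model(self, config):
    # Sketch only: 5 Bidirectional RNN layers that all return the full
    # sequence plus their final states. Cell type and width are placeholders.
    self.layers = [
        keras.layers.Bidirectional(
            keras.layers.SimpleRNN(
                config["units"], return_sequences=True, return_state=True
            )
        )
        for _ in range(5)
    ]
    return self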
These functions work fine and are very fast on CPU, but on GPU I get an OOM error at the concatenation of the batches (states_array.concat()). I'd like to know whether there is an issue in my code that I could optimize, or whether the tensor dimensions are simply intractable.
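
For a sense of scale, here is a rough footprint of the single tensor produced by states_array.concat() for the training split alone (the per-direction width is a guess of 150 units, i.e. 300 output features per Bidirectional layer):

samples, timesteps, n_layers, units = 22_000, 170, 5, 150   # units per direction is a guess
features = n_layers * 2 * units                             # 1500 concatenated features
size_gib = samples * timesteps * features * 4 / 1024**3     # float32 = 4 bytes
print(f"{size_gib:.1f} GiB")                                # ~20.9 GiB

Even with that placeholder width, the concatenated result is a large fraction of the 32 GB card before counting the per-batch activations, so I can't tell whether this is a code problem or purely a size problem.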