I have a Jupyter notebook running on the opt/conda/bin/python kernel
on my Google Compute Engine machine (Debian).
The first cell of my notebook loads image data from the disk and saves it in the variables train_images, train_labels, etc.
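For context, that cell looks roughly like this; the data/ folder layout, the .jpg extension, and the img_size value are simplified placeholders for my actual loading code:

import numpy as np
import tensorflow as tf
from pathlib import Path

# Placeholder layout: data/<split>/<category>/*.jpg
categories = sorted(p.name for p in Path('data/train').iterdir() if p.is_dir())
img_size = 128  # placeholder value

def load_split(split):
    images, labels = [], []
    for label, category in enumerate(categories):
        for path in Path('data', split, category).glob('*.jpg'):
            img = tf.keras.utils.load_img(path, target_size=(img_size, img_size))
            images.append(tf.keras.utils.img_to_array(img) / 255.0)
            labels.append(label)
    return np.array(images, dtype='float32'), np.array(labels, dtype='int32')

train_images, train_labels = load_split('train')
validate_images, validate_labels = load_split('validate')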
The next cell builds a model with Keras and trains it:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Input
from tensorflow.keras.optimizers import Adam

# Simple fully connected classifier on flattened RGB images
model = Sequential()
model.add(Input(shape=(img_size, img_size, 3)))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(len(categories), activation='softmax'))

model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_images,
                    train_labels,
                    epochs=20,
                    batch_size=32,
                    validation_data=(validate_images, validate_labels))
My workflow is to run the first cell once to load the data and then re-run the second cell multiple times with different hyperparameters. However, after doing this about 5 times I get:
2025-02-08 14:00:15.818993: W external/local_tsl/tsl/framework/bfc_allocator:482] Allocator (GPU_0_bfc) ran out of memory trying to allocate 502MiB (rounded to 1207959552) requested by op StatelessRandomUniformV2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
I've tried tf.keras.backend.clear_session(), gc.collect(), and setting the environment variable as suggested in the error message, but the only thing that works is restarting the kernel. This is very annoying because the data loading in my first cell takes quite a long time, and I'd like to only have to do it once.
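Concretely, the cleanup cell I run between trials looks roughly like this (the del line is just my attempt to drop references to the previous trial; the other two calls are the ones mentioned above):

import gc
import tensorflow as tf

# Drop Python references to the previous trial so they can be garbage collected
del model, history

# Reset Keras' global state and force a garbage collection pass
tf.keras.backend.clear_session()
gc.collect()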
How can I tell TensorFlow to release the GPU memory from previous cell executions, or to reset the GPU? I don't need the allocations from old trials, so why do they accumulate?
I came across this GitHub issue where people suggest hacky workarounds like running the training in a separate process and terminating that process afterwards (roughly the sketch below), but there has to be a better way.
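For completeness, my understanding of what those answers suggest is something like the following sketch; train_one_trial is a hypothetical wrapper around my training cell, and everything TensorFlow-related happens inside the child process:

import multiprocessing as mp

def train_one_trial(train_images, train_labels, validate_images, validate_labels,
                    img_size, categories, learning_rate):
    # Import TensorFlow inside the child so only this process initializes the GPU
    import tensorflow as tf
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten, Input

    model = Sequential()
    model.add(Input(shape=(img_size, img_size, 3)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(len(categories), activation='softmax'))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_images, train_labels,
              epochs=20, batch_size=32,
              validation_data=(validate_images, validate_labels))
    # All GPU memory is released when this process exits

# Launch a trial in its own process; 'spawn' avoids forking a process
# that may already have touched CUDA
ctx = mp.get_context('spawn')
p = ctx.Process(target=train_one_trial,
                args=(train_images, train_labels,
                      validate_images, validate_labels,
                      img_size, categories, 0.0001))
p.start()
p.join()

As far as I can tell this works because only the child process ever initializes the GPU, so its memory is freed when the process exits. But it means pickling the data across the process boundary on every trial, and with the spawn start method the target function usually has to live in an importable module rather than in the notebook itself, which is why it feels like a workaround rather than a proper fix.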