I have a Jupyter notebook running on the opt/conda/bin/python kernel
on my Google Compute Engine machine (Debian).
The first cell of my notebook loads image data from the disk and saves it in the variables train_images, train_labels, etc.
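For context, that cell looks roughly like this; the data/ folder layout, the .jpg extension, and the img_size value are simplified placeholders for my actual loading code:

import numpy as np
import tensorflow as tf
from pathlib import Path

# Placeholder layout: data/<split>/<category>/*.jpg
categories = sorted(p.name for p in Path('data/train').iterdir() if p.is_dir())
img_size = 128  # placeholder value

def load_split(split):
    images, labels = [], []
    for label, category in enumerate(categories):
        for path in Path('data', split, category).glob('*.jpg'):
            img = tf.keras.utils.load_img(path, target_size=(img_size, img_size))
            images.append(tf.keras.utils.img_to_array(img) / 255.0)
            labels.append(label)
    return np.array(images, dtype='float32'), np.array(labels, dtype='int32')

train_images, train_labels = load_split('train')
validate_images, validate_labels = load_split('validate')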
The next cell builds a model with Keras and trains it:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Input
from tensorflow.keras.optimizers import Adam

# Simple fully connected classifier on flattened RGB images
model = Sequential()
model.add(Input(shape=(img_size, img_size, 3)))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(len(categories), activation='softmax'))

model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_images,
                    train_labels,
                    epochs=20,
                    batch_size=32,
                    validation_data=(validate_images, validate_labels))
My workflow is to run the first cell once to load the data and then re-run the second cell multiple times with different hyperparameters. However, after doing this about 5 times I get:
2025-02-08 14:00:15.818993: W external/local_tsl/tsl/framework/bfc_allocator:482] Allocator (GPU_0_bfc) ran out of memory trying to allocate 502MiB (rounded to 1207959552) requested by op StatelessRandomUniformV2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
I've tried tf.keras.backend.clear_session(), gc.collect(), and setting the environment variable as suggested in the error message, but the only thing that works is restarting the kernel. This is very annoying because the data loading in my first cell takes quite a long time, and I'd like to only have to do it once.
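Concretely, the cleanup cell I run between trials looks roughly like this (the del line is just my attempt to drop references to the previous trial; the other two calls are the ones mentioned above):

import gc
import tensorflow as tf

# Drop Python references to the previous trial so they can be garbage collected
del model, history

# Reset Keras' global state and force a garbage collection pass
tf.keras.backend.clear_session()
gc.collect()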
How can I tell TensorFlow to release the GPU memory from previous cell executions, or to reset the GPU? I don't need the allocations from old trials, so why do they accumulate?
I came across this GitHub issue where people suggest hacky workarounds like running the training in a separate process and terminating that process afterwards (roughly the sketch below), but there has to be a better way.
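For completeness, my understanding of what those answers suggest is something like the following sketch; train_one_trial is a hypothetical wrapper around my training cell, and everything TensorFlow-related happens inside the child process:

import multiprocessing as mp

def train_one_trial(train_images, train_labels, validate_images, validate_labels,
                    img_size, categories, learning_rate):
    # Import TensorFlow inside the child so only this process initializes the GPU
    import tensorflow as tf
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten, Input

    model = Sequential()
    model.add(Input(shape=(img_size, img_size, 3)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(len(categories), activation='softmax'))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_images, train_labels,
              epochs=20, batch_size=32,
              validation_data=(validate_images, validate_labels))
    # All GPU memory is released when this process exits

# Launch a trial in its own process; 'spawn' avoids forking a process
# that may already have touched CUDA
ctx = mp.get_context('spawn')
p = ctx.Process(target=train_one_trial,
                args=(train_images, train_labels,
                      validate_images, validate_labels,
                      img_size, categories, 0.0001))
p.start()
p.join()

As far as I can tell this works because only the child process ever initializes the GPU, so its memory is freed when the process exits. But it means pickling the data across the process boundary on every trial, and with the spawn start method the target function usually has to live in an importable module rather than in the notebook itself, which is why it feels like a workaround rather than a proper fix.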