
python - How can I free the nvidia gpu memory allocated by tensorflow (2.17.0) in my running jupyter notebook? - Stack Overflow


I have a jupyter notebook running on the kernel opt/conda/bin/python in my Google Compute Engine machine (Debian).

The first cell of my notebook loads image data from the disk and saves it in the variables train_images, train_labels, etc.

The next cell creates a model with keras and trains it:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Input
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Input(shape=(img_size, img_size, 3)))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(len(categories), activation='softmax'))
model.compile(optimizer=Adam(learning_rate=0.0001), 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])

history = model.fit(train_images, 
                    train_labels, 
                    epochs=20,
                    batch_size=32,
                    validation_data=(validate_images, validate_labels)) 

My workflow is running the first cell once to load my data and then running the second cell multiple times, trying different hyperparameters. However, after doing this ~5 times I get:

2025-02-08 14:00:15.818993: W external/local_tsl/tsl/framework/bfc_allocator.cc:482] Allocator (GPU_0_bfc) ran out of memory trying to allocate 502MiB (rounded to 1207959552) requested by op StatelessRandomUniformV2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 

I've tried tf.keras.backend.clear_session(), gc.collect(), and setting the environment variable suggested in the error message, but the only thing that works is restarting the kernel. That is very annoying because the data loading in my first cell takes quite a while, and I'd like to do it only once.
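
For reference, the cleanup I run between trials looks roughly like this (just a sketch of the calls mentioned above; as far as I understand, TF_GPU_ALLOCATOR only has an effect if it is set before TensorFlow initializes the GPU, so I set it before the first import):

import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # must be set before TF touches the GPU

import gc
import tensorflow as tf

# after a training run: drop references to the model and ask Keras to reset its state
del model, history
tf.keras.backend.clear_session()
gc.collect()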

How can I tell TensorFlow to clear the GPU memory from previous cell executions, or reset the GPU? I do not need the memory allocated by old trials, so why does it accumulate?

I came across this GitHub issue where people suggest hacky workarounds like running the training in a separate process and terminating that process afterwards (sketched below), but there has to be a better way.
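
For completeness, that subprocess workaround would look something like the sketch below. Everything except my data variables is made up for illustration, and in a notebook the 'spawn' start method usually requires the training function to live in an importable .py file rather than in a cell; the point is only that the GPU memory is released when the child process exits.

import multiprocessing as mp

def train_once(train_images, train_labels, validate_images, validate_labels, queue):
    # TensorFlow is imported inside the child, so the CUDA context is created
    # here and torn down (freeing all GPU memory) when the process exits.
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten, Input
    from tensorflow.keras.optimizers import Adam

    num_classes = int(train_labels.max()) + 1  # assumes integer labels 0..N-1
    model = Sequential()
    model.add(Input(shape=train_images.shape[1:]))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer=Adam(learning_rate=0.0001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(train_images, train_labels, epochs=20, batch_size=32,
                        validation_data=(validate_images, validate_labels))
    queue.put(history.history)  # send back plain Python dicts, not TF objects

ctx = mp.get_context('spawn')        # fresh process, fresh CUDA context
queue = ctx.Queue()
p = ctx.Process(target=train_once,
                args=(train_images, train_labels,
                      validate_images, validate_labels, queue))
p.start()
history_dict = queue.get()           # per-epoch training metrics
p.join()                             # GPU memory is freed when the child exits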
