cuda - How to deal with a process holding nvidia GPU memory after termination? - Stack Overflow


I am facing an issue with a process that holds GPU memory even after I have terminated it. Here's a detailed breakdown of the situation:

The process (a CUDA application) is running and occupies GPU memory.

When I stop the process, it disappears from nvidia-smi and gpustat, but it still holds GPU memory and the utilization rate stays at 100%, like this:

[6] NVIDIA A100 80GB PCIe | 52°C, 100 % | 10151 / 81920 MB | (null)

nvidia-smi and gpustat no longer show the PID, but nvidia-smi --query-compute-apps=pid,used_memory --format=csv still lists the PID of the process occupying memory. However:

When I try to kill it using kill -9 <pid>, I get the error: no such process.

The process is not shown as a zombie or defunct process in standard process listings (ps, top).
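To make the mismatch concrete, here is the kind of cross-check I have been running (a rough bash sketch; it assumes the CSV query returns bare PIDs, one per line):

    # cross-check every PID the driver still reports against /proc
    for pid in $(nvidia-smi --query-compute-apps=pid --format=csv,noheader); do
        if [ -d "/proc/$pid" ]; then
            # a live (or zombie) process would show up here with its state
            grep State "/proc/$pid/status"
        else
            # this is what I actually see: the driver knows the PID,
            # but the kernel has no trace of it
            echo "PID $pid: reported by the driver, absent from /proc"
        fi
    done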

Driver Version: 535.183.01
CUDA Version: 12.2
GPU: NVIDIA A100 80GB PCIe

This issue persists, and I cannot free up the GPU memory. Have you encountered this problem before? How can I forcefully reclaim the GPU memory or kill such processes when kill -9 doesn't seem to work?
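For context, these are the remedies I am considering. This is a rough sketch, not yet verified on this machine, and the device index 6 / the /dev/nvidia6 node are my assumptions based on the gpustat line above:

    # see whether any process still has a GPU device node open (needs root)
    sudo fuser -v /dev/nvidia*

    # if fuser does report a live PID, it can kill by file handle:
    # sudo fuser -k /dev/nvidia6    # assumed node for GPU index 6

    # last resort: reset the GPU; this only succeeds when no process
    # is using it and the GPU is not driving a display
    sudo nvidia-smi --gpu-reset -i 6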

Any suggestions or insights on how to resolve this would be greatly appreciated.

Thanks in advance!
