I am facing an issue with a process that holds GPU memory even after I have terminated it. Here's a detailed breakdown of the situation:
The process (a CUDA application) is running and occupies GPU memory.
When I stop the process, it disappears from nvidia-smi and gpustat, but it still holds GPU memory and the utilization rate stays at 100%, like this:
[6] NVIDIA A100 80GB PCIe | 52'C, 100 % | 10151 / 81920 MB | (null)
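For reference, I see the same reading when querying nvidia-smi directly (output paraphrased from memory, not a verbatim copy):

$ nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
index, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
6, 100 %, 10151 MiB, 81920 MiB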
Neither nvidia-smi nor gpustat shows the PID anymore, but nvidia-smi --query-compute-apps=pid,used_memory --format=csv still lists the PID of the process occupying memory. However, when I try to kill it using kill -9 <pid>, I get the error: no such process.
The process is not shown as a zombie or defunct process in standard process listings (ps, top).
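To make the sequence concrete, here is roughly what I run (the PID 12345 is only a placeholder, and the exact output and error wording may differ slightly on your system):

$ nvidia-smi --query-compute-apps=pid,used_memory --format=csv
pid, used_memory [MiB]
12345, 10151 MiB

$ kill -9 12345
kill: (12345) - No such process

$ ps -p 12345
  PID TTY          TIME CMD
(no matching process)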
Driver Version: 535.183.01
CUDA Version: 12.2
GPU: NVIDIA A100 80GB PCIe
This issue persists, and I cannot free up the GPU memory. Have you encountered this problem before? How can I forcefully reclaim the GPU memory or kill such processes when kill -9 doesn't seem to work?
Any suggestions or insights on how to resolve this would be greatly appreciated.
Thanks in advance!