I am running Triton Inference Server (tritonserver) on an RTX 4090 inside a Docker container, started with the following command:
sudo docker run --gpus='"device=0"' -e CUDA_VISIBLE_DEVICES=0 -d --shm-size=1g --ulimit memlock=-1
-p 8010:8000 -p 8011:8001 -p 8012:8002 --ulimit stack=67108864 -ti nvcr.io/nvidia/tritonserver:24.08-py3
It works fine, but whenever I run some CUDA computation on the GPU outside of Docker and then stop and restart the container, I get the following error:
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
Running the nvidia-smi command on my host machine shows:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 35C P8 19W / 500W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
and nvcc -V shows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
and I can still use the GPU for CUDA computation outside the Docker container.
To resolve the problem, I removed Docker entirely, reinstalled it, and re-created the container, which then worked fine. But as soon as I run some code on the host machine and restart the container, the same error appears again.
I need to stop and start the same Docker container frequently during development, so deleting and reinstalling Docker over and over is not practical. Is there any way to solve this issue?