
Access denied on PVC mount after Kubernetes cluster worker node reboot


After a graceful restart of the worker nodes, I'm seeing an unusual "access denied" error on the PVC used for the LLM model cache, which is backed by a local-nfs StorageClass.

  Warning  FailedMount       16m                  kubelet            MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: Created symlink /run/systemd/system/remote-fs.target.wants/rpc-statd.service → /lib/systemd/system/rpc-statd.service.
mount.nfs: Operation not permitted
  Warning  FailedMount  16m  kubelet  MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: mount.nfs: Operation not permitted
  Warning  FailedMount  15s (x14 over 16m)  kubelet  MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: mount.nfs: access denied by server while mounting 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69

This is causing pods to be stuck in ContainerCreating status.

videosearch        vss-blueprint-0                                                   0/1     ContainerCreating   0              20h    <none>            worker-1    <none>
videosearch        vss-vss-deployment-5f758bc5df-fbm66                               0/1     Init:0/3            0              21h    <none>            worker-1    <none>
vllm               llama3-70b-bc4788446-9q8c2                                        0/1     ContainerCreating   0              21h    <none>            worker-2    <none>
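For completeness, these are roughly the checks behind the statements above (pod and namespace names are taken from the listing; only a sketch, the full output is trimmed):

    # The mount events quoted earlier come from describing one of the stuck pods
    kubectl -n vllm describe pod llama3-70b-bc4788446-9q8c2

    # The PV and the PVC bound to it still report a Bound status
    kubectl get pv pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
    kubectl get pvc -A | grep 8d73fc95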

The PV and PVC both show as healthy (Bound); it seems that only the mount command being issued on the node for the pods is failing.
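To take the kubelet out of the picture, the same mount can be attempted by hand from the affected worker node, using the exact arguments from the event above (a sketch; it assumes shell access to the node and that nfs-common is installed):

    # Is the export still visible to this client?
    showmount -e 10.101.156.22

    # Repeat the exact mount the kubelet is issuing
    sudo mkdir -p /mnt/pvc-test
    sudo mount -t nfs -o retrans=2,timeo=30,vers=3 \
        10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /mnt/pvc-test
    # (umount /mnt/pvc-test afterwards if it succeeds)

    # NFSv3 relies on rpcbind/rpc-statd on the client; check they came back up after the reboot
    systemctl status rpcbind rpc-statd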

My previous workaround was to delete the PV and PVC and then redeploy the entire Helm chart, but having to redeploy a major workload after every restart is not ideal.
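For reference, that workaround was roughly the following (the PVC, release, and chart names here are placeholders, not the real ones):

    # Remove the claim and the released volume...
    kubectl -n vllm delete pvc <model-cache-pvc>
    kubectl delete pv pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69

    # ...then redeploy the whole chart so the PVC is re-created and re-provisioned
    helm upgrade --install <release-name> <chart> -n vllm -f values.yaml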

Would anyone happen to have a suggestion for an issue like this? Thanks so much in advance.
