
Access denied on PVC mount after Kubernetes cluster worker node reboot


After a graceful restart of the worker nodes, I'm seeing an unusual "access denied" error on the PVC used for the LLM model cache, which is backed by a local-nfs StorageClass.

  Warning  FailedMount       16m                  kubelet            MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: Created symlink /run/systemd/system/remote-fs.target.wants/rpc-statd.service → /lib/systemd/system/rpc-statd.service.
mount.nfs: Operation not permitted
  Warning  FailedMount  16m  kubelet  MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: mount.nfs: Operation not permitted
  Warning  FailedMount  15s (x14 over 16m)  kubelet  MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: mount.nfs: access denied by server while mounting 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69

This is causing pods to be stuck in ContainerCreating status.

videosearch        vss-blueprint-0                                                   0/1     ContainerCreating   0              20h    <none>            worker-1    <none>
videosearch        vss-vss-deployment-5f758bc5df-fbm66                               0/1     Init:0/3            0              21h    <none>            worker-1    <none>
vllm               llama3-70b-bc4788446-9q8c2                                        0/1     ContainerCreating   0              21h    <none>            worker-2    <none>
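For completeness, these are roughly the checks behind the statements above (pod and namespace names are taken from the listing; only a sketch, the full output is trimmed):

    # The mount events quoted earlier come from describing one of the stuck pods
    kubectl -n vllm describe pod llama3-70b-bc4788446-9q8c2

    # The PV and the PVC bound to it still report a Bound status
    kubectl get pv pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
    kubectl get pvc -A | grep 8d73fc95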

The PV and PVC both show as healthy (Bound); it seems that only the mount command being issued on the node for the pods is failing.
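To take the kubelet out of the picture, the same mount can be attempted by hand from the affected worker node, using the exact arguments from the event above (a sketch; it assumes shell access to the node and that nfs-common is installed):

    # Is the export still visible to this client?
    showmount -e 10.101.156.22

    # Repeat the exact mount the kubelet is issuing
    sudo mkdir -p /mnt/pvc-test
    sudo mount -t nfs -o retrans=2,timeo=30,vers=3 \
        10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /mnt/pvc-test
    # (umount /mnt/pvc-test afterwards if it succeeds)

    # NFSv3 relies on rpcbind/rpc-statd on the client; check they came back up after the reboot
    systemctl status rpcbind rpc-statd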

My previous workaround was to delete the PV and PVC and then redeploy the entire Helm chart, but having to redeploy a major workload after every restart is not ideal.
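For reference, that workaround was roughly the following (the PVC, release, and chart names here are placeholders, not the real ones):

    # Remove the claim and the released volume...
    kubectl -n vllm delete pvc <model-cache-pvc>
    kubectl delete pv pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69

    # ...then redeploy the whole chart so the PVC is re-created and re-provisioned
    helm upgrade --install <release-name> <chart> -n vllm -f values.yaml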

Would anyone happen to have a suggestion for an issue like this? Thanks so much in advance.
