After a graceful restart of my nodes, I'm hitting an unusual access-denied error on the PVC used for the LLM model cache, which lives on a local-nfs StorageClass. The kubelet events on the affected pod look like this:
Warning FailedMount 16m kubelet MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: Created symlink /run/systemd/system/remote-fs.target.wants/rpc-statd.service → /lib/systemd/system/rpc-statd.service.
mount.nfs: Operation not permitted
Warning FailedMount 16m kubelet MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: mount.nfs: Operation not permitted
Warning FailedMount 15s (x14 over 16m) kubelet MountVolume.SetUp failed for volume "pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /var/lib/kubelet/pods/70e3e22b-dd08-4945-a039-a9ce107e525d/volumes/kubernetes.io~nfs/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
Output: mount.nfs: access denied by server while mounting 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
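For reference, the same mount can be attempted by hand from the affected worker node to rule kubelet out; the server IP, options, and export path below are copied straight from the events above, and /mnt/nfs-test is just a hypothetical scratch mount point:

sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs -o retrans=2,timeo=30,vers=3 10.101.156.22:/export/pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69 /mnt/nfs-test
showmount -e 10.101.156.22    # does the server still export this path?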
This is causing pods to get stuck in ContainerCreating:
videosearch vss-blueprint-0 0/1 ContainerCreating 0 20h <none> worker-1 <none>
videosearch vss-vss-deployment-5f758bc5df-fbm66 0/1 Init:0/3 0 21h <none> worker-1 <none>
vllm llama3-70b-bc4788446-9q8c2 0/1 ContainerCreating 0 21h <none> worker-2 <none>
The PV and PVC both look healthy; it seems to be just the mount command issued by kubelet on the node that's failing.
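For completeness, this is roughly how I'm checking that (the PV name is from the events above; <pvc-name> and <namespace> are placeholders):

kubectl get pv pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
kubectl get pvc <pvc-name> -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

Everything comes back Bound with no warning events on the PVC itself.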
My previous workaround was to delete the PV and PVC and then redeploy the entire Helm chart, but having to redeploy a major workload after every restart is not ideal.
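Concretely, that workaround looks something like the following (release, chart, and PVC names are placeholders, not the real values):

kubectl delete pvc <pvc-name> -n <namespace>
kubectl delete pv pvc-8d73fc95-b785-4e12-b47a-c8d1c3d12f69
helm upgrade --install <release> <chart> -n <namespace>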
Would anyone happen to have a suggestion for something like this? Thanks so much in advance!