I am currently training my PyTorch model on my university's computing nodes, and my data lives on the university's remote filesystem as well. I am using num_workers>0 and have multiple runs going in parallel (a rough sketch of my loader setup follows the traceback below). Although I have never had this problem before, all of my runs now crash with this error:
PermissionError: Caught PermissionError in DataLoader worker process 6.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/MASynth/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/root/miniconda3/envs/MASynth/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/MASynth/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/remote/fs/users/UNet/code/mat_unet/version_4/data_3.py", line 183, in __getitem__
    sample['semantic'] = to_tensor(normalize_images(np.expand_dims(cv2.resize(read_npz(semantic_dir), dsize=(256, 256), interpolation=cv2.INTER_NEAREST), axis=0), max_val=40))
  File "/remote/fs/users/UNet/code/mat_unet/version_4/utils.py", line 469, in read_npz
    with np.load(file) as data:
  File "/root/miniconda3/envs/MASynth/lib/python3.9/site-packages/numpy/lib/npyio.py", line 427, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
PermissionError: [Errno 13] Permission denied: '/remote/fs/datasets/dataset_name/version_2.0/folder1/folder2/file.npz'
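For context, read_npz in utils.py is just a thin wrapper around np.load, and each run builds a fairly standard DataLoader over the .npz files. Here is a rough sketch of what that looks like; the array key, glob pattern, batch size, and worker count below are placeholders rather than my exact values (the crash above came from worker process 6, so the real worker count is above that):

```python
import glob
import numpy as np
from torch.utils.data import DataLoader, Dataset

def read_npz(file):
    # Thin wrapper around np.load, as in utils.py.
    # ("arr_0" is a placeholder key; the real helper returns the stored array.)
    with np.load(file) as data:
        return data["arr_0"]

class NpzDataset(Dataset):
    # Stand-in for my actual dataset class in data_3.py.
    def __init__(self, files):
        self.files = files
    def __len__(self):
        return len(self.files)
    def __getitem__(self, idx):
        return read_npz(self.files[idx])

# Placeholder file discovery under the dataset root from the error message:
file_list = sorted(glob.glob("/remote/fs/datasets/dataset_name/version_2.0/**/*.npz", recursive=True))

# Each parallel run builds its own loader roughly like this
# (batch_size and num_workers here are illustrative values):
loader = DataLoader(NpzDataset(file_list), batch_size=16, num_workers=8, shuffle=True)
```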
All my runs crash at different times, each pointing to a different file. What could be causing this, and how can I fix it?
TIA
Previously, I was able to run multiple parallel processes on the same dataset without any issues. I have checked all the permissions needed for this dataset, and they are fine. I have done my best to make sure my code is bug-free, but I still cannot get past the PermissionError.
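In case it helps, spot checks along these lines succeed when I run them by hand on the same nodes (the path is one of the files a failed run pointed at; this is just a sanity check, not my training code):

```python
import os

path = "/remote/fs/datasets/dataset_name/version_2.0/folder1/folder2/file.npz"

# Readability check for the user my jobs run as:
print(os.access(path, os.R_OK))    # prints True when I test interactively
print(oct(os.stat(path).st_mode))  # mode bits show read access for my user
```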