
distributed computing - PyTorch DDP Multi-Node Training: ncclInternalError: Internal check failed. Bootstrap : no socket interface found


I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on the worker node(s):

[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/user/workspace/ddp/main.py", line 159, in <module>
[rank4]:     main()
[rank4]:   File "/home/user/workspace/ddp/main.py", line 90, in main
[rank4]:     ddp_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)
[rank4]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank4]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank4]:   File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
[rank4]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank4]: ncclInternalError: Internal check failed.
[rank4]: Last error:
[rank4]: Bootstrap : no socket interface found
[rank4]:[W131 14:34:49.202068506 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see .html#shutdown (function operator())
W0131 14:34:49.846516 2700574 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2700596 closing signal SIGTERM
W0131 14:34:49.847558 2700574 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2700598 closing signal SIGTERM
E0131 14:34:49.944460 2700574 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2700595) of binary: /home/user/workspace/ddp/.venv3.11/bin/python3.11
Traceback (most recent call last):
  File "/home/user/workspace/ddp/.venv3.11/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-31_14:34:49
  host      : *****
  rank      : 6 (local_rank: 2)
  exitcode  : 1 (pid: 2700597)
  error_file: <N/A>
  traceback : To enable traceback see: .html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-31_14:34:49
  host      : *****
  rank      : 4 (local_rank: 0)
  exitcode  : 1 (pid: 2700595)
  error_file: <N/A>
  traceback : To enable traceback see: .html
============================================================
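For context, the failing DistributedDataParallel call (main.py, line 90 in the traceback) sits in a setup that looks roughly like the sketch below. The LOCAL_RANK handling and the DDP arguments are taken from the traceback; everything else (the placeholder model, the bare init_process_group call, the exact torchrun flags in the comment) is an assumption, not my exact code.

import os
import torch
import torch.distributed as dist

# Each node is launched with something like (exact flags are an assumption):
#   torchrun --nnodes=2 --nproc-per-node=4 --rdzv-backend=c10d \
#            --rdzv-endpoint=<master-host>:29500 main.py
# The log above (rank 4 = local_rank 0, rank 6 = local_rank 2) suggests
# 4 processes per node, with the failure on the second node.

LOCAL_RANK = int(os.environ["LOCAL_RANK"])   # set by torchrun for each worker process

dist.init_process_group(backend="nccl")      # NCCL backend for multi-node GPU training
torch.cuda.set_device(LOCAL_RANK)

model = torch.nn.Linear(10, 10).to(LOCAL_RANK)   # placeholder for the real model

# This is the call where the ncclInternalError above is raised:
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True,
    device_ids=[LOCAL_RANK],
    output_device=LOCAL_RANK,
)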

Environment:

  • PyTorch: 2.6.0
  • NCCL: 2.21.5
  • CUDA: 12.4
  • Python: 3.11

I tried changing the DDP backend from nccl to gloo in my argument parser:

parser.add_argument("--backend", type=str, default="nccl", choices=["nccl", "gloo", "mpi"], help="DDP backend")

When I set --backend=gloo, the script runs without errors, but training happens on the CPU instead of the GPUs. Since I need GPU acceleration, I have to use nccl, and that is exactly where the error above occurs.
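For completeness, the backend flag feeds into process-group creation and device placement roughly as in the sketch below. The helper name and the backend-to-device logic are illustrative assumptions about how my script behaves, not my actual code; they are only meant to show why the gloo path ends up on the CPU while nccl targets the GPUs.

import os
import torch
import torch.distributed as dist

def setup_distributed(backend: str) -> torch.device:
    # Hypothetical helper, just to illustrate how --backend is used.
    dist.init_process_group(backend=backend)
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    if backend == "nccl":
        torch.cuda.set_device(local_rank)
        return torch.device("cuda", local_rank)  # nccl path: model goes to the local GPU
    # In my setup the gloo path keeps everything on the CPU:
    return torch.device("cpu")

# Usage in main(), e.g.:
#   device = setup_distributed(args.backend)
#   model = model.to(device)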
