
docker - Segmentation fault when calling .backward() after moving data to GPU (PyTorch + CUDA 12.1) - Stack Overflow


I'm running into a segmentation fault (core dumped) error while training a model using PyTorch on a CUDA-enabled GPU. I'm not sure what's going wrong, and would really appreciate any guidance.

My Environment:

GPU: 2× NVIDIA GeForce RTX 4060 Ti
Driver Version: 550.120
CUDA Version (Driver-side): 12.4
cuDNN Version: 8902
PyTorch Version: 2.2.0+cu121
Python: 3.10.12
CUDA available: True
Detected CUDA from PyTorch: 12.1
Host OS: Ubuntu 24.04
Docker Image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
Kernel: Linux 6.8.0-55-generic x86_64 with glibc 2.35
Running inside Docker: Yes
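
For reference, these values can be reproduced inside the container with a short snippet along these lines (a minimal sketch, separate from the training script):

import platform
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for idx in range(torch.cuda.device_count()):
    print("  GPU", idx, "-", torch.cuda.get_device_name(idx))
print("Python:", platform.python_version())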

The Problem

During training, the script suddenly crashes with a segmentation fault. The crash does not happen at the same line every time: sometimes it happens in .backward(), sometimes while creating a tensor on the GPU with .to(device). It usually occurs after a few training batches, not at the very beginning.
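
Because CUDA kernels are launched asynchronously, the line where Python finally reports the fault is not necessarily the one that triggered it. A minimal diagnostic setup (illustrative only; it assumes it runs at the very top of the entry script, before any CUDA work) that makes the crash site deterministic looks like this:

import os
# Setting this before the first CUDA call forces synchronous kernel launches,
# so errors are reported at the line that actually caused them rather than at
# a later, unrelated call such as .backward() or .to(device).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")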

Here’s a simplified version of the code:

def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)

    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)

        # targets = torch.from_numpy(np.array(targets)).long().to('cuda:1')
        targets = torch.tensor(targets).long().to(device)
        loss = model.loss_function(scores, targets - 1)
        loss.backward()  # <- crash sometimes happens here
def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)

    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)

    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)  # <- or here

    hidden = model(items, A)

    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])

    return targets, model.compute_scores(seq_hidden, mask)

Error Excerpt

Training Progress:  20%|██        | 6/30 [16:14<1:04:58, 162.43s/it]
Fatal Python error: Segmentation fault

Current thread 0x00007... (most recent call first):
  <no Python frame>

Thread 0x00007...:
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  ...
  File "/usr/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266 in backward

My Question

I'm still new to CUDA programming and PyTorch internals, so I'm not sure:

Why might this segmentation fault occur?
Am I doing something wrong when moving data to the GPU?
Is there a safer or more proper way to handle tensors before calling .backward()?

Any help or explanation would be really appreciated. Thank you in advance!


asked Mar 28 at 4:22 by 탁승연

1 Answer


Here’s a modified version of your train_test_ht_sl function with a few defensive checks applied: targets is built directly on the target device with an explicit dtype, and NaN/Inf assertions are added before calling .backward():

def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)

    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)

        # Ensure targets are on the correct device
        targets = torch.tensor(targets, dtype=torch.long, device=device)
        
        # Check for NaNs or Infs
        assert not torch.isnan(targets).any(), "Targets contain NaNs"
        assert not torch.isinf(targets).any(), "Targets contain Infs"

        loss = model.loss_function(scores, targets - 1)
        
        # Check for NaNs in loss
        assert not torch.isnan(loss).any(), "Loss contains NaNs"
        
        loss.backward()  # <- crash sometimes happens here

def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)

    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)

    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)

    hidden = model(items, A)

    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])

    return targets, model.compute_scores(seq_hidden, mask)
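
Beyond the assertions above, autograd's anomaly detection can turn a failure during .backward() into a readable error that names the forward operation responsible. The sketch below is illustrative (it is not from the original answer) and assumes the training entry point shown in the question:

import torch

# Anomaly detection slows training considerably, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

# With anomaly detection on, a NaN/Inf produced during backward raises a Python
# error that points at the forward op responsible, instead of failing later.
train_test_ht_sl(model, train_data, test_data, head_list, tail_list)

Finally, since the script runs inside Docker, hard crashes that appear only after a few batches are often caused by the container's small default /dev/shm when torch.utils.data.DataLoader is used with num_workers > 0; starting the container with a larger shared-memory segment or --ipc=host, or setting num_workers=0, is a common workaround. The cu121 wheel bundles its own CUDA 12.1 libraries, so the 12.4 driver and the 12.4 runtime image are normally compatible with it and are unlikely to be the cause on their own.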