
docker - Segmentation fault when calling .backward() after moving data to GPU (PyTorch + CUDA 12.1) - Stack Overflow


I'm running into a segmentation fault (core dumped) error while training a model using PyTorch on a CUDA-enabled GPU. I'm not sure what's going wrong, and would really appreciate any guidance.

My Environment:

GPU: 2× NVIDIA GeForce RTX 4060 Ti
Driver Version: 550.120
CUDA Version (Driver-side): 12.4
cuDNN Version: 8902
PyTorch Version: 2.2.0+cu121
Python: 3.10.12
CUDA available: True
Detected CUDA from PyTorch: 12.1
Host OS: Ubuntu 24.04
Docker Image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
Kernel: Linux 6.8.0-55-generic x86_64 with glibc 2.35
Running inside Docker: Yes
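
For reference, these values can be reproduced inside the container with a short snippet along these lines (a minimal sketch, separate from the training script):

import platform
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for idx in range(torch.cuda.device_count()):
    print("  GPU", idx, "-", torch.cuda.get_device_name(idx))
print("Python:", platform.python_version())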

The Problem

During training, the script suddenly crashes with a segmentation fault. The crash does not happen at the same line every time: sometimes it happens in .backward(), sometimes while creating a tensor on the GPU with .to(device). It usually occurs after a few training batches, not at the very beginning.
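
Because CUDA kernels are launched asynchronously, the line where Python finally reports the fault is not necessarily the one that triggered it. A minimal diagnostic setup (illustrative only; it assumes it runs at the very top of the entry script, before any CUDA work) that makes the crash site deterministic looks like this:

import os
# Setting this before the first CUDA call forces synchronous kernel launches,
# so errors are reported at the line that actually caused them rather than at
# a later, unrelated call such as .backward() or .to(device).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")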

Here’s a simplified version of the code:

def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)

    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)

        # targets = torch.from_numpy(np.array(targets)).long().to('cuda:1')
        targets = torch.tensor(targets).long().to(device)
        loss = model.loss_function(scores, targets - 1)
        loss.backward()  # <- crash sometimes happens here
def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)

    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)

    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)  # <- or here

    hidden = model(items, A)

    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])

    return targets, model.compute_scores(seq_hidden, mask)

Error Excerpt

Training Progress:  20%|██        | 6/30 [16:14<1:04:58, 162.43s/it]
Fatal Python error: Segmentation fault

Current thread 0x00007... (most recent call first):
  <no Python frame>

Thread 0x00007...:
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  ...
  File "/usr/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266 in backward

My Question

I'm still new to CUDA programming and PyTorch internals, so I'm not sure:

Why might this segmentation fault occur?
Am I doing something wrong when moving data to the GPU?
Is there a safer or more proper way to handle tensors before calling .backward()?

Any help or explanation would be really appreciated. Thank you in advance!


asked Mar 28 at 4:22 by 탁승연

1 Answer


Here’s a modified version of your train_test_ht_sl function with a few defensive checks applied: targets is built directly on the target device with an explicit dtype, and NaN/Inf assertions are added before calling .backward():

def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)

    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)

        # Ensure targets are on the correct device
        targets = torch.tensor(targets, dtype=torch.long, device=device)
        
        # Check for NaNs or Infs
        assert not torch.isnan(targets).any(), "Targets contain NaNs"
        assert not torch.isinf(targets).any(), "Targets contain Infs"

        loss = model.loss_function(scores, targets - 1)
        
        # Check for NaNs in loss
        assert not torch.isnan(loss).any(), "Loss contains NaNs"
        
        loss.backward()  # <- crash sometimes happens here

def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)

    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)

    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)

    hidden = model(items, A)

    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])

    return targets, model.compute_scores(seq_hidden, mask)
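
Beyond the assertions above, autograd's anomaly detection can turn a failure during .backward() into a readable error that names the forward operation responsible. The sketch below is illustrative (it is not from the original answer) and assumes the training entry point shown in the question:

import torch

# Anomaly detection slows training considerably, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

# With anomaly detection on, a NaN/Inf produced during backward raises a Python
# error that points at the forward op responsible, instead of failing later.
train_test_ht_sl(model, train_data, test_data, head_list, tail_list)

Finally, since the script runs inside Docker, hard crashes that appear only after a few batches are often caused by the container's small default /dev/shm when torch.utils.data.DataLoader is used with num_workers > 0; starting the container with a larger shared-memory segment or --ipc=host, or setting num_workers=0, is a common workaround. The cu121 wheel bundles its own CUDA 12.1 libraries, so the 12.4 driver and the 12.4 runtime image are normally compatible with it and are unlikely to be the cause on their own.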