
OutOfMemoryError with PatchCore Training on 23.67 GiB GPU


I’m training a PatchCore model with an image size of 128x512 on a GPU with 23.67 GiB memory. However, I’m encountering the following error:

CUDA Version: 12.4
PyTorch Version: 2.5.1

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.17 GiB. GPU 0 has a total capacity of 23.67 GiB of which 47.88 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.29 GiB is allocated by PyTorch, and 15.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management.

Configuration (yaml):

data:
  class_path: anomalib.data.Folder
  init_args:
    name: train_data
    root: ""
    image_size:
      - 128
      - 512
    normal_dir: ""
    abnormal_dir: ""
    normal_test_dir: ""
    mask_dir: ""
    normal_split_ratio: 0
    extensions: [".png"]
    train_batch_size: 4
    eval_batch_size: 4
    num_workers: 8
    train_transform:
      class_path: torchvision.transforms.v2.Compose
      init_args:
        transforms:
          - class_path: torchvision.transforms.v2.RandomAdjustSharpness
            init_args:
              sharpness_factor: 0.7
              p: 0.5
          - class_path: torchvision.transforms.v2.RandomHorizontalFlip
            init_args:
              p: 0.5
          - class_path: torchvision.transforms.v2.Resize
            init_args:
              size: [128, 512]
          - class_path: torchvision.transforms.v2.Normalize
            init_args:
              mean: [0.485, 0.456, 0.406]
              std: [0.229, 0.224, 0.225]
    eval_transform:
      class_path: torchvision.transforms.v2.Compose
      init_args:
        transforms:
          - class_path: torchvision.transforms.v2.Resize
            init_args:
              size: [128, 512]
          - class_path: torchvision.transforms.v2.Normalize
            init_args:
              mean: [0.485, 0.456, 0.406]
              std: [0.229, 0.224, 0.225]

model:
  class_path: anomalib.models.Patchcore
  init_args:
    backbone: wide_resnet50_2
    layers:
      - layer2
      - layer3
    pre_trained: true
    coreset_sampling_ratio: 0.1
    num_neighbors: 9
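
For reference, the yaml above maps roughly onto the following Python API calls (a minimal sketch assuming anomalib's v1-style Folder, Patchcore and Engine classes; the Engine import and fit call are assumptions based on that API, the transform sections are omitted, and paths are left empty as in the config):

from anomalib.data import Folder
from anomalib.models import Patchcore
from anomalib.engine import Engine  # assumption: v1-style training engine

# Datamodule mirroring the data section of the config
datamodule = Folder(
    name="train_data",
    root="",              # paths left empty, as in the config
    normal_dir="",
    abnormal_dir="",
    image_size=(128, 512),
    extensions=[".png"],
    train_batch_size=4,
    eval_batch_size=4,
    num_workers=8,
)

# Model mirroring the model section of the config
model = Patchcore(
    backbone="wide_resnet50_2",
    layers=["layer2", "layer3"],
    pre_trained=True,
    coreset_sampling_ratio=0.1,
    num_neighbors=9,
)

engine = Engine()
engine.fit(model=model, datamodule=datamodule)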

Steps I’ve Tried:

Lowering the batch size: I reduced the batch size to as low as 1, but the issue persists.

Checking for memory fragmentation: Following the suggestion in the error message, I set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (see the snippet after this list), but this did not solve the problem.

Ensuring no memory leakage: I verified with nvidia-smi that no other processes are consuming GPU memory, but the allocated memory still maxes out during training.
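
For reference, a minimal sketch of how the allocator option can be set from Python (it can equally be exported in the shell before launching training). The CUDA caching allocator reads PYTORCH_CUDA_ALLOC_CONF when it is first initialised, so the variable must be set before any GPU work:

import os

# Set before torch initialises CUDA (safest: before importing torch at all).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402  (deliberately imported after setting the variable)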

Questions:

Are there specific optimizations for PatchCore or PyTorch that can help reduce memory usage?

1 Answer

Have you tried using mixed precision?

You can usually enable it by setting precision="16-mixed" on the Lightning Trainer. anomalib also seems to provide a way to use it during deployment.
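
A minimal sketch of what this could look like, assuming an anomalib v1-style Engine that forwards trainer arguments such as precision to the underlying Lightning Trainer (if your Engine does not accept it, the same flag can be passed to lightning.pytorch.Trainer directly):

from anomalib.engine import Engine  # assumption: v1-style training engine

# "16-mixed" enables automatic mixed precision: activations and most intermediate
# tensors are kept in float16, which can substantially cut GPU memory use during
# the backbone feature extraction that PatchCore relies on.
engine = Engine(precision="16-mixed")  # assumption: precision is forwarded to the Trainer
engine.fit(model=model, datamodule=datamodule)  # Patchcore model and Folder datamodule from the question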
