Unable to Load google/gemma-3-27b-it on 3 x RTX 3090 GPUs using TGI in Docker

I'm trying to load the google/gemma-3-27b-it model using Hugging Face's Text Generation Inference (TGI) with Docker on a Windows Server machine equipped with 3 x NVIDIA RTX 3090 GPUs (24GB VRAM each). My objective is to load the full, non-quantized model and serve it through TGI using multi-GPU parallelism (sharding); a rough memory-budget check follows the setup list below.

Setup:

  • GPUs: 3 x NVIDIA RTX 3090 (24GB each)
  • Driver: 560.94
  • CUDA: 12.6
  • Host OS: Windows Server (with WSL2 backend for Docker Desktop)
  • Docker image: ghcr.io/huggingface/text-generation-inference:latest
  • Model: google/gemma-3-27b-it (converted and stored locally in gemma-3 directory)
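
For context, here is the rough memory budget: with --num-shard 3 and even sharding, the ~44GB of safetensors weights work out to roughly 14.7GB per GPU, which should leave about 9GB of each 24GB card for the KV cache, activations, and NCCL buffers. Per-GPU memory can be checked before launching the container with a standard nvidia-smi query:

nvidia-smi --query-gpu=index,memory.total,memory.free --format=csv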

Docker Command I'm Using:

docker run --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  -p 8080:80 \
  -v $(pwd)/gemma:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/gemma-3 \
  --num-shard 3
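
Note that this command leaves Docker's shared memory (/dev/shm) at its 64MB default. The TGI docs recommend raising it because NCCL can fall back to host shared memory for cross-GPU communication, so one variant I plan to test is the following (a sketch, untested on my setup; --ipc=host is the usual alternative):

docker run --gpus all \
  --shm-size 1g \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  -p 8080:80 \
  -v $(pwd)/gemma:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/gemma-3 \
  --num-shard 3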

Problem:

  • Despite having 3 GPUs and setting --num-shard 3, the container fails while loading the model (full error trace below).

What I’ve Tried:

  • Setting --num-shard 3 and CUDA_VISIBLE_DEVICES=0,1,2

  • Verified that my model folder contains config, tokenizer, and .safetensors weights (approx. 44GB total)

  • All 3 GPUs are available and mostly idle (verified with the check below). Driver and CUDA versions are compatible.
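
For reference, GPU visibility from inside a container was confirmed like this (the CUDA base-image tag is just an example):

docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi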

Error I'm Facing:

torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3144, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Cuda failure 1 'out of memory'
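
As the message itself suggests, the next step is to re-run with NCCL debug logging enabled. The same command with the extra environment variable would look like this (sketch):

docker run --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  -e NCCL_DEBUG=INFO \
  -p 8080:80 \
  -v $(pwd)/gemma:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/gemma-3 \
  --num-shard 3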

Question:

  • Are there any settings I'm missing, or is there a known issue that might be preventing TGI from properly sharding the model across all three GPUs when running inside Docker?

Goal:

Load and serve the full-precision google/gemma-3-27b-it model across 3 GPUs using TGI, preferably in Docker, for runtime inference without quantization.
