I'm trying to load the google/gemma-3-27b-it model using Hugging Face's Text Generation Inference (TGI) with Docker on a Windows Server machine equipped with 3 x NVIDIA RTX 3090 GPUs (each 24GB VRAM). My objective is to load the full model (not quantized) and serve it through TGI using multi-GPU parallelism (sharding).
Setup:
- GPUs: 3 x NVIDIA RTX 3090 (24GB each)
- Driver: 560.94
- CUDA: 12.6
- Host OS: Windows Server (with WSL2 backend for Docker Desktop)
- Docker image: ghcr.io/huggingface/text-generation-inference:latest
- Model: google/gemma-3-27b-it (converted and stored locally in gemma-3 directory)
Docker Command I'm Using:
docker run --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  -p 8080:80 \
  -v $(pwd)/gemma:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/gemma-3 \
  --num-shard 3
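For completeness, I believe the TGI documentation suggests increasing the container's shared-memory size when sharding across multiple GPUs, since NCCL communicates through /dev/shm. A variant I plan to test simply adds --shm-size 1g (the 1g value comes from that suggestion, not from anything I've verified for this model):

docker run --gpus all \
  --shm-size 1g \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  -p 8080:80 \
  -v $(pwd)/gemma:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/gemma-3 \
  --num-shard 3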
Problem:
- Despite having 3 GPUs and setting --num-shard 3, the container fails while loading the model and exits with the NCCL error shown below.
What I’ve Tried:
- Set --num-shard 3 and CUDA_VISIBLE_DEVICES=0,1,2
- Verified that my model folder contains the config, tokenizer, and .safetensors weights (approx. 44GB total)
- Confirmed that all 3 GPUs are available and mostly idle, and that the driver and CUDA versions are compatible (see the quick visibility check sketched after this list)
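For the GPU-visibility point above, a quick container-level sanity check I plan to run overrides the TGI image's entrypoint with nvidia-smi (assuming the NVIDIA container runtime injects nvidia-smi into the container, as it normally does with --gpus all):

docker run --rm --gpus all \
  --entrypoint nvidia-smi \
  ghcr.io/huggingface/text-generation-inference:latest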
Error I'm Facing:
- torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3144, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1 ncclUnhandledCudaError: Call to CUDA function failed. Last error: Cuda failure 1 'out of memory'
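As the message itself suggests, my next step is to re-run with NCCL debug logging enabled so the underlying CUDA failure shows up in the container logs; this only adds one environment variable to the docker run above:

docker run --gpus all \
  -e NCCL_DEBUG=INFO \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  ... (rest of the command unchanged)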
Question:
- Are there any settings I'm missing, or is there a known issue that might be stopping TGI from properly splitting (sharding) the model across all my GPUs when running inside Docker?
Goal:
Load and serve the full-precision (unquantized) google/gemma-3-27b-it model across all 3 GPUs using TGI, preferably in Docker, for runtime inference.