I'm trying to load the google/gemma-3-27b-it model using Hugging Face's Text Generation Inference (TGI) with Docker on a Windows Server machine equipped with 3 x NVIDIA RTX 3090 GPUs (each 24GB VRAM). My objective is to load the full model (not quantized) and serve it through TGI using multi-GPU parallelism (sharding).
Setup:
- GPUs: 3 x NVIDIA RTX 3090 (24GB each)
- Driver: 560.94
- CUDA: 12.6
- Host OS: Windows Server (with WSL2 backend for Docker Desktop)
- Docker image: ghcr.io/huggingface/text-generation-inference:latest
- Model: google/gemma-3-27b-it (converted and stored locally in gemma-3 directory)
Docker Command I'm Using:
docker run --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  -p 8080:80 \
  -v $(pwd)/gemma:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/gemma-3 \
  --num-shard 3
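For completeness, I believe the TGI documentation suggests increasing the container's shared-memory size when sharding across multiple GPUs, since NCCL communicates through /dev/shm. A variant I plan to test simply adds --shm-size 1g (the 1g value comes from that suggestion, not from anything I've verified for this model):

docker run --gpus all \
  --shm-size 1g \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  -p 8080:80 \
  -v $(pwd)/gemma:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/gemma-3 \
  --num-shard 3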
Problem:
- Despite having 3 GPUs and setting --num-shard 3, the container fails while loading the model and exits with the NCCL error shown below.
What I’ve Tried:
- Set --num-shard 3 and CUDA_VISIBLE_DEVICES=0,1,2
- Verified that my model folder contains the config, tokenizer, and .safetensors weights (approx. 44GB total)
- Confirmed that all 3 GPUs are available and mostly idle, and that the driver and CUDA versions are compatible (see the quick visibility check sketched after this list)
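For the GPU-visibility point above, a quick container-level sanity check I plan to run overrides the TGI image's entrypoint with nvidia-smi (assuming the NVIDIA container runtime injects nvidia-smi into the container, as it normally does with --gpus all):

docker run --rm --gpus all \
  --entrypoint nvidia-smi \
  ghcr.io/huggingface/text-generation-inference:latest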
Error I'm Facing:
- torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3144, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1 ncclUnhandledCudaError: Call to CUDA function failed. Last error: Cuda failure 1 'out of memory'
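As the message itself suggests, my next step is to re-run with NCCL debug logging enabled so the underlying CUDA failure shows up in the container logs; this only adds one environment variable to the docker run above:

docker run --gpus all \
  -e NCCL_DEBUG=INFO \
  -e CUDA_VISIBLE_DEVICES=0,1,2 \
  ... (rest of the command unchanged)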
Question:
- Are there any settings I'm missing, or is there a known issue that might be stopping TGI from properly splitting (sharding) the model across all my GPUs when running inside Docker?
Goal:
Load and serve the full-precision (unquantized) google/gemma-3-27b-it model across all 3 GPUs using TGI, preferably in Docker, for runtime inference.