I'm deploying a FastAPI backend using Hugging Face Transformers with the `mistralai/Mistral-7B-Instruct-v0.1` model, quantized to 4-bit using `BitsAndBytesConfig`. I'm running this inside an NVIDIA GPU container (CUDA 12.1, A10G GPU with 22 GB VRAM), and I keep hitting this error during model loading:

```
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models.
Please use the model as it is...
```
**What I've Done So Far:**

- I'm not calling `.to(...)` anywhere; I explicitly removed all such lines. ✅
- I'm using `quantization_config=BitsAndBytesConfig(...)` with `load_in_4bit=True` (constructed as shown below). ✅
- I removed `device_map="auto"`, as per the transformers GitHub issue. ✅
- I'm calling `.cuda()` only once on the model, after `.from_pretrained(...)`, as suggested. ✅
- The model and tokenizer are loaded from the Hugging Face Hub with `HF_TOKEN` properly set. ✅
- The system detects CUDA correctly: `torch.cuda.is_available()` is `True`. ✅
- Lastly, I cleared the Hugging Face cache (`~/.cache/huggingface`) and re-ran everything. ✅
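For reference, `quant_config` is constructed roughly like this (the exact 4-bit options below are representative; only `load_in_4bit=True` matters for this error):

```python
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization via bitsandbytes
    bnb_4bit_quant_type="nf4",              # representative values; my exact
    bnb_4bit_compute_dtype=torch.bfloat16,  # quant type / compute dtype may differ
)
```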
Here's the relevant part of the code that triggers the error:

```python
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quant_config,
    device_map=None,  # I explicitly removed device_map="auto"
    token=hf_token,
).cuda()  # This is the only use of `.cuda()`

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
```
Yet I still get the same `ValueError`.
1 Answer
The `ValueError` persists because calling `.cuda()` on a 4-bit quantized model is not allowed with the bitsandbytes integration in Hugging Face Transformers. When you use `load_in_4bit=True` in `BitsAndBytesConfig`, the model is automatically placed on the GPU (if available) during loading, and any subsequent call to `.cuda()` or `.to()` is unsupported and raises exactly this error.

Since you've already removed `device_map="auto"` and confirmed CUDA is detected (`torch.cuda.is_available() == True`), the problem is the `.cuda()` call after `from_pretrained()`. For 4-bit models you should not move the model to the GPU manually; bitsandbytes handles the placement internally.

Update your code to remove the `.cuda()` call entirely.
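Here is a minimal sketch of the corrected loading path, assuming the `quant_config`, `hf_token`, and `model_name` variables from your question (the `device_map="auto"` line is shown only as a commented-out option, since bitsandbytes already handles placement here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quant_config,  # your existing BitsAndBytesConfig
    # device_map="auto",               # optional: let accelerate handle placement
    token=hf_token,
)  # no .cuda() or .to() afterwards; the 4-bit weights are already on the GPU

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)

# Sanity check: confirm where the quantized weights ended up
print(next(model.parameters()).device)  # expected: cuda:0
```

If you later want multi-GPU sharding or CPU offload, pass `device_map="auto"` at load time instead of calling `.cuda()` or `.to()` on the quantized model.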