
FastAPI + Transformers + 4-bit Mistral: ".to() is not supported for bitsandbytes 4-bit models" error


I'm deploying a FastAPI backend that uses Hugging Face Transformers with the mistralai/Mistral-7B-Instruct-v0.1 model, quantized to 4-bit via BitsAndBytesConfig. I'm running this inside an NVIDIA GPU container (CUDA 12.1, A10G GPU with 22 GB VRAM), and I keep hitting this error during model loading:

ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. 
Please use the model as it is...

What I’ve Done So Far:

  • I'm not calling .to(...) anywhere; I explicitly removed all such lines. ✅

  • I'm using quantization_config=BitsAndBytesConfig(...) with load_in_4bit=True. ✅

  • I removed device_map="auto" as per the transformers GitHub issue. ✅

  • I'm calling .cuda() only once on the model, after .from_pretrained(...), as suggested. ✅

  • The model and tokenizer are loaded from the Hugging Face Hub with HF_TOKEN properly set. ✅

  • The system detects CUDA correctly: torch.cuda.is_available() is True.

  • Finally, I cleared the Hugging Face cache (~/.cache/huggingface) and re-ran everything. ✅

Here’s the relevant part of the code that triggers the error:

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quant_config,
    device_map=None,  # explicitly set to None (removed "auto")
    token=hf_token
).cuda()  # This is the only use of `.cuda()`

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)

Yet I still get the same ValueError.


1 Answer


The ValueError persists because calling .cuda() on a 4-bit quantized model is not supported by the bitsandbytes integration in Hugging Face Transformers. When you use load_in_4bit=True in BitsAndBytesConfig, the model is automatically placed on the GPU (if available) during loading, and any subsequent call to .cuda() or .to() is rejected with exactly this error.

Since you've already removed device_map="auto" and confirmed CUDA is detected (torch.cuda.is_available() == True), the issue lies in the .cuda() call after from_pretrained(). For 4-bit models, you should avoid manually moving the model to the GPU since BitsAndBytes handles this internally.

Update your code to remove the .cuda() call entirely and the model should load without the error; see the sketch below.
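For reference, here is a minimal sketch of the corrected loading code, assuming a recent transformers/bitsandbytes setup where a load_in_4bit model is placed on the current GPU during from_pretrained(). The names quant_config, hf_token, and model_name mirror the question; the bnb_4bit_compute_dtype value is an illustrative assumption, not something taken from the original post.

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
hf_token = os.environ.get("HF_TOKEN")  # the question says HF_TOKEN is already set

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute; not specified in the question
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    token=hf_token,
)
# No .cuda() / .to() afterwards: the 4-bit weights are placed on the GPU during loading.

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)

# Optional sanity check of where the weights ended up:
print(next(model.parameters()).device)  # expect cuda:0 on the A10G

If you are on an older transformers release and the weights end up on the CPU, the supported fix is to pass device_map="auto" (or {"": 0}) to from_pretrained() rather than calling .cuda() afterwards.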
