
pytorch - Output inconsistency when using LLM batch inference compared to single input


I found that a single LLM input gets different output logits when it is merged into a batch for inference.

Additionally, I need to use inputs_embeds as the model input.

My test LLM is "Qwen/Qwen2.5-1.5B-Instruct" and the test code is below.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# load model and tokenizer
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# set model eval
model.eval()

# input texts
texts = ['a', 'b', 'c']

# tokenize
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(model.device)

# get inputs_embeds
with torch.no_grad():
    inputs_embeds = model.get_input_embeddings()(inputs.input_ids)

# get attention_mask and position_ids
attention_mask = inputs.attention_mask
position_ids = torch.arange(inputs.input_ids.shape[1], device=model.device).unsqueeze(0).expand(inputs.input_ids.shape[0], -1)

# batch
with torch.no_grad():
    output_batch = model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids
    ).logits[0]  # take the logits of the first text

# single
with torch.no_grad():
    output_single = model(
        inputs_embeds=inputs_embeds[0].unsqueeze(0),  # add a batch dimension
        attention_mask=attention_mask[0].unsqueeze(0),
        position_ids=position_ids[0].unsqueeze(0)
    ).logits[0]  # take the logits of the first text

# check consistency
is_close = torch.allclose(output_batch, output_single, atol=1e-5, rtol=1e-3)
print(is_close)

I tried all the methods DeepSeek suggested, such as setting the attention mask and position ids, but they all failed.

I want the logits of a single input text to match the ones extracted from the batch output.
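
For reference, a minimal check (not part of the original post) that quantifies how far apart the two outputs actually are, assuming the variables from the script above are still in scope:

# quantify the mismatch between batch and single-input logits
max_abs_diff = (output_batch - output_single).abs().max().item()
print(f"max abs diff: {max_abs_diff:.6e}")

# a looser tolerance than the original atol=1e-5 / rtol=1e-3
print(torch.allclose(output_batch, output_single, atol=1e-3, rtol=1e-2))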


asked Mar 18 at 22:58 by 史开源

1 Answer

This is caused by numerical issues. Running matmul operations on different input shapes yields small but nonzero differences, and these errors compound over the model's layers, producing large differences in the final output. The model in question uses bfloat16 weights, which makes it more sensitive to this type of error. As an example:

import torch

sizes = [256, 512, 1024, 1536, 2048]

for size in sizes:
    x = torch.randn(32, size, dtype=torch.bfloat16, device='cuda:0')
    layer = torch.nn.Linear(size, size, dtype=torch.bfloat16, device='cuda:0')
    y1 = layer(x[:1])
    y2 = layer(x)
    print(f"{size}, {torch.allclose(y1[0], y2[0])}")

The code above compares single vs batch inference of a single linear layer at different sizes, all using bfloat16. If you run the code multiple times, you will notice the results are not stable. Here is an example output of running several iterations:

256, True
512, True
1024, False
1536, False
2048, False

256, True
512, True
1024, False
1536, True
2048, True

256, True
512, True
1024, True
1536, False
2048, False

The scale of the numerical error depends not just on the size of inputs, but the specific input/weight values.
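
To illustrate the value dependence (this snippet is an addition, not part of the original answer), the sketch below scales the same random inputs and measures the single-vs-batch discrepancy of one bfloat16 linear layer, assuming a CUDA device is available:

import torch

torch.manual_seed(0)
layer = torch.nn.Linear(2048, 2048, dtype=torch.bfloat16, device='cuda:0')
x = torch.randn(32, 2048, dtype=torch.bfloat16, device='cuda:0')

for scale in [0.1, 1.0, 10.0, 100.0]:
    xs = x * scale
    y_single = layer(xs[:1])   # "batch" of one row
    y_batch = layer(xs)        # full batch, same weights
    diff = (y_single[0] - y_batch[0]).abs().max().item()
    print(f"scale={scale:>6}: max abs diff = {diff:.3e}")

Larger input magnitudes generally leave larger absolute residuals after the bfloat16 accumulation, so the size of the mismatch depends on the actual values, not just the shapes.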

For the model in question, you can reduce the error by loading the weights in float32 via torch_dtype=torch.float32, but even then the error will not be exactly zero.
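
A minimal sketch of that workaround (an addition to the answer, reusing the question's setup), assuming there is enough memory to hold the model in float32:

from transformers import AutoModelForCausalLM
import torch

# load the same model in float32 instead of the default bfloat16 weights
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float32,
    device_map="auto",
)
model_fp32.eval()

# rerun the batch-vs-single comparison from the question with model_fp32;
# the mismatch shrinks, but torch.allclose with a very tight atol can still fail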
