
pytorch - Output inconsistency when using LLM batch inference compared to single input


I found that a single LLM input gets different output logits when it is merged into a batch for inference.

Additionally, I need to use inputs_embeds as the model input.

My test LLM is "Qwen/Qwen2.5-1.5B-Instruct" and the test code is below.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# load model and tokenizer
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# set model eval
model.eval()

# input texts
texts = ['a', 'b', 'c']

# tokenize
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(model.device)

# get inputs_embeds
with torch.no_grad():
    inputs_embeds = model.get_input_embeddings()(inputs.input_ids)

# get attention_mask and position_ids
attention_mask = inputs.attention_mask
position_ids = torch.arange(inputs.input_ids.shape[1], device=model.device).unsqueeze(0).expand(inputs.input_ids.shape[0], -1)

# batch
with torch.no_grad():
    output_batch = model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids
    ).logits[0]  # take the logits of the first text

# single
with torch.no_grad():
    output_single = model(
        inputs_embeds=inputs_embeds[0].unsqueeze(0),  # add a batch dimension
        attention_mask=attention_mask[0].unsqueeze(0),
        position_ids=position_ids[0].unsqueeze(0)
    ).logits[0]  # take the logits of the first text

# check consistency
is_close = torch.allclose(output_batch, output_single, atol=1e-5, rtol=1e-3)
print(is_close)

I tried all the methods DeepSeek suggested, such as setting the attention mask and position ids, but they all failed.

I want the logits of a single input text to match the ones extracted from the batch output.
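
For reference, a minimal check (not part of the original post) that quantifies how far apart the two outputs actually are, assuming the variables from the script above are still in scope:

# quantify the mismatch between batch and single-input logits
max_abs_diff = (output_batch - output_single).abs().max().item()
print(f"max abs diff: {max_abs_diff:.6e}")

# a looser tolerance than the original atol=1e-5 / rtol=1e-3
print(torch.allclose(output_batch, output_single, atol=1e-3, rtol=1e-2))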


asked Mar 18 at 22:58 by 史开源

1 Answer

This is caused by numerical issues. Running matmul operations on different input shapes yields small but nonzero differences, and these errors compound over the model's layers, producing large differences in the final output. The model in question uses bfloat16 weights, which makes it more sensitive to this type of error. As an example:

import torch

sizes = [256, 512, 1024, 1536, 2048]

for size in sizes:
    x = torch.randn(32, size, dtype=torch.bfloat16, device='cuda:0')
    layer = torch.nn.Linear(size, size, dtype=torch.bfloat16, device='cuda:0')
    y1 = layer(x[:1])
    y2 = layer(x)
    print(f"{size}, {torch.allclose(y1[0], y2[0])}")

The code above compares single vs batch inference of a single linear layer at different sizes, all using bfloat16. If you run the code multiple times, you will notice the results are not stable. Here is an example output of running several iterations:

256, True
512, True
1024, False
1536, False
2048, False

256, True
512, True
1024, False
1536, True
2048, True

256, True
512, True
1024, True
1536, False
2048, False

The scale of the numerical error depends not just on the size of inputs, but the specific input/weight values.
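
To illustrate the value dependence (this snippet is an addition, not part of the original answer), the sketch below scales the same random inputs and measures the single-vs-batch discrepancy of one bfloat16 linear layer, assuming a CUDA device is available:

import torch

torch.manual_seed(0)
layer = torch.nn.Linear(2048, 2048, dtype=torch.bfloat16, device='cuda:0')
x = torch.randn(32, 2048, dtype=torch.bfloat16, device='cuda:0')

for scale in [0.1, 1.0, 10.0, 100.0]:
    xs = x * scale
    y_single = layer(xs[:1])   # "batch" of one row
    y_batch = layer(xs)        # full batch, same weights
    diff = (y_single[0] - y_batch[0]).abs().max().item()
    print(f"scale={scale:>6}: max abs diff = {diff:.3e}")

Larger input magnitudes generally leave larger absolute residuals after the bfloat16 accumulation, so the size of the mismatch depends on the actual values, not just the shapes.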

For the model in question, you can reduce the error by loading the weights in float32 via torch_dtype=torch.float32, but even then the error will not be exactly zero.
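
A minimal sketch of that workaround (an addition to the answer, reusing the question's setup), assuming there is enough memory to hold the model in float32:

from transformers import AutoModelForCausalLM
import torch

# load the same model in float32 instead of the default bfloat16 weights
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype=torch.float32,
    device_map="auto",
)
model_fp32.eval()

# rerun the batch-vs-single comparison from the question with model_fp32;
# the mismatch shrinks, but torch.allclose with a very tight atol can still fail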
