I'm encountering an issue when trying to load a saved model state dictionary into my TransformerDecoderModel. The error I receive is:
RuntimeError: Error(s) in loading state_dict for TransformerDecoderModel:
size mismatch for embed.weight: copying a param with shape torch.Size([10000, 128]) from checkpoint, the shape in current model is torch.Size([6313, 128]).
size mismatch for fc.weight: copying a param with shape torch.Size([10000, 128]) from checkpoint, the shape in current model is torch.Size([6313, 128]).
size mismatch for fc.bias: copying a param with shape torch.Size([10000]) from checkpoint, the shape in current model is torch.Size([6313]).
This happens because there's a discrepancy between the vocab_size used during training and the one defined at the time of loading the model.
Here is what happens in the relevant parts of my code:

Training: the model was trained with a vocab_size of 10000.
Loading: when loading, the model is instantiated with a different vocab_size, specifically 6313. The difference is most likely due to changes in the tokenizer after training.

How can I resolve this issue? Is there a way to ensure the vocab_size stays consistent, or to adjust the loaded weights to match the new vocab_size?
Any help would be greatly appreciated!
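For context, here is a quick way to confirm which vocabulary size each artifact actually carries (just a rough check, using the same paths as the snippet below; len(tokenizer) is what my load_model uses):

import torch
from bpe_tokenizer import BpeTokenizer

checkpoint = torch.load('chat_model.pth', map_location='cpu', weights_only=True)
tokenizer = BpeTokenizer('tokenizer.json')

# In my case this prints 10000 for the checkpoint and 6313 for the tokenizer.
print('checkpoint embed rows:', checkpoint['embed.weight'].shape[0])
print('len(tokenizer):', len(tokenizer))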
Code Snippet:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import json
from bpe_tokenizer import BpeTokenizer
import os
class TransformerDecoderModel(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers):
        super(TransformerDecoderModel, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size)  # defined elsewhere in my project (not shown here)
        decoder_layer = nn.TransformerDecoderLayer(d_model=embed_size, nhead=num_heads, dim_feedforward=hidden_dim)
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.fc = nn.Linear(embed_size, vocab_size)

    def forward(self, src, tgt):
        # Embedding and positional encoding for both src and tgt,
        # scaled by sqrt(embed_size) as in the original Transformer
        src = self.embed(src) * (self.embed.embedding_dim ** 0.5)
        tgt = self.embed(tgt) * (self.embed.embedding_dim ** 0.5)
        src = self.positional_encoding(src)
        tgt = self.positional_encoding(tgt)
        out = self.transformer_decoder(tgt, src)
        out = self.fc(out)
        return out
def train_model(data_path, tokenizer_path, model_path, vocab_size, min_freq, epochs=1, batch_size=2, grad_accum_steps=16):
    if os.path.exists(tokenizer_path):
        tokenizer = BpeTokenizer(tokenizer_path)
        print("Loaded existing tokenizer.")
    else:
        tokenizer = BpeTokenizer()
        tokenizer.train([data_path], vocab_size, min_freq)
        tokenizer.save(tokenizer_path)
        print("Trained and saved tokenizer.")

    with open(data_path, 'r', encoding='utf-8') as f:
        data = json.load(f)['data']
    dataset = ChatDataset(data, tokenizer)  # defined elsewhere in my project (not shown here)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    device = torch.device('cpu')
    # Note: the model here is sized with the vocab_size argument (10000),
    # while load_model below sizes it with len(tokenizer).
    model = TransformerDecoderModel(vocab_size, embed_size=128, num_heads=2, hidden_dim=256, num_layers=2).to(device)
    criterion = nn.CrossEntropyLoss(ignore_index=-100)
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for i, batch in enumerate(dataloader):
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, input_ids)  # Assuming src and tgt are the same here
            loss = criterion(outputs.view(-1, vocab_size), labels.view(-1))
            loss = loss / grad_accum_steps
            loss.backward()
            if (i + 1) % grad_accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
            total_loss += loss.item() * grad_accum_steps
        print(f'Epoch {epoch + 1}, Loss: {total_loss / len(dataloader)}')

    torch.save(model.state_dict(), model_path)
def load_model(model_path, tokenizer_path):
    tokenizer = BpeTokenizer(tokenizer_path)
    vocab_size = len(tokenizer)
    embed_size = 128
    num_heads = 2
    hidden_dim = 256
    num_layers = 2
    model = TransformerDecoderModel(vocab_size=vocab_size, embed_size=embed_size, num_heads=num_heads, hidden_dim=hidden_dim, num_layers=num_layers)
    checkpoint = torch.load(model_path, map_location=torch.device('cpu'), weights_only=True)
    model.load_state_dict(checkpoint, strict=False)
    return model, tokenizer
if __name__ == "__main__":
    data_path = 'train_data.json'
    tokenizer_path = 'tokenizer.json'
    model_path = 'chat_model.pth'
    vocab_size = 10000
    min_freq = 2
    train_model(data_path, tokenizer_path, model_path, vocab_size, min_freq)

    # Attempt to load the trained model
    try:
        model, tokenizer = load_model(model_path, tokenizer_path)
    except RuntimeError as e:
        print(e)
Environment:
Python version: 3.10.11
PyTorch version: 2.5.1
What I tried

I tried to get the model to load despite the difference in vocabulary size by making the loading process more flexible, which meant modifying the load_model function.
Non-strict state dict loading: I first hit the size mismatch error with model.load_state_dict(checkpoint), so I changed it to model.load_state_dict(checkpoint, strict=False), hoping the mismatched parameters (embed.weight, fc.weight, and fc.bias) would be ignored and the remaining weights loaded. This did not solve the problem, because those layers are critical to the model and depend on the correct vocabulary size.

Code review: I also checked the code for other causes of the discrepancy. The tokenizer file used for training and loading is the same, and vocab_size is set to 10000 when train_model is called; load_model, however, derives vocab_size from len(tokenizer), which comes out to 6313 with the current tokenizer.

What I was expecting

I expected that with strict=False the model would load most of its weights and run without the size mismatch error. I knew this would not fix the root cause of the vocabulary size difference, but I hoped it would at least provide a usable workaround.
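As far as I can tell, strict=False only ignores missing or unexpected keys and shape mismatches still raise, so the furthest I got was filtering out the mismatched tensors myself before loading. This is only my experiment, not code from the snippet above, and it leaves embed and fc randomly initialised:

model = TransformerDecoderModel(vocab_size=6313, embed_size=128,
                                num_heads=2, hidden_dim=256, num_layers=2)
checkpoint = torch.load('chat_model.pth', map_location='cpu', weights_only=True)

# Keep only tensors whose name and shape match the freshly built model.
model_state = model.state_dict()
filtered = {k: v for k, v in checkpoint.items()
            if k in model_state and v.shape == model_state[k].shape}
print('skipped:', sorted(set(checkpoint) - set(filtered)))
# skipped: ['embed.weight', 'fc.bias', 'fc.weight']

model.load_state_dict(filtered, strict=False)
# No error anymore, but embed and fc keep their random initial weights.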
However, even after these changes, the model still does not load properly, because the embedding and output layers are tied directly to the vocabulary size. I would therefore appreciate guidance on how to keep vocab_size consistent between training and loading, or on how to adjust the loaded weights to match the new vocab_size.
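One direction I have been considering, though I am unsure it is sound because a retrained tokenizer will generally assign different ids to the same tokens, is to size the model from the checkpoint itself rather than from the current tokenizer. A rough sketch (load_model_from_checkpoint is just a name I made up for this variant):

def load_model_from_checkpoint(model_path, tokenizer_path):
    # Variant of load_model that takes vocab_size from the saved embedding
    # matrix instead of len(tokenizer), so a strict load succeeds.
    tokenizer = BpeTokenizer(tokenizer_path)
    checkpoint = torch.load(model_path, map_location='cpu', weights_only=True)
    ckpt_vocab_size = checkpoint['embed.weight'].shape[0]  # 10000 for my checkpoint
    model = TransformerDecoderModel(vocab_size=ckpt_vocab_size, embed_size=128,
                                    num_heads=2, hidden_dim=256, num_layers=2)
    model.load_state_dict(checkpoint)  # strict load, no size mismatch
    return model, tokenizer

# Caveat: the shapes line up, but this only makes sense if the tokenizer's
# token-to-id mapping is the one that was actually used during training.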