We all know that LoRA is a low-rank adaptation method. Using the row-vector convention of the code below, it can be formulated as h = x W_0 + s · (x A B), where A is the low-rank down-projection, B is the up-projection, and s is a scaling factor (α/r in the original paper). I have two different implementations of the low-rank branch. Are there any differences between them?
Code 1:
def forward(self, x):
    # Two sequential matmuls: (x @ A) @ B, never materializing A @ B
    x = x @ self.lora_A
    x = x @ self.lora_B
    x = self.scaling * x
    return x
Code 2:
def forward(self, x):
    # Materialize the rank-r product A @ B first, then apply it: x @ (A @ B)
    x = x @ (self.lora_A @ self.lora_B)
    x = self.scaling * x
    return x
From a mathematical perspective, both seem equivalent: (x A) B = x (A B) by associativity of matrix multiplication. However, when I run both implementations on a toy dataset, I observe a very slight difference in their performance: Code 2 performs slightly better. I'm not completely sure whether both implementations are correct; I often see Code 1 in GitHub repositories, yet in my runs Code 2 is the one that does slightly better. Why might this slight difference occur? Is there an underlying computational or optimization nuance that could explain it?
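For reference, here is a minimal check of whether the two orderings even produce bitwise-identical outputs in floating point; the shapes, rank, and seed are arbitrary choices for illustration:

import torch

torch.manual_seed(0)
x = torch.randn(32, 512)           # a batch of inputs
A = torch.randn(512, 8) * 0.01     # down-projection, rank 8
B = torch.randn(8, 512)            # up-projection

out1 = (x @ A) @ B     # Code 1: two sequential matmuls
out2 = x @ (A @ B)     # Code 2: materialize A @ B first

# Matrix multiplication is associative in exact arithmetic but not in
# floating point, so the two results can differ by a tiny amount.
print(torch.allclose(out1, out2))
print((out1 - out2).abs().max())

If the max difference comes out nonzero, that would point to floating-point non-associativity as one source of divergence, though I don't see why that alone would systematically favor one ordering over the other.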