
nlp - How to Fine-Tune Projection Layer in CLIP Model Using LoRA? - Stack Overflow


I'm trying to fine-tune the projection layers in the CLIP model using LoRA.

I need help identifying the exact projection layers to modify for my fine-tuning and how I can apply LoRA to them.

Model loading:

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

Model structure when printed:

CLIP(
  (visual): VisionTransformer()
  (transformer): Transformer()
  (token_embedding): Embedding(49408, 512)
  (ln_final): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)


  • Welcome to SO. How are you loading the model? Via the original openai code? Keep in mind that the projection layers are just linear layers, which means you won't benefit (much) from classic LoRA. – cronoik Commented Mar 22 at 16:37
  • @cronoik Thank you for your comment! I am indeed using clip.load("ViT-B/32", device=device) from the standard clip library. Yes, this is just an experiment for me. I'm still a bit lost on where to apply LoRA within the model. I've tried looking for layers with "proj" in their name, but I'm not sure those are the correct projection layers for LoRA. Could you clarify which kinds of layers are typically considered "projection layers" in CLIP for LoRA fine-tuning? Knowing the layer type or position in the network flow would help me identify them accurately. – Fadela Commented Mar 26 at 0:35

1 Answer


You will not see the projection layers when you print the architecture with print(model), because the projection layers are initialized with nn.Parameter() in the OpenAI CLIP repo (unlike the Hugging Face implementation, which uses linear layers). The code references can be found here:

  • visual projection layer: code
  • text projection layer: code
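
For orientation, each of these projections is just a weight matrix applied with a plain matrix product, so mathematically it is a bias-free linear layer. Here is a rough equivalence sketch, reusing the model object loaded above (purely illustrative, nothing in it is needed for the solution below):

import torch
import torch.nn as nn

# The OpenAI checkpoint applies the visual projection as features @ model.visual.proj.
# nn.Linear computes x @ weight.T, so an equivalent linear layer stores proj.T as its weight.
proj = model.visual.proj                                    # nn.Parameter of shape [768, 512]
as_linear = nn.Linear(proj.shape[0], proj.shape[1], bias=False)
with torch.no_grad():
    as_linear.weight.copy_(proj.t())                        # same mapping as x @ proj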

You can still print the parameters initialized with nn.Parameter:

for name, param in model.named_parameters():
    print(f'{name}: {param.shape}')

Output:

text_projection: torch.Size([512, 512])
visual.proj: torch.Size([768, 512])
...
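
If you just want to confirm which of those entries are the projections in question, you can filter the list by name (a small convenience sketch, using the names shown in the output above):

# Pull out only the two projection weights by their parameter names.
proj_params = {name: p for name, p in model.named_parameters()
               if name in ("visual.proj", "text_projection")}
print({name: tuple(p.shape) for name, p in proj_params.items()})
# e.g. {'text_projection': (512, 512), 'visual.proj': (768, 512)}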

The issue you face now is that nn.Parameter is not supported by peft/LoRA (explanation). You can either modify the CLIP code to use nn.Linear instead of nn.Parameter (a sketch of that option follows at the end of this answer) or use the CLIP implementation from Hugging Face (mind the different layer names):

from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

transformers_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

config = LoraConfig(
    target_modules=["visual_projection", "text_projection"],
)

peft_model = get_peft_model(transformers_model, config)
peft_model.print_trainable_parameters()

Output:

trainable params: 18,432 || all params: 151,295,745 || trainable%: 0.0122
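
As a sanity check on that number: with LoraConfig's default rank r = 8, LoRA adds 768·8 + 8·512 = 10,240 trainable parameters for visual_projection and 512·8 + 8·512 = 8,192 for text_projection, which is exactly the 18,432 reported above.

If you prefer the first option instead (editing the OpenAI CLIP source so the projections become modules you can attach LoRA to), a minimal sketch of the kind of drop-in you could write yourself looks like the following. The class and its names are illustrative, not part of the clip package or of peft:

import torch
import torch.nn as nn

class LoRAProjection(nn.Module):
    """Hypothetical replacement for the nn.Parameter projections in the OpenAI CLIP code.
    Computes x @ (W + A @ B * scaling), with the pretrained W frozen and only A, B trainable."""
    def __init__(self, proj: torch.Tensor, r: int = 8, alpha: int = 8):
        super().__init__()
        in_dim, out_dim = proj.shape                          # e.g. 768 x 512 for visual.proj
        self.weight = nn.Parameter(proj.detach().clone(), requires_grad=False)  # frozen pretrained W
        self.lora_A = nn.Parameter(torch.zeros(in_dim, r))
        self.lora_B = nn.Parameter(torch.zeros(r, out_dim))   # B starts at zero -> no change at init
        nn.init.normal_(self.lora_A, std=0.01)
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight + (x @ self.lora_A) @ self.lora_B * self.scaling

You would then replace the two matrix products x @ self.proj and x @ self.text_projection in the CLIP forward passes with calls to such a module. The Hugging Face route above avoids that surgery, which is why it is usually the simpler choice.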