deep learning - MultiModal Cross attention - Stack Overflow

programmeradmin · 2025-03-19 · 4 views · 0 comments

I am working with two embeddings, one for text and one for images. Both are the last_hidden_state of transformer models (BERT and ViT), so their shapes are (batch, seq, embed_dim). I want to feed the text information into the image representation using a cross-attention mechanism, and I was wondering whether these lines of code will give me what I need:

cross_attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, dropout=0.1)
attn_output, attn_output_weights = cross_attention(text_last, img_last, img_last)

I tried the code above, but I am not sure whether it is the right approach.
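
Two details of nn.MultiheadAttention matter for this goal. First, the module defaults to batch_first=False and then expects (seq, batch, embed_dim), so tensors shaped (batch, seq, embed_dim) need batch_first=True. Second, the output has one row per query token, so letting the image receive text information means passing the image as the query and the text as key and value. The following is a minimal sketch under those assumptions, not a definitive implementation: the random tensors stand in for the real BERT and ViT outputs, and the sequence lengths (16 text tokens, 197 image tokens) are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes for illustration only; replace with the real model outputs.
batch, text_len, img_len, embed_dim = 2, 16, 197, 768

# Stand-ins for the BERT and ViT last_hidden_state tensors.
text_last = torch.randn(batch, text_len, embed_dim)
img_last = torch.randn(batch, img_len, embed_dim)

# batch_first=True makes the module read inputs as (batch, seq, embed_dim).
# With the default (False) it expects (seq, batch, embed_dim).
cross_attention = nn.MultiheadAttention(
    embed_dim=768, num_heads=12, dropout=0.1, batch_first=True
)

# Image tokens are the queries; text tokens supply the keys and values,
# so every image token gathers information from the text sequence.
attn_output, attn_output_weights = cross_attention(
    query=img_last, key=text_last, value=text_last
)

print(attn_output.shape)          # (batch, img_len, embed_dim)
print(attn_output_weights.shape)  # (batch, img_len, text_len), averaged over heads
```

If the text batch is padded, the padded positions can be excluded by also passing key_padding_mask, a boolean tensor of shape (batch, text_len) in which True marks positions to ignore. With a BERT-style attention_mask (1 for real tokens, 0 for padding), that would be attention_mask == 0.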
