
How to Identify Similar Code Parts Using CodeBERT Embeddings?


I'm using CodeBERT to compare how similar two pieces of code are. For example:

# Code 1
def calculate_area(radius):
    return 3.14 * radius * radius

# Code 2
def compute_circle_area(r):
    return 3.14159 * r * r

CodeBERT produces embeddings, numerical vectors that describe each snippet. I then compare these vectors to measure similarity, and this works well for telling me how alike the two pieces of code are overall.
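
A minimal sketch of this kind of embedding comparison, assuming the Hugging Face transformers library and the microsoft/codebert-base checkpoint (the [CLS]-token pooling here is just one illustrative choice, not necessarily the asker's exact code):

import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: microsoft/codebert-base loaded through Hugging Face transformers.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code):
    # Encode the snippet and take the first ([CLS]) token's last hidden state
    # as a single vector summarizing the whole snippet.
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]

code1 = "def calculate_area(radius): return 3.14 * radius * radius"
code2 = "def compute_circle_area(r): return 3.14159 * r * r"

# One overall similarity score for the pair of snippets.
score = torch.nn.functional.cosine_similarity(embed(code1), embed(code2))
print(score.item())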

However, I can't tell which parts of the code CodeBERT considers similar. Because the embeddings are high-dimensional vectors, I can't easily see what CodeBERT is focusing on, and comparing the code token by token doesn't work here.

My question is: How can I figure out which specific parts of two code snippets CodeBERT considers similar, beyond just getting a general similarity score?

I tried simple diff-based methods, but that defeats the purpose of relying purely on CodeBERT. I want to know whether this is possible using CodeBERT alone.


  • Please add your codebert code to retrieve the embeddings. – cronoik Commented Mar 22 at 16:27

1 Answer


With vanilla BERT, we can use bertviz's neuron view, which visualizes the intermediate representations used to compute attention.

from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

# Load the special neuron-view copy of BERT that exposes the query/key vectors.
model_type = 'bert'
model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=True)

# Visualize the attention between the tokens of the two snippets.
show(model, model_type, tokenizer,
     "def calculate_area(radius): return 3.14 * radius * radius",
     "def compute_circle_area(r): return 3.14159 * r * r",
     layer=4, head=3)

This launches an interactive visualizer in which you can choose the layer and attention head to view; the width of each line is proportional to its attention weight.

You can try to make it work with CodeBERT, which uses the RoBERTa architecture.
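
A rough sketch of what that adaptation could look like, assuming bertviz's bundled RoBERTa neuron-view classes accept the microsoft/codebert-base checkpoint name; if loading by name fails with this older code path, you may need to download the weights and point from_pretrained at a local directory instead:

from bertviz.transformers_neuron_view import RobertaModel, RobertaTokenizer
from bertviz.neuron_view import show

# Assumption: CodeBERT's weights load into the neuron-view RoBERTa classes.
model_version = 'microsoft/codebert-base'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)

# Same call as above, but with model_type 'roberta' and the two code snippets.
show(model, 'roberta', tokenizer,
     "def calculate_area(radius): return 3.14 * radius * radius",
     "def compute_circle_area(r): return 3.14159 * r * r",
     layer=4, head=3)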
