I'm using CodeBERT to compare how similar two pieces of code are. For example:
# Code 1
def calculate_area(radius):
    return 3.14 * radius * radius
# Code 2
def compute_circle_area(r):
    return 3.14159 * r * r
CodeBERT produces "embeddings": numerical vectors that represent each piece of code. I then compare these vectors to see how similar the two snippets are. This works well for telling me how much the snippets are alike.
However, I can't tell which parts of the code CodeBERT thinks are similar. Because the embeddings are high-dimensional vectors, I can't easily see what CodeBERT is focusing on, and comparing the code word by word doesn't work here.
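For concreteness, a comparison of this kind might look like the sketch below. It is an assumption, not the code actually used here: it presumes the Hugging Face transformers library, the microsoft/codebert-base checkpoint, mean pooling over the token vectors, and cosine similarity as the score. The helper names embed and cosine_similarity are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed(code, model_name="microsoft/codebert-base"):
    """Mean-pool the last hidden states into one vector (assumed pooling)."""
    # Imported lazily so cosine_similarity works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, hidden_size)
    # Average over the token dimension to get a single vector per snippet.
    return hidden.mean(dim=1).squeeze(0).tolist()


# Usage (downloads the model on first call):
# score = cosine_similarity(embed(code_1), embed(code_2))
```

Mean pooling collapses every token into one vector, which is why a score computed this way cannot say which tokens contributed to it.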
My question is: How can I figure out which specific parts of two code snippets CodeBERT considers similar, beyond just getting a general similarity score?
I tried simple diff methods but that defeats the purpose of purely using CodeBERT. I want to know if it's possible using CodeBERT alone.
Asked Mar 20 at 14:30 by NepNep; edited Mar 22 at 8:33 by Sandipan Dey.
Comment: Please add your CodeBERT code to retrieve the embeddings. – cronoik, Mar 22 at 16:27
1 Answer
With vanilla BERT, we can use bertviz's neuron view, which visualizes the intermediate representations that are used to compute attention.
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show
model_type = 'bert'
model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=True)
# Pass the two snippets as sentence A and sentence B, then pick a layer and head.
show(
    model, model_type, tokenizer,
    "def calculate_area(radius): return 3.14 * radius * radius",
    "def compute_circle_area(r): return 3.14159 * r * r",
    layer=4, head=3,
)
This outputs an interactive visualizer in which you can choose the layer and attention head to view. The width of each line is proportional to the attention weight.
You can try to adapt this to CodeBERT.
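One possible way to do that without bertviz is to read the attention weights directly, sketched below under several assumptions: the Hugging Face transformers library, the microsoft/codebert-base checkpoint (a RoBERTa-style model), and the same layer and head indices as above. The helper names codebert_attention, cross_attention and top_pairs are hypothetical. Depending on the transformers version, returning attentions may also require passing attn_implementation="eager" to from_pretrained.

```python
def cross_attention(attn, len_a):
    """Keep only the rows of snippet A and the columns of snippet B.

    attn[i][j] is how much token i attends to token j.
    """
    return [row[len_a:] for row in attn[:len_a]]


def top_pairs(block, tokens_a, tokens_b, k=5):
    """Return the k (weight, token_a, token_b) triples with the largest weight."""
    scored = [
        (w, tokens_a[i], tokens_b[j])
        for i, row in enumerate(block)
        for j, w in enumerate(row)
    ]
    return sorted(scored, key=lambda t: t[0], reverse=True)[:k]


def codebert_attention(code_a, code_b, layer=4, head=3,
                       model_name="microsoft/codebert-base"):
    """Encode both snippets as one pair and return one head's attention matrix."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_attentions=True)
    model.eval()
    inputs = tokenizer(code_a, code_b, return_tensors="pt")
    with torch.no_grad():
        # Tuple with one (1, heads, seq, seq) tensor per layer.
        attentions = model(**inputs).attentions
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # Pairs are encoded as <s> A </s></s> B </s>; the first </s> closes snippet A.
    len_a = tokens.index(tokenizer.sep_token) + 1
    return attentions[layer][0, head].tolist(), tokens, len_a


# Usage (downloads the model on first call):
# attn, tokens, len_a = codebert_attention(code_1, code_2)
# block = cross_attention(attn, len_a)
# print(top_pairs(block, tokens[:len_a], tokens[len_a:]))
```

The highest-weight pairs show where tokens of the first snippet attend in the second one for that particular layer and head. Attention indicates where the model looks, which is related to similarity but not the same thing, so it is worth checking several layers and heads before drawing conclusions.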