Commit bb2e110

rishab-32 authored and mallamanis committed
add sharma ICPC 2022 papers
1 parent bac9255 commit bb2e110

2 files changed, +26 -0 lines changed

_publications/sharma2022an.markdown

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
---
layout: publication
title: "An Exploratory Study on Code Attention in BERT"
authors: Rishab Sharma, Fuxiang Chen, Fatemeh H. Fard, David Lo
conference: ICPC
year: 2022
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/2204.10200"}
  - {name: "code", url: "https://github.com/fardfh-lab/Code-Attention-BERT"}
tags: ["Transformer", "representation", "language model", "interpretability", "pretraining", "clone"]
---
Many recent approaches in software engineering introduce deep neural models based on the Transformer architecture or use Transformer-based Pre-trained Language Models (PLMs) trained on code. Although these models achieve state-of-the-art results on many downstream tasks such as code summarization and bug detection, they are based on Transformers and PLMs, which have mainly been studied in the Natural Language Processing (NLP) field. Current studies rely on reasoning and practices from NLP when applying these models to code, despite the differences between natural languages and programming languages, and there is limited literature explaining how code is modeled. Here, we investigate the attention behavior of PLMs on code and compare it with natural language. We pre-trained BERT, a Transformer-based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We ran several experiments to analyze the attention values that code constructs assign to each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the [CLS] token, which is the most attended token in NLP. This observation motivated us to leverage identifiers, rather than the [CLS] token, to represent the code sequence for code clone detection. Our results show that employing identifier embeddings increases the performance of BERT by 605% and 4% in F1-score in its lower and upper layers, respectively. When identifier embeddings are used in CodeBERT, a code-based PLM, the F1-score of clone detection improves by 21--24%. These findings can benefit the research community by encouraging code-specific representations instead of the common embeddings used in NLP, and they open new directions for developing smaller models with similar performance.
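The identifier-pooling idea in this abstract is concrete enough to sketch. Below is a minimal, hypothetical illustration (not the authors' released code at fardfh-lab/Code-Attention-BERT) of representing a code fragment by mean-pooling the hidden states of identifier-looking tokens instead of taking the [CLS] vector, then scoring a candidate clone pair by cosine similarity. The model name `microsoft/codebert-base` and the regex-based identifier heuristic are assumptions; the paper identifies code constructs more carefully than a regex can.

```python
# Minimal sketch: pool identifier-token embeddings instead of [CLS]
# when comparing two code fragments for clone detection.
import re
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # assumed stand-in for the code-based PLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Crude stand-in for "identifier" detection: word-like tokens.
# It also matches keywords; a real pipeline would use a lexer or the AST.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def code_embedding(code: str) -> torch.Tensor:
    """Represent code by mean-pooling identifier-token hidden states."""
    enc = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # The RoBERTa-style tokenizer marks word-initial pieces with "Ġ"; strip it.
    mask = torch.tensor([bool(IDENTIFIER.match(t.lstrip("Ġ"))) for t in tokens])
    pooled = hidden[mask] if mask.any() else hidden           # fall back to all tokens
    return pooled.mean(dim=0)

def clone_score(code_a: str, code_b: str) -> float:
    """Cosine similarity between identifier-pooled representations."""
    a, b = code_embedding(code_a), code_embedding(code_b)
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

print(clone_score("def add(x, y): return x + y",
                  "def plus(a, b): return a + b"))
```

The same pooling could be applied to an individual layer's hidden states (requested via `output_hidden_states=True`) to mirror the lower- versus upper-layer comparison the abstract reports.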
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
---
layout: publication
title: "LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition"
authors: Rishab Sharma, Fuxiang Chen, Fatemeh H. Fard
conference: ICPC
year: 2022
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/2204.09654"}
  - {name: "code", url: "https://github.com/fardfh-lab/LAMNER"}
tags: ["summarization", "documentation", "language model", "types", "representation"]
---
Code comment generation is the task of generating a high-level natural language description for a given code method/function. Although researchers have studied multiple ways to generate code comments automatically, previous work mainly represents a code token only through its overall semantics (e.g., a language model is used to learn the semantics of a code token), with additional code properties such as the tree structure of the code included as an auxiliary input to the model. This has two limitations: 1) learning the code token as a whole may not capture the information in source code succinctly, and 2) the code token does not carry additional syntactic information, which is inherently important in programming languages. In this paper, we present LAnguage Model and Named Entity Recognition (LAMNER), a code comment generator capable of encoding code constructs effectively and capturing the structural property of a code token. A character-level language model is used to learn the semantic representation that encodes a code token. For the structural property of a token, a Named Entity Recognition model is trained to learn the different types of code tokens. These representations are then fed into an encoder-decoder architecture to generate code comments. We evaluate the generated comments from LAMNER and other baselines on a popular Java dataset with four commonly used metrics. Our results show that LAMNER is effective and improves over the best baseline model in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr by 14.34%, 18.98%, 21.55%, 23.00%, 10.52%, 1.44%, and 25.86%, respectively. Additionally, we fused LAMNER's code representation with the baseline models, and the fused models consistently showed improvement over the non-fused models. A human evaluation further shows that LAMNER produces high-quality code comments.
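As a rough illustration of the fused representation this abstract describes, the sketch below combines a character-level encoding of each code token with an embedding of its (NER-predicted) token type before feeding the sequence to an encoder. All dimensions, vocabularies, the made-up tag set, and the GRU encoder are placeholder assumptions, not the released LAMNER implementation at fardfh-lab/LAMNER.

```python
# Minimal sketch: fuse a character-level token encoding with a token-type
# embedding, then encode the token sequence for comment generation.
import torch
import torch.nn as nn

class FusedTokenEncoder(nn.Module):
    def __init__(self, n_chars=128, n_types=16, char_dim=32,
                 char_hidden=64, type_dim=32, enc_hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # BiLSTM over characters yields a "semantic" vector per code token.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 batch_first=True, bidirectional=True)
        # Embedding of the token's type as predicted by a separate NER model
        # (e.g. identifier / operator / keyword); the tag set here is invented.
        self.type_emb = nn.Embedding(n_types, type_dim)
        self.encoder = nn.GRU(2 * char_hidden + type_dim, enc_hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, char_ids, type_ids):
        # char_ids: (batch, seq_len, max_chars)   type_ids: (batch, seq_len)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, c, -1)
        _, (h, _) = self.char_lstm(chars)                 # h: (2, b*s, char_hidden)
        token_sem = h.transpose(0, 1).reshape(b, s, -1)   # (b, s, 2*char_hidden)
        fused = torch.cat([token_sem, self.type_emb(type_ids)], dim=-1)
        outputs, _ = self.encoder(fused)                  # a decoder would attend here
        return outputs

# Toy shapes only: 2 functions, 5 tokens each, up to 10 characters per token.
enc = FusedTokenEncoder()
out = enc(torch.randint(1, 128, (2, 5, 10)), torch.randint(0, 16, (2, 5)))
print(out.shape)  # torch.Size([2, 5, 512])
```

Concatenation is just one simple fusion choice; the abstract also reports fusing LAMNER's representation with baseline models, which this toy module does not attempt.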
