
预训练增强的代码克隆检测技术

Translated title of the contribution: Clone Detection with Pre-training Enhanced Code Representation
  • Lin Shan Leng
  • Shuang Liu*
  • Cheng Lin Tian
  • Shu Jie Dou
  • Zan Wang
  • Mei Shan Zhang
  • *Corresponding author for this work
  • Tianjin University

Research output: Contribution to journal, Article, peer-review

Abstract

Code clone detection is an important task in software engineering. Detecting type-IV code clones, which have similar semantics but large syntactic differences, is particularly difficult. Deep learning-based approaches have achieved promising performance on type-IV clone detection, but at the high cost of requiring manually annotated clone pairs for supervision. This study proposes two simple and effective pretraining strategies to enhance representation learning in deep learning-based clone detection models, aiming to reduce the need for large-scale training datasets in supervised learning. First, token embedding models are pretrained with n-gram subword enhancement, which helps the clone detection model better represent out-of-vocabulary (OOV) tokens. Second, function name prediction is adopted as an auxiliary task to pretrain the clone detection model's parameters from the token level to the code-fragment level. With these two strategies, a model with more accurate code representation capability is obtained; it is then used as the code representation model in clone detection and trained on the clone detection task with supervised learning. Experiments are conducted on the standard benchmark datasets BigCloneBench (BCB) and OJClone. The results show that the final model, trained with only a very small number of instances (100 clones and 100 non-clones for BCB; 200 clones and 200 non-clones for OJClone), achieves performance comparable to existing methods trained with over six million instances.
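The first strategy, n-gram subword enhancement, can be illustrated with a minimal sketch. The abstract does not describe the actual implementation, so everything below is an assumption: a fastText-style scheme in which a token is split into character n-grams, each n-gram is hashed into a fixed table, and the token vector is the average of the n-gram vectors. The names `char_ngrams`, `bucket_of`, `embed_token`, the constants `DIM` and `BUCKETS`, and the randomly initialised `NGRAM_TABLE` (a stand-in for pretrained weights) are all hypothetical.

```python
import hashlib
import random

DIM = 8          # embedding size (illustrative assumption)
BUCKETS = 1000   # number of hashed n-gram slots (illustrative assumption)


def char_ngrams(token, n_min=3, n_max=5):
    """Return the character n-grams of a token wrapped in boundary markers."""
    wrapped = f"<{token}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def bucket_of(ngram):
    """Map an n-gram to a bucket index; md5 keeps the mapping stable across runs."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


# Stand-in for a pretrained n-gram table: random vectors from a fixed seed.
_rng = random.Random(0)
NGRAM_TABLE = [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(BUCKETS)]


def embed_token(token):
    """Average the vectors of a token's n-grams; works for unseen tokens too."""
    grams = char_ngrams(token)
    if not grams:                      # e.g. the empty string has no n-grams
        return [0.0] * DIM
    vec = [0.0] * DIM
    for gram in grams:
        row = NGRAM_TABLE[bucket_of(gram)]
        for k in range(DIM):
            vec[k] += row[k]
    return [v / len(grams) for v in vec]


# An identifier never seen during pretraining still receives a vector,
# and it shares n-grams (hence vector components) with related identifiers.
shared = set(char_ngrams("readFile")) & set(char_ngrams("readFileBuffer"))
oov_vector = embed_token("readFileBuffer")
```

Under these assumptions, an OOV identifier such as `readFileBuffer` is represented through subwords it shares with known identifiers rather than through a single unknown-token vector, which is the intuition the abstract attributes to the subword enhancement.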

Original language: Chinese (Traditional)
Pages (from-to): 1758-1773
Number of pages: 16
Journal: Ruan Jian Xue Bao/Journal of Software
Volume: 33
Issue number: 5
State: Published - May 2022
Externally published: Yes
