
Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code

  • Yujia Chen
  • Cuiyun Gao*
  • Zezhou Yang
  • Hongyu Zhang
  • Qing Liao
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen
  • Chongqing University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In the field of code intelligence, effectively modeling long-range code poses a significant challenge. Existing pre-trained language models (PLMs) such as UniXcoder have achieved remarkable success, but they still face difficulties with long code inputs. This is mainly due to their limited capacity to maintain contextual continuity and to memorize key information over long-range code. To alleviate these difficulties, we propose EXPO, a framework for EXtending Pre-trained language models for lOng-range code. EXPO incorporates two memory mechanisms proposed in this paper: Bridge Memory and Hint Memory. Bridge Memory uses a tagging mechanism to connect disparate snippets of long-range code, helping the model maintain contextual coherence. Hint Memory focuses on crucial code elements throughout the global context, such as package imports, by integrating a kNN attention layer to adaptively select the relevant code elements. This dual-memory approach bridges the gap between understanding local code snippets and maintaining global code coherence, thereby enhancing the model's overall comprehension of long code sequences. We validate the effectiveness of EXPO on five popular pre-trained language models, including UniXcoder, and on two code intelligence tasks: API recommendation and vulnerability detection. Experimental results demonstrate that EXPO significantly improves the pre-trained language models.
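The Hint Memory idea of selecting a small set of globally relevant code elements via nearest-neighbor attention can be illustrated with a minimal sketch. The function name, cosine-similarity measure, and plain-list representation below are illustrative assumptions for exposition, not the paper's implementation.

```python
import math

def knn_attention(query, memory_keys, memory_values, k=2):
    """Sketch of kNN attention: pick the k memory entries whose keys
    are most similar to the query (cosine similarity), then attend
    (softmax-weighted average) over only those selected entries."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Rank all memory keys by similarity to the query; keep the top k.
    sims = sorted(
        ((cos(query, key), i) for i, key in enumerate(memory_keys)),
        reverse=True,
    )[:k]

    # Softmax over the selected similarities only.
    exps = [math.exp(s) for s, _ in sims]
    z = sum(exps)
    weights = [e / z for e in exps]

    # Weighted average of the corresponding memory values.
    dim = len(memory_values[0])
    out = [0.0] * dim
    for w, (_, i) in zip(weights, sims):
        for d in range(dim):
            out[d] += w * memory_values[i][d]
    return out
```

With `k=1` this reduces to retrieving the single most similar memory value; larger `k` blends the few most relevant entries while ignoring the rest of the (potentially long) context.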

Original language: English
Title of host publication: ISSTA 2024 - Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis
Editors: Maria Christakis, Michael Pradel
Publisher: Association for Computing Machinery, Inc
Pages: 274-286
Number of pages: 13
ISBN (Electronic): 9798400706127
DOIs
State: Published - 11 Sep 2024
Externally published: Yes
Event: 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024 - Vienna, Austria
Duration: 16 Sep 2024 - 20 Sep 2024

Publication series

Name: ISSTA 2024 - Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis

Conference

Conference: 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024
Country/Territory: Austria
City: Vienna
Period: 16/09/24 - 20/09/24

Keywords

  • API recommendation
  • Pre-trained language model
  • code representation
  • long-range code
  • vulnerability detection
