REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes

  • Chaozheng Wang
  • Zongjie Li
  • Yun Peng
  • Shuzheng Gao
  • Sirong Chen
  • Shuai Wang
  • Cuiyun Gao*
  • Michael R. Lyu
  • *Corresponding author for this work
  • School of Computer Science and Technology, Harbin Institute of Technology
  • Hong Kong University of Science and Technology
  • Chinese University of Hong Kong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Software plays a crucial role in our daily lives, and therefore the quality and security of software systems have become increasingly important. However, vulnerabilities in software still pose a significant threat, as they can have serious consequences. Recent advances in automated program repair have sought to automatically detect and fix bugs using data-driven techniques. Sophisticated deep learning methods have been applied to this area and have achieved promising results. However, existing benchmarks for training and evaluating these techniques remain limited, as they tend to focus on a single programming language and have relatively small datasets. Moreover, many benchmarks tend to be outdated and lack diversity, focusing on a specific codebase. Worse still, the quality of bug explanations in existing datasets is low, as they typically use imprecise and uninformative commit messages as explanations. To address these issues, we propose an automated collecting framework REEF to collect REal-world vulnErabilities and Fixes from open-source repositories. We focus on vulnerabilities since they are exploitable and have serious consequences. We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs. Furthermore, we propose a neural language model-based approach to generate high-quality vulnerability explanations, which is key to producing informative fix messages. Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations. The dataset we collect contains 4,466 CVEs with 30,987 patches (including 236 CWE) across 7 programming languages with detailed related information, which is superior to existing benchmarks in scale, coverage, and quality. Evaluations by human experts further confirm that our framework produces high-quality vulnerability explanations.

Original language: English
Title of host publication: Proceedings - 2023 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1952-1962
Number of pages: 11
ISBN (Electronic): 9798350329964
State: Published - 2023
Externally published: Yes
Event: 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023 - Echternach, Luxembourg
Duration: 11 Sep 2023 → 15 Sep 2023

Publication series

Name: Proceedings - 2023 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023

Conference

Conference: 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023
Country/Territory: Luxembourg
City: Echternach
Period: 11/09/23 → 15/09/23

Keywords

  • Bug fix
  • Data collection
  • Vulnerability
