
TransGEC: Improving Grammatical Error Correction with Translationese

  • Tao Fang
  • Xuebo Liu*
  • Derek F. Wong*
  • Runzhe Zhan
  • Liang Ding
  • Lidia S. Chao
  • Dacheng Tao
  • Min Zhang
  • *Corresponding author for this work
  • NLP
  • University of Macau
  • Harbin Institute of Technology Shenzhen
  • JD Explore Academy
  • The University of Sydney

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Data augmentation is an effective way to improve model performance of grammatical error correction (GEC). This paper identifies a critical side-effect of GEC data augmentation, which is due to the style discrepancy between the data used in GEC tasks (i.e., texts produced by non-native speakers) and data augmentation (i.e., native texts). To alleviate this issue, we propose to use an alternative data source, translationese (i.e., human-translated texts), as input for GEC data augmentation, which 1) is easier to obtain and usually has better quality than non-native texts, and 2) has a more similar style to non-native texts. Experimental results on the CoNLL14 and BEA19 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC benchmarks show that our approach consistently improves correction accuracy over strong baselines. Further analyses reveal that our approach is helpful for overcoming mainstream correction difficulties such as the corrections of frequent words, missing words, and substitution errors. Data, code, models and scripts are freely available at https://github.com/NLP2CT/TransGEC.

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics, ACL 2023
Publisher: Association for Computational Linguistics (ACL)
Pages: 3614-3633
Number of pages: 20
ISBN (Electronic): 9781959429623
State: Published - 2023
Externally published: Yes
Event: Findings of the Association for Computational Linguistics, ACL 2023 - Toronto, Canada
Duration: 9 Jul 2023 to 14 Jul 2023

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: Findings of the Association for Computational Linguistics, ACL 2023
Country/Territory: Canada
City: Toronto
Period: 9/07/23 to 14/07/23
