TY - GEN
T1 - TransGEC
T2 - Findings of the Association for Computational Linguistics, ACL 2023
AU - Fang, Tao
AU - Liu, Xuebo
AU - Wong, Derek F.
AU - Zhan, Runzhe
AU - Ding, Liang
AU - Chao, Lidia S.
AU - Tao, Dacheng
AU - Zhang, Min
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Data augmentation is an effective way to improve model performance of grammatical error correction (GEC). This paper identifies a critical side-effect of GEC data augmentation, which is due to the style discrepancy between the data used in GEC tasks (i.e., texts produced by non-native speakers) and data augmentation (i.e., native texts). To alleviate this issue, we propose to use an alternative data source, translationese (i.e., human-translated texts), as input for GEC data augmentation, which 1) is easier to obtain and usually has better quality than non-native texts, and 2) has a more similar style to non-native texts. Experimental results on the CoNLL14 and BEA19 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC benchmarks show that our approach consistently improves correction accuracy over strong baselines. Further analyses reveal that our approach is helpful for overcoming mainstream correction difficulties such as the corrections of frequent words, missing words, and substitution errors. Data, code, models and scripts are freely available at https://github.com/NLP2CT/TransGEC.
AB - Data augmentation is an effective way to improve model performance of grammatical error correction (GEC). This paper identifies a critical side-effect of GEC data augmentation, which is due to the style discrepancy between the data used in GEC tasks (i.e., texts produced by non-native speakers) and data augmentation (i.e., native texts). To alleviate this issue, we propose to use an alternative data source, translationese (i.e., human-translated texts), as input for GEC data augmentation, which 1) is easier to obtain and usually has better quality than non-native texts, and 2) has a more similar style to non-native texts. Experimental results on the CoNLL14 and BEA19 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC benchmarks show that our approach consistently improves correction accuracy over strong baselines. Further analyses reveal that our approach is helpful for overcoming mainstream correction difficulties such as the corrections of frequent words, missing words, and substitution errors. Data, code, models and scripts are freely available at https://github.com/NLP2CT/TransGEC.
UR - https://www.scopus.com/pages/publications/85174387119
U2 - 10.18653/v1/2023.findings-acl.223
DO - 10.18653/v1/2023.findings-acl.223
M3 - 会议稿件
AN - SCOPUS:85174387119
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 3614
EP - 3633
BT - Findings of the Association for Computational Linguistics, ACL 2023
PB - Association for Computational Linguistics (ACL)
Y2 - 9 July 2023 through 14 July 2023
ER -