
Breaking corpus bottleneck for context-aware neural machine translation with cross-task pre-training

  • Linqing Chen
  • Junhui Li*
  • Zhengxian Gong
  • Boxing Chen
  • Weihua Luo
  • Min Zhang
  • Guodong Zhou
  • *Corresponding author for this work
  • Soochow University
  • Alibaba Group Holding Ltd.

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Context-aware neural machine translation (NMT) remains challenging due to the lack of large-scale document-level parallel datasets. To break this corpus bottleneck, we aim to improve context-aware NMT by taking advantage of both large-scale sentence-level parallel datasets and source-side monolingual documents. To this end, we propose two pre-training tasks. One learns to translate a sentence from the source language to the target language on the sentence-level parallel dataset, while the other learns to translate a document from a deliberately noised version back to the original on the monolingual documents. Importantly, the two pre-training tasks are learned jointly and simultaneously by the same model, which is thereafter fine-tuned on scale-limited parallel documents from both sentence-level and document-level perspectives. Experimental results on four translation tasks show that our approach significantly improves translation performance. One nice property of our approach is that the fine-tuned model can be used to translate both sentences and documents.
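
The joint pre-training setup described in the abstract can be illustrated with a minimal data-side sketch. It is not the authors' implementation: the abstract does not say how documents are noised or how the two tasks are mixed, so the token-masking noise, the `<mask>` symbol, the `doc_ratio` mixing parameter, and all function names below are assumptions made purely for illustration.

```python
import random
from typing import Iterator, List, Sequence, Tuple

# One training example: (input tokens, output tokens).
Example = Tuple[List[str], List[str]]

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def add_noise(tokens: List[str], rng: random.Random, mask_prob: float = 0.15) -> List[str]:
    """Replace each token with MASK with probability mask_prob (assumed noise scheme)."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def denoising_example(document: List[List[str]], rng: random.Random) -> Example:
    """Build a (noised document -> original document) pair from a monolingual document."""
    original = [tok for sentence in document for tok in sentence]
    return add_noise(original, rng), original


def mixed_pretraining_stream(
    parallel: Sequence[Example],
    documents: Sequence[List[List[str]]],
    rng: random.Random,
    doc_ratio: float = 0.5,
) -> Iterator[Example]:
    """Interleave both tasks so that a single model sees them during the same pre-training run."""
    while True:
        if rng.random() < doc_ratio:
            # Task 2: document-level denoising on source-side monolingual documents.
            yield denoising_example(rng.choice(documents), rng)
        else:
            # Task 1: sentence-level translation on the parallel dataset.
            yield rng.choice(parallel)
```

Because both tasks are emitted as plain (input, output) token pairs, the same sequence-to-sequence model could consume either one without any task-specific branching, which is one simple way to read "jointly and simultaneously learned via the same model".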

Original language: English
Title of host publication: ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 2851-2861
Number of pages: 11
ISBN (Electronic): 9781954085527
DOIs
State: Published - 2021
Externally published: Yes
Event: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021 - Virtual, Online
Duration: 1 Aug 2021 → 6 Aug 2021

Publication series

Name: ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
Volume: 1

Conference

Conference: Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021
City: Virtual, Online
Period: 1/08/21 → 6/08/21
