Automatic acquisition of large-scale academic bilingual parallel corpus from the web

  • Yong Han*
  • Yu Li
  • Xiaoning He
  • Muyun Yang
  • Guohua Lei
  • *Corresponding author for this work
  • Heilongjiang Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

In this paper, we describe a system that automatically acquires a large-scale Chinese-English bilingual parallel corpus from the China Journals Full-text Database (CJFD), a component of the China National Knowledge Infrastructure (CNKI). The system extracts a large amount of parallel text with domain information from the structured bilingual texts already present in CJFD, such as the Chinese and English abstracts and titles of academic articles. The acquired Chinese-English parallel corpus is several orders of magnitude larger than comparable corpora known to us. In addition, the system collects a large number of bilingual terms that can be applied directly to lexical acquisition.

Original language: English
Title of host publication: 2009 International Conference on Asian Language Processing
Subtitle of host publication: Recent Advances in Asian Language Processing, IALP 2009
Pages: 318-321
Number of pages: 4
State: Published - 2009
Event: 2009 International Conference on Asian Language Processing: Recent Advances in Asian Language Processing, IALP 2009 - Singapore, Singapore
Duration: 7 Dec 2009 to 9 Dec 2009

Publication series

Name: 2009 International Conference on Asian Language Processing: Recent Advances in Asian Language Processing, IALP 2009

Conference

Conference: 2009 International Conference on Asian Language Processing: Recent Advances in Asian Language Processing, IALP 2009
Country/Territory: Singapore
City: Singapore
Period: 7/12/09 to 9/12/09

Keywords

  • Bilingual parallel corpora acquisition
  • Bilingual term acquisition
  • Data mining
