
Generating Chinese named entity data from parallel corpora

  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Annotating named entity recognition (NER) training corpora is a costly but necessary process for supervised NER approaches. This paper presents a general framework to generate large-scale NER training data from parallel corpora. In our method, we first run a high-performance NER system on one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word-level alignments. Finally, we propose several strategies to select high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.
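The projection step described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's implementation: the BIO tag scheme, the `(source_index, target_index)` alignment format, the function names `project_labels` and `all_entities_aligned`, and the example sentence pair. The filter shown is one simple hypothetical heuristic; the abstract does not specify the selection strategies the authors actually use.

```python
from typing import List, Optional, Tuple


def project_labels(
    src_tags: List[str],
    alignments: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens to target tokens via word alignments."""
    # Entity type assigned to each target token (None means outside any entity).
    # If several source tokens align to one target token, the last one wins.
    tgt_types: List[Optional[str]] = [None] * tgt_len
    for src_idx, tgt_idx in alignments:
        tag = src_tags[src_idx]
        if tag != "O":
            tgt_types[tgt_idx] = tag.split("-", 1)[1]

    # Rebuild BIO tags on the target side. Simplification: adjacent target
    # tokens with the same type are merged into a single entity.
    tgt_tags: List[str] = []
    prev: Optional[str] = None
    for typ in tgt_types:
        if typ is None:
            tgt_tags.append("O")
        elif typ == prev:
            tgt_tags.append("I-" + typ)
        else:
            tgt_tags.append("B-" + typ)
        prev = typ
    return tgt_tags


def all_entities_aligned(
    src_tags: List[str], alignments: List[Tuple[int, int]]
) -> bool:
    """Hypothetical filter: keep a pair only if every entity token is aligned."""
    aligned_src = {src_idx for src_idx, _ in alignments}
    return all(tag == "O" or i in aligned_src for i, tag in enumerate(src_tags))


# Illustrative sentence pair (invented for this sketch).
english = ["Obama", "visited", "Beijing"]
english_tags = ["B-PER", "O", "B-LOC"]
chinese = ["奥巴马", "访问", "了", "北京"]
word_alignments = [(0, 0), (1, 1), (2, 3)]

if all_entities_aligned(english_tags, word_alignments):
    chinese_tags = project_labels(english_tags, word_alignments, len(chinese))
    print(list(zip(chinese, chinese_tags)))
```

In practice the source-side tags would come from an existing NER system and the alignments from a word aligner; both are supplied by hand here so the sketch stays self-contained.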

Original language: English
Pages (from-to): 629-641
Number of pages: 13
Journal: Frontiers of Computer Science
Volume: 8
Issue number: 4
DOIs
State: Published - Aug 2014
Externally published: Yes

Keywords

  • Chinese named entity
  • named entity recognition
  • parallel corpora
  • training data generating
