Abstract
Annotating training corpora for named entity recognition (NER) is a costly but necessary step for supervised NER approaches. This paper presents a general framework for generating large-scale NER training data from parallel corpora. In our method, we first apply a high-performance NER system to one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word-level alignments. Finally, we propose several strategies for selecting high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.
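The projection step described in the abstract can be illustrated with a minimal sketch. It assumes BIO-tagged source tokens and alignments given as (source index, target index) pairs. The function names `project_labels` and `all_entities_aligned`, the example sentence, and the filtering rule are hypothetical illustrations. In particular, the filter is only one plausible selection heuristic and is not necessarily one of the strategies proposed in the paper.

```python
from typing import List, Optional, Tuple


def project_labels(src_tags: List[str],
                   alignment: List[Tuple[int, int]],
                   tgt_len: int) -> List[str]:
    """Project BIO tags from source tokens onto target tokens via word alignments."""
    # Entity type for each target token; None means outside any entity.
    tgt_types: List[Optional[str]] = [None] * tgt_len
    for s, t in alignment:
        tag = src_tags[s]
        if tag != "O":
            tgt_types[t] = tag.split("-", 1)[1]

    # Rebuild BIO tags on the target side: a contiguous run of tokens with
    # the same type is treated as one entity (an assumption of this sketch;
    # adjacent entities of the same type would be merged).
    tgt_tags: List[str] = []
    prev: Optional[str] = None
    for typ in tgt_types:
        if typ is None:
            tgt_tags.append("O")
        elif typ == prev:
            tgt_tags.append("I-" + typ)
        else:
            tgt_tags.append("B-" + typ)
        prev = typ
    return tgt_tags


def all_entities_aligned(src_tags: List[str],
                         alignment: List[Tuple[int, int]]) -> bool:
    """Hypothetical filter: keep a sentence pair only if every source
    entity token has at least one alignment link."""
    aligned_src = {s for s, _ in alignment}
    return all(tag == "O" or i in aligned_src
               for i, tag in enumerate(src_tags))


# Illustrative example (invented data, not taken from the paper).
src_tokens = ["Obama", "visited", "Beijing"]
src_tags = ["B-PER", "O", "B-LOC"]
tgt_tokens = ["奥巴马", "访问", "了", "北京"]
alignment = [(0, 0), (1, 1), (2, 3)]

tgt_tags = project_labels(src_tags, alignment, len(tgt_tokens))
keep = all_entities_aligned(src_tags, alignment)
```

In this example the person and location labels land on the first and last Chinese tokens, and the pair passes the filter because both English entity tokens are aligned. A real pipeline would also need to handle alignment errors and conflicting labels on the same target token, which this sketch resolves simply by letting the last link win.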
| Original language | English |
|---|---|
| Pages (from-to) | 629-641 |
| Number of pages | 13 |
| Journal | Frontiers of Computer Science |
| Volume | 8 |
| Issue number | 4 |
| State | Published - Aug 2014 |
| Externally published | Yes |
Keywords
- Chinese named entity
- named entity recognition
- parallel corpora
- training data generating
Fingerprint
Dive into the research topics of 'Generating Chinese named entity data from parallel corpora'. Together they form a unique fingerprint.