
Improving name origin recognition with context features and unlabelled data

  • Vladimir Pervouchine*
  • Min Zhang
  • Ming Liu
  • Haizhou Li
  • *Corresponding author for this work
  • Agency for Science, Technology and Research, Singapore

Research output: Contribution to conference › Paper › peer-review

Abstract

We demonstrate the use of context features, namely names of places, and of unlabelled data for detecting the language of origin of personal names. While some early work used either rule-based methods or n-gram statistical models to determine the language of origin of a name, we treat the problem as a classification task and use a discriminative maximum entropy model. We bootstrap the learning with a list of names that appear out of context but have known origin, and then use the expectation-maximisation algorithm to further train the model on a large corpus of names of unknown origin that come with context features. Using a relatively small unlabelled corpus, we improve the accuracy of name origin recognition for names written in Chinese from 82.7% to 85.8%, a significant reduction in the error rate (from 17.3% to 14.2%). The improvement in F-score for infrequent Japanese names is even greater: from 77.4% without context features to 82.8% with context features.
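
The training procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes an EM-style loop in which a softmax (maximum entropy) classifier is first trained on labelled names and then retrained on soft labels it assigns to unlabelled names. The feature templates (character bigrams, one place feature), the hyperparameters, the function names and the romanised toy names are all hypothetical.

```python
import math

LABELS = ["CN", "JP"]  # hypothetical label set for this toy example

def features(name, place=None):
    """Character bigrams of the name, plus an optional place-name context feature."""
    padded = f"^{name.lower()}$"
    feats = [f"bg={padded[i:i + 2]}" for i in range(len(padded) - 1)]
    if place is not None:
        feats.append(f"place={place}")
    return feats

def predict(weights, feats):
    """Return P(label | feats) for each label under a softmax (maxent) model."""
    scores = [sum(weights.get((f, y), 0.0) for f in feats) for y in LABELS]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(data, epochs=200, lr=0.1, l2=0.01):
    """Stochastic gradient ascent on the L2-penalised log-likelihood.

    Each item in data is (feats, target), where target is a probability
    distribution over LABELS (one-hot for labelled names, soft otherwise).
    """
    weights = {}
    for _ in range(epochs):
        for feats, target in data:
            probs = predict(weights, feats)
            for f in feats:
                for k, y in enumerate(LABELS):
                    w = weights.get((f, y), 0.0)
                    weights[(f, y)] = w + lr * (target[k] - probs[k] - l2 * w)
    return weights

def em_train(labelled, unlabelled, rounds=5):
    """Bootstrap on labelled names, then alternate E- and M-style steps."""
    hard = [(features(n), [1.0 if y == lab else 0.0 for y in LABELS])
            for n, lab in labelled]
    weights = train(hard)  # bootstrap: names out of context, known origin
    for _ in range(rounds):
        # E-step: soft labels for names of unknown origin, using context
        soft = [(features(n, p), predict(weights, features(n, p)))
                for n, p in unlabelled]
        # M-step: retrain on the labelled and the soft-labelled names together
        weights = train(hard + soft)
    return weights

# Made-up toy data, for illustration only
labelled = [("tanaka", "JP"), ("suzuki", "JP"), ("yamada", "JP"),
            ("wang", "CN"), ("zhang", "CN"), ("liu", "CN")]
unlabelled = [("nakamura", "tokyo"), ("kobayashi", "tokyo"),
              ("zhao", "beijing"), ("chen", "beijing")]

weights = em_train(labelled, unlabelled)
for name, place in unlabelled:
    probs = predict(weights, features(name, place))
    print(name, place, dict(zip(LABELS, (round(p, 3) for p in probs))))
```

In this sketch the place feature carries no weight during the bootstrap step, because the labelled names have no context. It can only acquire weight in the later rounds, when unlabelled names that share a place are retrained together.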

Original language: English
Pages: 972-978
Number of pages: 7
State: Published - 2010
Externally published: Yes
Event: 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China
Duration: 23 Aug 2010 to 27 Aug 2010

Conference

Conference: 23rd International Conference on Computational Linguistics, Coling 2010
Country/Territory: China
City: Beijing
Period: 23/08/10 to 27/08/10
