Micro blogs Oriented Word Segmentation System

  • Harbin Institute of Technology
  • Huazhong University of Science and Technology

Research output: Contribution to conference › Paper › peer-review

Abstract

We present a Chinese word segmentation system submitted to the first task of the CLP 2012 bake-off. Our segmenter is built on a conditional random field (CRF) sequence model. The training data combine a small set of annotated micro blogs with the People's Daily corpus. We encode special words detected by rules, together with information extracted from unlabeled data, as features to improve the model's performance. We also derive a micro-blog-specific lexicon from automatically analyzed data and use lexicon-related features to assist the model. On the sample data of this task, these features yield a 1.8% improvement over the baseline model. Finally, our model achieves an F-score of 94.07% on the bake-off's test set.
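The abstract does not spell out the tag set or feature templates, so the following is only a minimal sketch of the general character-tagging formulation commonly used for CRF-based segmenters, not the authors' implementation. The B/M/E/S tag scheme, the context-window features, the `lex_start_len` lexicon feature, and all function names are assumptions introduced for illustration. The last function shows how a word-level F-score of the kind reported above is typically computed from gold and predicted segmentations.

```python
from typing import Dict, List, Set, Tuple


def words_to_bmes(words: List[str]) -> List[str]:
    """Map a segmented sentence to per-character B/M/E/S tags (assumed scheme)."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def bmes_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):  # a word ends here
            words.append(buf)
            buf = ""
    if buf:  # tolerate a sequence that ends mid-word
        words.append(buf)
    return words


def char_features(chars: List[str], i: int, lexicon: Set[str],
                  max_len: int = 4) -> Dict[str, object]:
    """Hypothetical feature template for the character at position i."""
    def at(j: int) -> str:
        return chars[j] if 0 <= j < len(chars) else "<PAD>"

    feats: Dict[str, object] = {
        "c0": at(i),
        "c-1": at(i - 1),
        "c+1": at(i + 1),
        "c-1c0": at(i - 1) + at(i),
        "c0c+1": at(i) + at(i + 1),
    }
    # Lexicon-related feature (assumption): length of the longest
    # lexicon entry that starts at position i, or 0 if there is none.
    longest = 0
    for n in range(2, max_len + 1):
        if i + n <= len(chars) and "".join(chars[i:i + n]) in lexicon:
            longest = n
    feats["lex_start_len"] = longest
    return feats


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F-score: a word counts as correct if both boundaries match."""
    def spans(words: List[str]) -> Set[Tuple[int, int]]:
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(g)
    return 2 * precision * recall / (precision + recall)
```

In a sketch like this, the per-character feature dictionaries and tag sequences would be handed to a CRF toolkit for training; the lexicon passed to `char_features` stands in for the micro-blog-specific lexicon the abstract describes.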

Original language: English
Pages: 85-89
Number of pages: 5
State: Published - 2012
Event: 2nd CIPS-SIGHAN Joint Conference on Chinese Language Processing, CLP 2012 - Tianjin, China
Duration: 20 Dec 2012 → 21 Dec 2012

Conference

Conference: 2nd CIPS-SIGHAN Joint Conference on Chinese Language Processing, CLP 2012
Country/Territory: China
City: Tianjin
Period: 20/12/12 → 21/12/12
