Abstract
We present a Chinese word segmentation system submitted to the first task of the CLP 2012 bake-off. Our segmenter is built on a conditional random field (CRF) sequence model. The training data combines a small number of annotated micro blogs with the People's Daily corpus. To improve the model's performance, we encode special words detected by rules, together with information extracted from unlabeled data, as features. We also derive a micro-blog-specific lexicon from automatically analyzed data and use lexicon-related features to assist the model. On the sample data for this task, these features yield a 1.8% improvement over the baseline model. Finally, our model achieves an F-score of 94.07% on the bake-off's test set.
| Original language | English |
|---|---|
| Pages | 85-89 |
| Number of pages | 5 |
| State | Published - 2012 |
| Event | 2nd CIPS-SIGHAN Joint Conference on Chinese Language Processing, CLP 2012 - Tianjin, China; Duration: 20 Dec 2012 → 21 Dec 2012 |
Conference
| Conference | 2nd CIPS-SIGHAN Joint Conference on Chinese Language Processing, CLP 2012 |
|---|---|
| Country/Territory | China |
| City | Tianjin |
| Period | 20/12/12 → 21/12/12 |
Fingerprint
Dive into the research topics of 'Micro blogs Oriented Word Segmentation System'. Together they form a unique fingerprint.