Experiments on the use of corpus-based word BI-gram in Chinese word segmentation

  • Ruifeng Xu*
  • Daniel Yeung

*Corresponding author for this work

Research output: Contribution to journal > Conference article > peer-review

Abstract

The first step in Chinese language processing is to segment a sentence into a sequence of words, because Chinese text has no explicit delimiters between adjacent words. This paper adopts an efficient corpus-based statistical method to address this problem: word bigram statistics derived from a corpus are used to resolve segmentation ambiguities. To segment a sentence, bidirectional maximum matching is first applied as a pre-matching step to generate segmentation candidates and locate possible ambiguities. Statistical measures based on word bigram information and word frequency are then used to construct a discriminant function, which is applied to the ambiguous strings to select the most likely correct segmentation. The experimental results are analyzed to describe the features and limitations of this approach, and preliminary results indicate that it compares favorably with other existing techniques.
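
The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, since the abstract does not give the paper's actual discriminant function: the function names, the toy lexicon and counts, the maximum word length of 4, and the add-one smoothed log-probability score are all hypothetical. For brevity, the sketch also scores whole candidate segmentations rather than only the ambiguous substrings.

```python
import math

def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest match; unknown single characters pass through."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left longest match; result is returned in reading order."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words

def score(words, unigram, bigram, total):
    """Assumed scoring rule: add-one smoothed word-frequency and bigram log-probability."""
    vocab = len(unigram) + 1
    s = math.log((unigram.get(words[0], 0) + 1) / (total + vocab))
    for prev, cur in zip(words, words[1:]):
        s += math.log((bigram.get((prev, cur), 0) + 1) / (unigram.get(prev, 0) + vocab))
    return s

def segment(text, lexicon, unigram, bigram):
    """Pre-match in both directions; if the results differ, keep the higher-scoring one."""
    fwd = forward_max_match(text, lexicon)
    bwd = backward_max_match(text, lexicon)
    if fwd == bwd:          # both directions agree: no ambiguity detected
        return fwd
    total = sum(unigram.values())
    return max((fwd, bwd), key=lambda w: score(w, unigram, bigram, total))

# Toy data (hypothetical counts, not taken from the paper's corpus).
unigram = {"研究": 50, "研究生": 10, "生命": 40, "命": 5, "起源": 20}
bigram = {("研究", "生命"): 8, ("生命", "起源"): 6}
lexicon = set(unigram)

print(segment("研究生命起源", lexicon, unigram, bigram))  # ['研究', '生命', '起源']
```

On this toy input, forward matching yields 研究生 / 命 / 起源 while backward matching yields 研究 / 生命 / 起源; the disagreement marks the ambiguity, and the bigram counts favor the backward result.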

Original language: English
Pages (from-to): 4222-4227
Number of pages: 6
Journal: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics
Volume: 5
State: Published - 1998
Externally published: Yes
Event: Proceedings of the 1998 IEEE International Conference on Systems, Man, and Cybernetics. Part 3 (of 5) - San Diego, CA, USA
Duration: 11 Oct 1998 to 14 Oct 1998
