
An automatic Chinese collocation extraction algorithm based on lexical statistics

  • Hong Kong Polytechnic University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper presents an automatic Chinese collocation extraction system that uses lexical statistics and syntactic knowledge. The system extracts collocations from a manually segmented and tagged Chinese news corpus in three stages. First, bi-directional bi-gram statistical measures, including bi-directional strength and spread and the χ² test value, are employed to extract candidate two-word pairs. These candidate word pairs are then used to extract high-frequency multi-word collocations from their context. In the third stage, precision is further improved by using syntactic knowledge of collocation patterns between content words to eliminate pseudo-collocations. In a preliminary experiment on 30 selected headwords, this three-stage system achieves a precision of 73%, a substantial improvement on the 61% achieved by an algorithm we developed earlier, which was based on an improved version of Smadja's Xtract system (itself 53% accurate).
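The first stage can be illustrated with a minimal Python sketch. The abstract does not define the bi-directional variants of strength and spread, so the sketch assumes the standard Xtract-style definitions (strength as a z-score of co-occurrence frequency, spread as the variance of counts across window positions) and the usual χ² statistic for a 2×2 contingency table. The function names, the window size, and the toy tokens are hypothetical and are not taken from the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 co-occurrence table.

    o11: windows containing both words, o12: first word only,
    o21: second word only, o22: neither word.
    """
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


def cooccurrence_positions(sentences, headword, window=5):
    """For each collocate, count occurrences at each offset from the headword.

    `sentences` is a list of token lists (already segmented text).
    """
    counts = {}
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != headword:
                continue
            for offset in range(-window, window + 1):
                j = i + offset
                if offset == 0 or j < 0 or j >= len(tokens):
                    continue
                counts.setdefault(tokens[j], Counter())[offset] += 1
    return counts


def strength_and_spread(counts, window=5):
    """Return {collocate: (strength, spread)} using Xtract-style definitions.

    Assumption: strength is the z-score of a collocate's total frequency
    among all collocates of the headword; spread is the variance of its
    counts over the 2 * window positions.
    """
    totals = {w: sum(c.values()) for w, c in counts.items()}
    avg = mean(list(totals.values()))
    sd = pstdev(list(totals.values()))
    offsets = [o for o in range(-window, window + 1) if o != 0]
    result = {}
    for w, c in counts.items():
        per_pos = [c.get(o, 0) for o in offsets]
        p_bar = mean(per_pos)
        spread = sum((p - p_bar) ** 2 for p in per_pos) / len(per_pos)
        strength = (totals[w] - avg) / sd if sd > 0 else 0.0
        result[w] = (strength, spread)
    return result


# Toy usage with placeholder tokens standing in for segmented words.
toy = [["a", "b"], ["a", "c", "b"]]
scores = strength_and_spread(cooccurrence_positions(toy, "a", window=2), window=2)
```

A candidate pair would then be kept only if its strength, spread, and χ² value each exceed a chosen threshold. The thresholds used in the paper are not stated in the abstract, so none are assumed here.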

Original language: English
Title of host publication: NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings
Editors: Chengqing Zong
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 321-326
Number of pages: 6
ISBN (Electronic): 0780379020, 9780780379022
State: Published - 2003
Externally published: Yes
Event: International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003 - Beijing, China
Duration: 26 Oct 2003 to 29 Oct 2003

Publication series

Name: NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings

Conference

Conference: International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003
Country/Territory: China
City: Beijing
Period: 26/10/03 to 29/10/03

Keywords

  • Chinese collocation
  • Information extraction and statistical models
