TY - GEN
T1 - An automatic Chinese collocation extraction algorithm based on lexical statistics
AU - Xu, Ruifeng
AU - Lu, Qin
AU - Li, Yin
N1 - Publisher Copyright: © 2003 IEEE.
PY - 2003
Y1 - 2003
N2 - This paper presents an automatic Chinese collocation extraction system using lexical statistics and syntactic knowledge. The system extracts collocations from a manually segmented and tagged Chinese news corpus in three stages. First, bi-directional bi-gram statistical measures, including bi-directional strength and spread and the χ² test value, are employed to extract candidate two-word pairs. These candidate word pairs are then used to extract high-frequency multi-word collocations from their context. In the third stage, precision is further improved by using syntactic knowledge of collocation patterns between content words to eliminate pseudo-collocations. In a preliminary experiment on 30 selected headwords, this three-stage system achieves a 73% precision rate, a substantial improvement on the 61% achieved using an algorithm we developed earlier based on an improved version of Smadja's 53%-accurate Xtract system.
AB - This paper presents an automatic Chinese collocation extraction system using lexical statistics and syntactic knowledge. The system extracts collocations from a manually segmented and tagged Chinese news corpus in three stages. First, bi-directional bi-gram statistical measures, including bi-directional strength and spread and the χ² test value, are employed to extract candidate two-word pairs. These candidate word pairs are then used to extract high-frequency multi-word collocations from their context. In the third stage, precision is further improved by using syntactic knowledge of collocation patterns between content words to eliminate pseudo-collocations. In a preliminary experiment on 30 selected headwords, this three-stage system achieves a 73% precision rate, a substantial improvement on the 61% achieved using an algorithm we developed earlier based on an improved version of Smadja's 53%-accurate Xtract system.
KW - Chinese collocation
KW - Information extraction and statistical models
UR - https://www.scopus.com/pages/publications/84863860753
U2 - 10.1109/NLPKE.2003.1275923
DO - 10.1109/NLPKE.2003.1275923
M3 - Conference contribution
AN - SCOPUS:84863860753
T3 - NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings
SP - 321
EP - 326
BT - NLP-KE 2003 - 2003 International Conference on Natural Language Processing and Knowledge Engineering, Proceedings
A2 - Zong, Chengqing
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2003
Y2 - 26 October 2003 through 29 October 2003
ER -