TY - GEN
T1 - An improved unknown word recognition model based on multi-knowledge source method
AU - Jiang, Wei
AU - Guan, Yi
AU - Wang, Xiao Long
PY - 2006
Y1 - 2006
N2 - Unknown word recognition (UWR) is a difficult and foundational task in lexical processing and content-based understanding, and it can improve many text-based processing applications, such as Information Extraction, Question Answering systems, and Electronic Meeting Systems. However, a unified approach makes it difficult to exploit additional domain knowledge features, so performance cannot easily be improved further, since UWR has been proved to be an NP-hard problem. This paper presents a novel method for the UWR task that divides UWR into several hard sub-tasks, which usually present different difficulties; accordingly, several language models are adopted to solve the specific sub-tasks, so as to exploit the strength of each model in addressing specific problems. First, a class-based trigram is used for basic word segmentation, aided by an absolute smoothing algorithm to overcome data sparseness; a Maximum Entropy (ME) model is used to recognize named entities; and new word detection uses variance and a Conditional Random Fields algorithm. Second, multi-knowledge features are effectively extracted and utilized throughout the processing. Our system participated in the Second International Chinese Word Segmentation Bakeoff (SIGHAN2005) and achieved an overall F-measure of 97.2% in the MSRA open test.
AB - Unknown word recognition (UWR) is a difficult and foundational task in lexical processing and content-based understanding, and it can improve many text-based processing applications, such as Information Extraction, Question Answering systems, and Electronic Meeting Systems. However, a unified approach makes it difficult to exploit additional domain knowledge features, so performance cannot easily be improved further, since UWR has been proved to be an NP-hard problem. This paper presents a novel method for the UWR task that divides UWR into several hard sub-tasks, which usually present different difficulties; accordingly, several language models are adopted to solve the specific sub-tasks, so as to exploit the strength of each model in addressing specific problems. First, a class-based trigram is used for basic word segmentation, aided by an absolute smoothing algorithm to overcome data sparseness; a Maximum Entropy (ME) model is used to recognize named entities; and new word detection uses variance and a Conditional Random Fields algorithm. Second, multi-knowledge features are effectively extracted and utilized throughout the processing. Our system participated in the Second International Chinese Word Segmentation Bakeoff (SIGHAN2005) and achieved an overall F-measure of 97.2% in the MSRA open test.
KW - Conditional random fields
KW - Maximum entropy model
KW - Out-of-vocabulary word recognition
KW - Question answer system
KW - Unknown word recognition
UR - https://www.scopus.com/pages/publications/34547548717
U2 - 10.1109/ISDA.2006.253719
DO - 10.1109/ISDA.2006.253719
M3 - Conference contribution
AN - SCOPUS:34547548717
SN - 0769525288
SN - 9780769525280
T3 - Proceedings - ISDA 2006: Sixth International Conference on Intelligent Systems Design and Applications
SP - 825
EP - 832
BT - Proceedings - ISDA 2006
T2 - ISDA 2006: Sixth International Conference on Intelligent Systems Design and Applications
Y2 - 16 October 2006 through 18 October 2006
ER -