TY - GEN
T1 - Towards Better Chinese Spelling Check for Search Engines
T2 - 17th ACM International Conference on Web Search and Data Mining, WSDM 2024
AU - Wang, Yue
AU - Zheng, Zilong
AU - Tang, Zecheng
AU - Li, Juntao
AU - Liu, Zhihui
AU - Chen, Kunlong
AU - Chang, Jinxiong
AU - Zhang, Qishen
AU - Liu, Zhongyi
AU - Zhang, Min
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/3/4
Y1 - 2024/3/4
N2 - Misspellings in search engine queries may prevent search engines from returning accurate results. For Chinese mobile search engines, due to the different input methods (e.g., hand-written and T9 input methods), more types of misspellings exist, making this problem more challenging. As an essential module of search engines, Chinese Spelling Check (CSC) models aim to detect and correct misspelled Chinese characters in user-issued queries. Despite the great value of CSC to search engines, there is no CSC benchmark collected from real-world search engine queries. To fill this gap, we construct and release the Alipay Search Engine Query (AlipaySEQ) spelling check dataset. To the best of our knowledge, AlipaySEQ is the first Chinese Spelling Check dataset collected from the real-world scenario of Chinese mobile search engines. It consists of 15,522 high-quality human-annotated and 1,175,151 automatically generated samples. To demonstrate the unique challenges of AlipaySEQ in the era of Large Language Models (LLMs), we conduct a thorough study to analyze the differences between AlipaySEQ and existing SIGHAN benchmarks and compare the performance of various baselines, including existing task-specific methods and LLMs. We observe that all baselines fail to perform satisfactorily due to the over-correction problem. In particular, LLMs exhibit below-par performance on AlipaySEQ, which is rather surprising. Therefore, to alleviate the over-correction problem, we introduce a model-agnostic CSC Self-Refine Framework (SRF) to construct a strong baseline. Comprehensive experiments demonstrate that our proposed SRF, though more effective than existing models on both AlipaySEQ and SIGHAN15, is still far from achieving satisfactory performance on our real-world dataset. With the newly collected real-world dataset and strong baseline, we hope more progress can be achieved on such a challenging and valuable task.
AB - Misspellings in search engine queries may prevent search engines from returning accurate results. For Chinese mobile search engines, due to the different input methods (e.g., hand-written and T9 input methods), more types of misspellings exist, making this problem more challenging. As an essential module of search engines, Chinese Spelling Check (CSC) models aim to detect and correct misspelled Chinese characters in user-issued queries. Despite the great value of CSC to search engines, there is no CSC benchmark collected from real-world search engine queries. To fill this gap, we construct and release the Alipay Search Engine Query (AlipaySEQ) spelling check dataset. To the best of our knowledge, AlipaySEQ is the first Chinese Spelling Check dataset collected from the real-world scenario of Chinese mobile search engines. It consists of 15,522 high-quality human-annotated and 1,175,151 automatically generated samples. To demonstrate the unique challenges of AlipaySEQ in the era of Large Language Models (LLMs), we conduct a thorough study to analyze the differences between AlipaySEQ and existing SIGHAN benchmarks and compare the performance of various baselines, including existing task-specific methods and LLMs. We observe that all baselines fail to perform satisfactorily due to the over-correction problem. In particular, LLMs exhibit below-par performance on AlipaySEQ, which is rather surprising. Therefore, to alleviate the over-correction problem, we introduce a model-agnostic CSC Self-Refine Framework (SRF) to construct a strong baseline. Comprehensive experiments demonstrate that our proposed SRF, though more effective than existing models on both AlipaySEQ and SIGHAN15, is still far from achieving satisfactory performance on our real-world dataset. With the newly collected real-world dataset and strong baseline, we hope more progress can be achieved on such a challenging and valuable task.
KW - Chinese mobile search engine
KW - Chinese spelling check
KW - datasets
UR - https://www.scopus.com/pages/publications/85191739033
U2 - 10.1145/3616855.3635847
DO - 10.1145/3616855.3635847
M3 - Conference contribution
AN - SCOPUS:85191739033
T3 - WSDM 2024 - Proceedings of the 17th ACM International Conference on Web Search and Data Mining
SP - 769
EP - 778
BT - WSDM 2024 - Proceedings of the 17th ACM International Conference on Web Search and Data Mining
PB - Association for Computing Machinery, Inc
Y2 - 4 March 2024 through 8 March 2024
ER -