Towards Better Chinese Spelling Check for Search Engines: A New Dataset and Strong Baseline

  • Yue Wang
  • Zilong Zheng
  • Zecheng Tang
  • Juntao Li*
  • Zhihui Liu
  • Kunlong Chen
  • Jinxiong Chang
  • Qishen Zhang
  • Zhongyi Liu
  • Min Zhang

* Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Misspellings in search engine queries may prevent search engines from returning accurate results. For Chinese mobile search engines, the variety of input methods (e.g., hand-written and T9 input methods) produces more types of misspellings, making this problem more challenging. As an essential module of search engines, Chinese Spelling Check (CSC) models aim to detect and correct misspelled Chinese characters in user-issued queries. Despite the great value of CSC to search engines, there is no CSC benchmark collected from real-world search engine queries. To fill this gap, we construct and release the Alipay Search Engine Query (AlipaySEQ) spelling check dataset. To the best of our knowledge, AlipaySEQ is the first Chinese Spelling Check dataset collected from the real-world scenario of Chinese mobile search engines. It consists of 15,522 high-quality human-annotated samples and 1,175,151 automatically generated samples. To demonstrate the unique challenges of AlipaySEQ in the era of Large Language Models (LLMs), we conduct a thorough study that analyzes the differences between AlipaySEQ and the existing SIGHAN benchmarks and compares the performance of various baselines, including existing task-specific methods and LLMs. We observe that all baselines fail to perform satisfactorily because of the over-correction problem. In particular, LLMs exhibit below-par performance on AlipaySEQ, which is rather surprising. Therefore, to alleviate the over-correction problem, we introduce a model-agnostic CSC Self-Refine Framework (SRF) to construct a strong baseline. Comprehensive experiments demonstrate that our proposed SRF, though more effective than existing models on both AlipaySEQ and SIGHAN15, is still far from achieving satisfactory performance on our real-world dataset. With the newly collected real-world dataset and strong baseline, we hope more progress can be achieved on this challenging and valuable task.

Original language: English
Title of host publication: WSDM 2024 - Proceedings of the 17th ACM International Conference on Web Search and Data Mining
Publisher: Association for Computing Machinery, Inc
Pages: 769-778
Number of pages: 10
ISBN (Electronic): 9798400703713
DOIs
State: Published - 4 Mar 2024
Externally published: Yes
Event: 17th ACM International Conference on Web Search and Data Mining, WSDM 2024 - Merida, Mexico
Duration: 4 Mar 2024 to 8 Mar 2024

Publication series

Name: WSDM 2024 - Proceedings of the 17th ACM International Conference on Web Search and Data Mining

Conference

Conference: 17th ACM International Conference on Web Search and Data Mining, WSDM 2024
Country/Territory: Mexico
City: Merida
Period: 4/03/24 to 8/03/24

Keywords

  • Chinese mobile search engine
  • Chinese spelling check
  • datasets
