TY - GEN
T1 - LCQMC
T2 - 27th International Conference on Computational Linguistics, COLING 2018
AU - Liu, Xin
AU - Chen, Qingcai
AU - Deng, Chong
AU - Zeng, Huajun
AU - Chen, Jing
AU - Li, Dongfang
AU - Tang, Buzhou
N1 - Publisher Copyright:
© 2018 COLING 2018 - 27th International Conference on Computational Linguistics, Proceedings. All rights reserved.
PY - 2018
Y1 - 2018
N2 - The lack of large-scale question matching corpora greatly limits the development of matching methods in question answering (QA) system, especially for non-English languages. To ameliorate this situation, in this paper, we introduce a large-scale Chinese question matching corpus (named LCQMC), which is released to the public1. LCQMC is more general than paraphrase corpus as it focuses on intent matching rather than paraphrase. How to collect a large number of question pairs in variant linguistic forms, which may present the same intent, is the key point for such corpus construction. In this paper, we first use a search engine to collect large-scale question pairs related to high-frequency words from various domains, then filter irrelevant pairs by the Wasserstein distance, and finally recruit three annotators to manually check the left pairs. After this process, a question matching corpus that contains 260,068 question pairs is constructed. In order to verify the LCQMC corpus, we split it into three parts, i.e., a training set containing 238,766 question pairs, a development set with 8,802 question pairs, and a test set with 12,500 question pairs, and test several well-known sentence matching methods on it. The experimental results not only demonstrate the good quality of LCQMC, but also provide solid baseline performance for further researches on this corpus.
AB - The lack of large-scale question matching corpora greatly limits the development of matching methods in question answering (QA) system, especially for non-English languages. To ameliorate this situation, in this paper, we introduce a large-scale Chinese question matching corpus (named LCQMC), which is released to the public1. LCQMC is more general than paraphrase corpus as it focuses on intent matching rather than paraphrase. How to collect a large number of question pairs in variant linguistic forms, which may present the same intent, is the key point for such corpus construction. In this paper, we first use a search engine to collect large-scale question pairs related to high-frequency words from various domains, then filter irrelevant pairs by the Wasserstein distance, and finally recruit three annotators to manually check the left pairs. After this process, a question matching corpus that contains 260,068 question pairs is constructed. In order to verify the LCQMC corpus, we split it into three parts, i.e., a training set containing 238,766 question pairs, a development set with 8,802 question pairs, and a test set with 12,500 question pairs, and test several well-known sentence matching methods on it. The experimental results not only demonstrate the good quality of LCQMC, but also provide solid baseline performance for further researches on this corpus.
UR - https://www.scopus.com/pages/publications/85117727630
M3 - 会议稿件
AN - SCOPUS:85117727630
T3 - COLING 2018 - 27th International Conference on Computational Linguistics, Proceedings
SP - 1952
EP - 1962
BT - COLING 2018 - 27th International Conference on Computational Linguistics, Proceedings
A2 - Bender, Emily M.
A2 - Derczynski, Leon
A2 - Isabelle, Pierre
PB - Association for Computational Linguistics (ACL)
Y2 - 20 August 2018 through 26 August 2018
ER -