TY - GEN
T1 - CMCOQA
T2 - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
AU - Li, Zijian
AU - Zhao, Sendong
AU - Wang, Haochun
AU - Xu, Haoming
AU - Qin, Bing
AU - Liu, Ting
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - With the development of Large Language Models (LLMs), many Chinese medical benchmarks have emerged. These benchmarks have primarily used multiple-choice questions and open-ended questions as test items. However, our experimental results indicate that multiple-choice questions are not a sound way to test the capabilities of LLMs. Additionally, relatively simple open-ended questions do not effectively assess LLMs' actual grasp of medical knowledge. Therefore, we propose the Chinese Medical Complex Open-Question Answering Benchmark (CMCOQA), designed to evaluate the true medical proficiency of LLMs more accurately and efficiently by constructing complex open-ended questions within medical scenarios. The benchmark involves three evaluation dimensions: Completeness, Depth, and Professionalism. Starting with 100 manually generated complex questions as seeds, we expand the set to 1,200 using the Self-Instruct method with GPT-4o. We then have GPT-4o self-check the questions, followed by a manual screening process to ensure broad coverage and a certain level of depth. Both humans and GPT-4o score answers along these three dimensions, and we also employ automated metrics. We calculate correlations between these metrics and human scores to validate the results. Through this work, we aim for CMCOQA to further promote the development of Chinese medical LLMs in terms of medical professionalism.
KW - Chinese Medical
KW - Large Language Model
KW - Medical Professionalism
KW - Open-Ended Complex Question Answering
UR - https://www.scopus.com/pages/publications/85217280620
U2 - 10.1109/BIBM62325.2024.10821873
DO - 10.1109/BIBM62325.2024.10821873
M3 - Conference contribution
AN - SCOPUS:85217280620
T3 - Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
SP - 3402
EP - 3407
BT - Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
A2 - Cannataro, Mario
A2 - Zheng, Huiru
A2 - Gao, Lin
A2 - Cheng, Jianlin
A2 - de Miranda, Joao Luis
A2 - Zumpano, Ester
A2 - Hu, Xiaohua
A2 - Cho, Young-Rae
A2 - Park, Taesung
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 3 December 2024 through 6 December 2024
ER -