TY - GEN
T1 - An Empirical Study of LLM-as-a-Judge for LLM Evaluation
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - Huang, Hui
AU - Bu, Xingyuan
AU - Zhou, Hongli
AU - Qu, Yingqi
AU - Liu, Jing
AU - Yang, Muyun
AU - Xu, Bing
AU - Zhao, Tiejun
N1 - Publisher Copyright: © 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Recently, there has been a growing trend of utilizing Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies have fine-tuned judge models based on open-source LLMs for evaluation. While the fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, in this work, we conduct an empirical study of LLM-as-a-Judge. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness and adaptability. We also reveal that the fine-tuned judge model inherently operates as a task-specific classifier, which imposes these limitations.
UR - https://www.scopus.com/pages/publications/105028594699
U2 - 10.18653/v1/2025.findings-acl.306
DO - 10.18653/v1/2025.findings-acl.306
M3 - Conference contribution
AN - SCOPUS:105028594699
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 5880
EP - 5895
BT - Findings of the Association for Computational Linguistics
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -