TY - GEN
T1 - ScEdit: Script-based Assessment of Knowledge Editing
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - Li, Xinye
AU - Zheng, Zunwen
AU - Zhang, Qian
AU - Zhuang, Dekai
AU - Kang, Jiabao
AU - Xu, Liyan
AU - Liu, Qingbin
AU - Chen, Xi
AU - Tu, Zhiying
AU - Chu, Dianhui
AU - Sui, Dianbo
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
AB - Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark, SCEDIT (Script-based Knowledge Editing Benchmark), which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based (“What”-type question) evaluation to action-based (“How”-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating that the task remains challenging. Our benchmark is available at https://github.com/asdfo123/ScEdit.
UR - https://www.scopus.com/pages/publications/105028579152
U2 - 10.18653/v1/2025.findings-acl.104
DO - 10.18653/v1/2025.findings-acl.104
M3 - Conference contribution
AN - SCOPUS:105028579152
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 2032
EP - 2052
BT - Findings of the Association for Computational Linguistics: ACL 2025
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -