TY - GEN
T1 - Towards Benchmarking Situational Awareness of Large Language Models: Comprehensive Benchmark, Evaluation and Analysis
AU - Tang, Guo
AU - Chu, Zheng
AU - Zheng, Wenxiang
AU - Liu, Ming
AU - Qin, Bing
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - Situational awareness refers to the capacity to perceive and comprehend the present context and anticipate forthcoming events, which plays a critical role in aiding decision-making, anticipating potential issues, and adapting to dynamic circumstances. Nevertheless, the situational awareness capabilities of large language models have not yet been comprehensively assessed. To address this, we propose SA-Bench, a comprehensive benchmark that covers three tiers of situational awareness capabilities, covering environment perception, situation comprehension and future projection. SA-Bench provides a comprehensive evaluation to explore the situational awareness capabilities of LLMs. We conduct extensive experiments on advanced LLMs, including GPT-4, LLaMA3, Qwen1.5, among others. Our experimental results indicate that even SOTA LLMs still exhibit substantial capability gaps compared to humans. In addition, we thoroughly analyze and examine the challenges encountered by LLMs across various tasks, as well as emphasize the deficiencies they confront. We hope SA-Bench will foster research within the field of situational awareness.
UR - https://www.scopus.com/pages/publications/85217619562
U2 - 10.18653/v1/2024.findings-emnlp.464
DO - 10.18653/v1/2024.findings-emnlp.464
M3 - Conference contribution
AN - SCOPUS:85217619562
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
SP - 7904
EP - 7928
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
T2 - 2024 Findings of the Association for Computational Linguistics, EMNLP 2024
Y2 - 12 November 2024 through 16 November 2024
ER -