
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

  • Yi Chen
  • Yuying Ge*
  • Yixiao Ge
  • Mingyu Ding
  • Bohao Li
  • Rui Wang
  • Ruifeng Xu
  • Ying Shan
  • Xihui Liu*

*Corresponding author for this work

  • The University of Hong Kong
  • Tencent
  • Tencent PCG
  • University of California at Berkeley
  • Peng Cheng Laboratory

Research output: Contribution to journal, article, peer-reviewed

Abstract

The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, generalization capabilities, and proficiency in processing multimodal inputs. A crucial milestone in the evolution of AGI is the attainment of human-level planning, a fundamental ability for making informed decisions in complex environments and solving a wide range of real-world problems. Despite the impressive advancements in MLLMs, a question remains: How far are current MLLMs from achieving human-level planning? To shed light on this question, we introduce EgoPlan-Bench, a comprehensive benchmark to evaluate the planning abilities of MLLMs in real-world scenarios from an egocentric perspective, mirroring human perception. EgoPlan-Bench emphasizes the evaluation of planning capabilities of MLLMs, featuring realistic tasks, diverse action plans, and intricate visual observations. Our rigorous evaluation of a wide range of MLLMs reveals that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. To facilitate this advancement, we further present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench. We have made all code, data, and a maintained benchmark leaderboard available at https://chenyi99.github.io/ego_plan/ to advance future research.

Original language: English
Article number: 118
Journal: International Journal of Computer Vision
Volume: 134
Issue number: 3
State: Published - Mar 2026
Externally published: Yes

Keywords

  • Egocentric video
  • Human-level planning
  • Multimodal large language model benchmark
  • Real-world scenario
