
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

  • Zaijing Li
  • Yuquan Xie
  • Rui Shao*
  • Gongwei Chen
  • Dongmei Jiang
  • Liqiang Nie*
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen
  • Peng Cheng Laboratory

Research output: Contribution to journal › Conference article › peer-review

Abstract

Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts to train Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft. Please see the project page at https://cybertronagent.github.io/Optimus-2.github.io/.
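
The idea of consolidating a variable-length observation-action history into fixed-length behavior tokens can be illustrated with a small sketch. This is not the paper's architecture: it assumes a simple cross-attention pooling in which a fixed set of query vectors attends over per-timestep features, and every name here (`fuse_step`, `consolidate`, `make_behavior_tokens`, the random queries) is hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_step(obs, act):
    # Assumption: one timestep is represented by concatenating the
    # observation features with the action features.
    return list(obs) + list(act)


def consolidate(history, queries):
    """Cross-attention pooling: each query yields exactly one output token,
    so the output length depends on len(queries), not on len(history)."""
    dim = len(queries[0])
    tokens = []
    for query in queries:
        weights = softmax([dot(query, step) / math.sqrt(dim) for step in history])
        token = [sum(w * step[i] for w, step in zip(weights, history))
                 for i in range(dim)]
        tokens.append(token)
    return tokens


def make_behavior_tokens(observations, actions, queries):
    history = [fuse_step(o, a) for o, a in zip(observations, actions)]
    return consolidate(history, queries)


if __name__ == "__main__":
    random.seed(0)
    obs_dim, act_dim, num_tokens = 4, 2, 3
    # Hypothetical stand-in for learned queries: random vectors.
    queries = [[random.gauss(0, 1) for _ in range(obs_dim + act_dim)]
               for _ in range(num_tokens)]
    for steps in (5, 50):
        obs = [[random.random() for _ in range(obs_dim)] for _ in range(steps)]
        act = [[random.random() for _ in range(act_dim)] for _ in range(steps)]
        tokens = make_behavior_tokens(obs, act, queries)
        print(steps, "timesteps ->", len(tokens), "tokens of size", len(tokens[0]))
```

Whatever the history length, the number of tokens returned equals the number of queries, which is the property that lets a downstream language model consume a long trajectory at a fixed cost.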

Original language: English
Pages (from-to): 9039-9049
Number of pages: 11
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
State: Published - 2025
Externally published: Yes
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States
Duration: 11 Jun 2025 to 15 Jun 2025

Keywords

  • multimodal large language model
  • open-world agent
