
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

  • Central South University
  • Soochow University
  • Harbin Institute of Technology
  • National University of Singapore

Research output: Contribution to journal › Conference article › peer-review

Abstract

Large Vision-Language Models (LVLMs) have recently demonstrated remarkable success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-only output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Unlike traditional MCoT benchmarks, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operations. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection, which comprehensively cover complex visual operations and concise expression in real-world scenarios. We evaluate various LVLMs and strategies on CoMT, revealing key insights into the capabilities and limitations of current approaches. We hope that CoMT can inspire further breakthroughs in introducing multi-modal generation into the reasoning process.

Original language: English
Pages (from-to): 23678-23686
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 39
Issue number: 22
DOIs
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 - 4 Mar 2025

