
Enhancing Multimodal Reasoning via Visual Semantic Summarization

Jingjie Lin, Qianlong Wang*, Keyang Ding, Bingbing Wang, Xintong Song, Wenpeng Lu, Ruifeng Xu*, Min Zhang

*Corresponding author for this work

  • Harbin Institute of Technology
  • Shenzhen Technology University
  • Ministry of Education of the People's Republic of China
  • Qilu University of Technology
  • Harbin Institute of Technology Shenzhen
  • Shenzhen Loop Area Institute
  • Peng Cheng Laboratory

Research output: Contribution to journal › Article › peer-review

Abstract

Recent advances in multimodal large models (MLMs) have enabled promising progress in multimodal reasoning. However, due to multimodal misalignment and language-dominant representations, MLMs often compress or lose fine-grained visual semantics. As a result, they struggle to (1) reliably perceive task-relevant visual details and (2) effectively leverage parametric knowledge conditioned on visual evidence. These limitations ultimately lead to unreliable reasoning outcomes. To address them, we introduce a simple yet effective framework built upon visual semantic summarization. Our framework iteratively optimizes two complementary stages: one for generating high-quality visual semantic summaries containing reasoning-critical visual semantics, and another for strengthening the MLM's ability to utilize such summaries during multimodal inference. Acting as compact visual auxiliaries, the summaries guide attention toward fine-grained clues and facilitate knowledge retrieval from internal parameters. Extensive experiments on six multimodal reasoning benchmarks spanning three tasks (VQA, MABSA, and HMD) show that our framework consistently outperforms existing MLM-based methods.

Original language: English
Pages (from-to): 2539-2551
Number of pages: 13
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 34
State: Published - 2026
Externally published: Yes

Keywords

  • Multimodal reasoning
  • multimodal large models
  • visual semantic summarization
