Abstract
Recent advances in multimodal large models (MLMs) have enabled promising progress in multimodal reasoning. However, due to multimodal misalignment and language-dominant representations, MLMs often compress or lose fine-grained visual semantics. As a result, they struggle to (1) reliably perceive task-relevant visual details and (2) effectively leverage parametric knowledge conditioned on visual evidence. These limitations ultimately lead to unreliable reasoning outcomes. To address them, we introduce a simple yet effective framework built upon visual semantic summarization. Our framework iteratively optimizes two complementary stages: one for generating high-quality visual semantic summaries containing reasoning-critical visual semantics, and another for strengthening the MLM's ability to utilize such summaries during multimodal inference. Acting as compact visual auxiliaries, the summaries guide attention toward fine-grained clues and facilitate knowledge retrieval from internal parameters. Extensive experiments on six multimodal reasoning benchmarks spanning three tasks (VQA, MABSA, and HMD) show that our framework consistently outperforms existing MLM-based methods.
| Original language | English |
|---|---|
| Pages (from-to) | 2539-2551 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Audio, Speech and Language Processing |
| Volume | 34 |
| State | Published - 2026 |
| Externally published | Yes |
Keywords
- Multimodal reasoning
- multimodal large models
- visual semantic summarization
Title: Enhancing Multimodal Reasoning via Visual Semantic Summarization