Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

  • Harbin Institute of Technology
  • Sun Yat-Sen University

Research output: Contribution to journal; Article; peer-review

Abstract

Recent multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this approach LLMs for Vision, since it employs LLMs for visual understanding and reasoning. However, we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, a direction we regard as Vision Enhancing LLMs. In this paper, we propose MKS2, an approach that enhances LLMs by empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce Modular Visual Memory (MVM), a component integrated into the internal blocks of LLMs and designed to store open-world visual information efficiently. Additionally, we present a soft Mixture of Multimodal Experts (MoMEs) architecture in LLMs to invoke multimodal knowledge collaboration during text generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts that require physical or commonsense knowledge. It also delivers competitive results on image-text understanding multimodal benchmarks.
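
The abstract does not specify how the soft mixture is computed, so the following is only an illustrative sketch of a generic soft mixture of two experts, not the MKS2 implementation. It assumes that one feed-forward expert stands in for the LLM's textual pathway and a second one stands in for a visual-memory pathway, and that a learned gate blends their outputs with softmax weights. All names (`FeedForwardExpert`, `SoftMixture`), dimensions, and the random initialization are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


def make_matrix(rows, cols, rng):
    """Random matrix used as a placeholder for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class FeedForwardExpert:
    """Two-layer feed-forward block (hypothetical stand-in for one expert)."""

    def __init__(self, dim, hidden, rng):
        self.w_in = make_matrix(hidden, dim, rng)
        self.w_out = make_matrix(dim, hidden, rng)

    def __call__(self, x):
        return linear(self.w_out, relu(linear(self.w_in, x)))


class SoftMixture:
    """Blends a textual expert and a visual-memory expert with a soft gate.

    Assumption: every expert is always evaluated and the outputs are
    averaged with softmax weights (no hard top-k routing).
    """

    def __init__(self, dim, hidden, rng):
        self.experts = [
            FeedForwardExpert(dim, hidden, rng),  # textual pathway (assumed)
            FeedForwardExpert(dim, hidden, rng),  # visual-memory pathway (assumed)
        ]
        self.gate = make_matrix(len(self.experts), dim, rng)

    def __call__(self, x):
        weights = softmax(linear(self.gate, x))
        outputs = [expert(x) for expert in self.experts]
        mixed = [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))
        ]
        return mixed, weights


rng = random.Random(0)
layer = SoftMixture(dim=4, hidden=8, rng=rng)
mixed, weights = layer([0.1, -0.2, 0.3, 0.4])
```

In this sketch the gate weights always sum to one, so the output for each token is a convex combination of the two experts' outputs. A trained model would learn the gate and expert parameters rather than sample them at random.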

Original language: English
Pages (from-to): 858-871
Number of pages: 14
Journal: IEEE Transactions on Image Processing
Volume: 35
State: Published - 2026
Externally published: Yes

Keywords

  • Multimodal large language model
  • image-text understanding
  • vision enhancing LLM
