
Adapt, project and fuse: Parameter-efficient framework for vision large language models

  • Yuting Bai
  • Tonghua Su*
  • Zixing Bai
  • *Corresponding author for this work
  • Harbin Institute of Technology
  • Fudan University

Research output: Contribution to journal › Article › peer-review

Abstract

Integrating and enhancing multimodal capability is widely recognized as the next development direction for large language models. Current vision large language models (VLLMs) achieve impressive performance by connecting a visual encoder to an LLM through an MLP or a resampler network. They also adopt parameter-efficient transfer learning methods such as adapters to reduce training costs and adapt the model to downstream tasks. However, limitations remain in the adapter, the projection, and the use of visual features. General adapters treat all modalities identically and so ignore the semantic differences between them, while existing dedicated VLLM adapters treat modalities independently and ignore the correlation between them. Resampler projections are often complex and expensive, and MLP projections cannot change the number of tokens, which limits their application scenarios. Mainstream work focuses on patch visual features extracted from the CLIP image encoder and ignores the role of global visual features. In this paper, we propose a novel parameter-efficient framework that addresses these limitations. First, we propose a new dedicated cross-modal VLLM adapter that handles different modalities differently while building the connection between them. Second, we propose a new token-configurable projection structure that inherits the efficiency of traditional MLP projections and can flexibly change the number of tokens, thus broadening the application scenarios. Finally, we explore the role of CLIP vertical global visual features, extracted from all layers of CLIP, and propose a global-patch-aware feature utilization solution that combines the vertical global and patch visual features.
To validate the framework, we build a new VLLM and conduct two types of experiments: we evaluate domain-specific performance on the ScienceQA dataset, and we evaluate the zero-shot performance of the multimodal chatbot on multiple benchmarks: AI2D, MMMU, CMMMU, InfographicVQA, and MME. In the ScienceQA experiments, our model achieves a new VLLM state-of-the-art result of 94.69%, exceeding the previous state-of-the-art VLLM result of 94.39%. In the multimodal chatbot experiments, our model achieves competitive performance compared with state-of-the-art chatbot models at significantly lower training cost.
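
The token-count limitation of MLP projections mentioned in the abstract can be illustrated with a small shape check. This is only a sketch under stated assumptions: the abstract does not describe the authors' token-configurable projection, so `token_mixing_projection` and all dimensions below are hypothetical stand-ins showing one generic way a projection could change the token count, not the paper's design.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Random weights; values are placeholders, only the shapes matter here."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def mlp_projection(tokens, w_in, w_out):
    """Per-token two-layer MLP with ReLU: (N, d_vis) -> (N, d_llm).
    Each token is mapped independently, so N cannot change."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(tokens, w_in)]
    return matmul(hidden, w_out)

def token_mixing_projection(tokens, w_tokens, w_in, w_out):
    """Hypothetical variant (an assumption, not the paper's method):
    mix along the token axis first, (N, d_vis) -> (M, d_vis),
    then apply the same per-token MLP."""
    mixed = matmul(w_tokens, tokens)  # (M, N) @ (N, d_vis) -> (M, d_vis)
    return mlp_projection(mixed, w_in, w_out)

rng = random.Random(0)
# Illustrative sizes only: 6 patch tokens reduced to 2 output tokens.
n_tokens, m_tokens, d_vis, d_hidden, d_llm = 6, 2, 4, 8, 5
patches = random_matrix(n_tokens, d_vis, rng)
w_in = random_matrix(d_vis, d_hidden, rng)
w_out = random_matrix(d_hidden, d_llm, rng)
w_tokens = random_matrix(m_tokens, n_tokens, rng)

plain = mlp_projection(patches, w_in, w_out)
mixed = token_mixing_projection(patches, w_tokens, w_in, w_out)
print(len(plain), len(plain[0]))  # token count unchanged: 6 5
print(len(mixed), len(mixed[0]))  # token count set by w_tokens: 2 5
```

The plain MLP output always has as many rows as the input has tokens, whereas the mixing variant's row count is set by the shape of `w_tokens`.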

Original language: English
Article number: 133742
Journal: Neurocomputing
Volume: 688
State: Published - 1 Aug 2026

Keywords

  • Large language models
  • Multimodal
  • Vision and language
