Abstract
A widely recognized next direction for large language models is to integrate and enhance multimodal capability. Current vision large language models (VLLMs) achieve impressive performance by connecting a visual encoder to an LLM through an MLP or resampler network. They also adopt parameter-efficient transfer learning methods such as adapters to reduce training costs and adapt the model to downstream tasks. However, limitations remain in the adapter, the projection, and the utilization of visual features. General adapters treat all modalities identically and therefore ignore the semantic differences between them, while existing dedicated VLLM adapters treat the modalities independently and ignore the correlation between them. Resampler projections are often complex and expensive, and MLP projections are inherently unable to change the number of tokens, which limits their application scenarios. Mainstream works focus mainly on patch visual features extracted from the CLIP image encoder and ignore the role of global visual features. In this paper, we propose a novel parameter-efficient framework that addresses these limitations. First, we propose a new dedicated cross-modal VLLM adapter that handles different modalities differently while also building the connection between them. Second, we propose a new token-configurable projection structure that inherits the efficiency of traditional MLP projections and can flexibly change the number of tokens, thus broadening the application scenarios. Finally, we explore the role of vertical global visual features extracted from all layers of CLIP and propose a global-patch-aware feature utilization solution that combines the vertical global and patch visual features.
To validate the framework, we build a new VLLM and conduct two types of experiments: an evaluation of domain-specific performance on the ScienceQA dataset, and an evaluation of the zero-shot performance of the multimodal chatbot on multiple benchmarks (AI2D, MMMU, CMMMU, InfographicVQA, and MME). In the ScienceQA experiments, our model achieves a new VLLM state-of-the-art result of 94.69%, surpassing the previous state-of-the-art VLLM result of 94.39%. In the multimodal chatbot experiments, our model achieves performance competitive with state-of-the-art chatbot models at significantly lower training cost.
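The abstract's point that an MLP projection cannot change the number of tokens can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the dimensions, the two-layer ReLU MLP, and the hypothetical `pool_tokens` helper that averages groups of tokens. It is not the token-configurable projection proposed in the paper, only a picture of why some extra step is needed to alter the token count.

```python
import numpy as np


def mlp_project(tokens, w1, w2):
    """Per-token two-layer MLP: maps (n, d_in) to (n, d_out); n never changes."""
    hidden = np.maximum(tokens @ w1, 0.0)  # ReLU activation
    return hidden @ w2


def pool_tokens(tokens, num_out):
    """Hypothetical reducer (not from the paper): average contiguous token groups."""
    groups = np.array_split(tokens, num_out, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])


rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(256, 1024))   # assumed: 256 patch tokens of width 1024
w1 = rng.normal(size=(1024, 512)) * 0.02      # assumed hidden width 512
w2 = rng.normal(size=(512, 4096)) * 0.02      # assumed LLM embedding width 4096

# A plain MLP keeps all 256 tokens.
projected = mlp_project(patch_tokens, w1, w2)

# Reducing tokens requires a separate step, here simple average pooling to 64.
reduced = mlp_project(pool_tokens(patch_tokens, 64), w1, w2)

print(projected.shape)
print(reduced.shape)
```

Because the MLP acts on each token row independently, the output always has as many rows as the input; only the pooling step in this sketch changes the count.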
| Original language | English |
|---|---|
| Article number | 133742 |
| Journal | Neurocomputing |
| Volume | 688 |
| DOIs | |
| State | Published - 1 Aug 2026 |
Keywords
- Large language models
- Multimodal
- Vision and language
Fingerprint
Dive into the research topics of 'Adapt, project and fuse: Parameter-efficient framework for vision large language models'. Together they form a unique fingerprint.