EIM: An effective solution for improving multi-modal large language models

  • Yuting Bai
  • Tonghua Su*
  • Zixing Bai*
  *Corresponding author for this work
  • Harbin Institute of Technology
  • Fudan University

Research output: Contribution to journal › Article › peer-review

Abstract

Equipping large language models (LLMs) with multi-modal capabilities, such as vision-language learning, has become a research hotspot and the next milestone in LLM development with the advent of models like GPT-4. Current multi-modal LLMs usually consist of three parts: an image encoder for extracting visual features, a semantic space transformation network (ST) for aligning the multi-modal semantic spaces, and an LLM for generating text. Current work on multi-modal LLMs primarily focuses on enhancing performance by using larger image encoders and LLMs and by designing more complex fine-tuning methods and STs, which increases the number of model parameters. In this paper, we propose EIM, a novel and effective solution for improving the performance of multi-modal LLMs from the perspective of the training process, a perspective that is largely overlooked in current research and that reduces the need to introduce new parameters or modify the model structure. EIM includes corresponding improvement measures for the image encoder, the ST, and the LLM. To validate EIM, we first apply it to ClipCap and conduct experiments on the COCO Caption dataset. Second, we extend EIM to multi-modal LLMs such as LLaMA-Adapter and LaVIN and evaluate them on the ScienceQA dataset. Finally, we conduct multi-modal chatbot experiments with the EIM-enhanced LaVIN and evaluate it on the MME benchmark. On the COCO Caption dataset, ClipCap_eim, a model that applies EIM to ClipCap_small, achieves a 1.75% performance improvement over ClipCap_large, which has 3.13 times as many parameters as ClipCap_eim. The results on the ScienceQA dataset and the MME benchmark show that EIM achieves competitive performance with 7B model parameters compared with 13B multi-modal LLMs, which confirms that EIM effectively improves the performance of multi-modal LLMs.

Original language: English
Article number: e0329590
Journal: PLOS ONE
Volume: 20
Issue number: 8 August
State: Published - Aug 2025
