
GFMLLM: Enhance multi-modal large language model for global and fine-grained visual spatial perception

  • Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Previously, Multimodal Large Language Models (MLLMs) with visual spatial perception capabilities could only handle detection tasks and produce coarse-grained perceptions through language responses. Achieving fine-grained outputs (e.g., masks) required additional task-specific expert models and loss terms, which inevitably increases model size, adds training overhead, and implicitly weakens the scalability of the MLLM. To this end, we present GFMLLM, a multimodal large language model that performs global and fine-grained visual perception without the assistance of expert models. It leverages a unified framework to produce spatially detailed feedback in linguistic form. Technically, we propose a unified textual response paradigm, called the perception text matrix, which exploits the dense query mechanism of the MLLM and intuitively reflects the detailed spatial distribution of target objects within an image. By further processing the information encoded in the matrix, the approach extends naturally to the referring expression comprehension (REC) task, or can be combined with an expert model to handle the more fine-grained referring expression segmentation (RES) task. Moreover, to address the high generation latency caused by the long text of the perception text matrix, we propose a token aggregation technique that significantly reduces the inference latency of GFMLLM. Experimental results on public datasets validate the effectiveness of our approach.
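As a rough illustration of how such a matrix could be post-processed for REC, here is a minimal Python sketch. The abstract does not specify the matrix format, so this assumes the perception text matrix is emitted as a textual grid of 0/1 tokens in which 1 marks cells covered by the target; the function `decode_matrix_to_box` is hypothetical and is not from the paper.

```python
import numpy as np

def decode_matrix_to_box(matrix_text: str, img_w: int, img_h: int):
    """Parse a textual 0/1 grid and return a pixel bounding box (REC-style output).

    Assumption: each line of the model's response is a row of space-separated
    0/1 tokens; 1 marks grid cells occupied by the referred object.
    """
    rows = [list(map(int, line.split())) for line in matrix_text.strip().splitlines()]
    grid = np.array(rows)
    ys, xs = np.nonzero(grid)
    if len(xs) == 0:
        return None  # target absent from the image
    gh, gw = grid.shape
    # Scale the occupied grid-cell extents back to image coordinates.
    x1, y1 = xs.min() * img_w / gw, ys.min() * img_h / gh
    x2, y2 = (xs.max() + 1) * img_w / gw, (ys.max() + 1) * img_h / gh
    return (x1, y1, x2, y2)

matrix = """
0 0 0 0
0 1 1 0
0 1 1 0
0 0 0 0
"""
print(decode_matrix_to_box(matrix, img_w=640, img_h=480))
# (160.0, 120.0, 480.0, 360.0)
```

Under the same assumption, the RES case would pass the decoded grid (rather than its bounding box) to an expert segmentation model as a spatial prior.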

Original language: English
Article number: 130239
Journal: Expert Systems with Applications
Volume: 299
DOIs
State: Published - 1 Mar 2026

Keywords

  • Multimodal large language models
  • Referring expression comprehension
  • Referring expression segmentation

