Group-Relative Visual Discrimination Enhancement for Unlocking Intrinsic Capability of MLLMs

  • Fang Peng
  • Xiaoshan Yang*
  • Yaowei Wang
  • Changsheng Xu
  • *Corresponding author for this work
  • CAS - Institute of Automation
  • Pengcheng Laboratory
  • University of Chinese Academy of Sciences
  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Although Multimodal Large Language Models (MLLMs) have shown remarkable generalization across diverse vision-language tasks, recent studies reveal their limitations in visual discrimination. These challenges arise not from insufficient model capacity, but from existing training paradigms that favor linguistic priors over detailed visual analysis. While existing approaches address this limitation through external interventions such as feature integration or knowledge augmentation, we propose a Group-Relative Visual Discrimination Enhancement framework that unlocks the intrinsic capability of MLLMs and requires no external resources. Our method introduces a Group-Relative Reinforcement Learning paradigm equipped with a lightweight Visual Patch Selection Plugin to dynamically select discriminative visual tokens. The framework establishes a self-feedback loop between the visual encoder and the language decoder, leveraging dual reward-penalty signals derived from the model's internal language feedback to optimize its visual focus, thereby enhancing the model's visual discrimination capabilities. Extensive experimental results across six visual recognition benchmarks and two VQA benchmarks demonstrate the effectiveness of our method. Code is available at https://github.com/FannierPeng/GROVE.
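The group-relative reinforcement learning the abstract refers to scores each sampled response against the other responses in its group, so above-average responses yield a positive (reward) signal and below-average ones a negative (penalty) signal. A minimal sketch of such a group-relative advantage computation, assuming standard mean/std normalization within a group; the function name and details are illustrative, not the authors' implementation:

```python
# Hypothetical sketch of group-relative advantage computation.
# All names here are assumptions for illustration only.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std.

    Responses that beat the group average receive a positive advantage
    (reward signal); weaker ones receive a negative advantage (penalty).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for a group of sampled answers to one image-question pair
advs = group_relative_advantages([1.0, 0.0, 0.5, 1.0])
```

Because the advantages are centered within each group, they sum to (approximately) zero, which is what turns the model's own relative preferences into the dual reward-penalty signal described above.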

Original language: English
Journal: IEEE Transactions on Circuits and Systems for Video Technology
DOIs
State: Accepted/In press - 2026
Externally published: Yes

Keywords

  • Multimodal Large Language Models
  • Reinforcement Learning
  • Visual Recognition
