Abstract
Although Multimodal Large Language Models (MLLMs) have shown remarkable generalization across diverse vision-language tasks, recent studies reveal their limitations in visual discrimination. These challenges arise not from insufficient model capacity, but from existing training paradigms that favor linguistic priors over detailed visual analysis. While existing approaches address this limitation through external interventions such as feature integration or knowledge augmentation, we propose a Group-Relative Visual Discrimination Enhancement framework that unlocks the intrinsic capability of MLLMs and requires no external resources. Our method introduces a Group-Relative Reinforcement Learning paradigm equipped with a lightweight Visual Patch Selection Plugin that dynamically selects discriminative visual tokens. The framework establishes a self-feedback loop between the visual encoder and the language decoder, leveraging dual reward-penalty signals derived from the model's internal language feedback to optimize visual focus and thereby enhance the model's visual discrimination capability. Extensive experiments across six visual recognition benchmarks and two VQA benchmarks demonstrate the effectiveness of our method. Code is available at https://github.com/FannierPeng/GROVE.
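The abstract only outlines the training loop, so the following is a minimal, hypothetical PyTorch sketch of how a group-relative update over patch selections could look. The names (`PatchSelectionPlugin`, `group_relative_loss`), the stochastic top-k sampling via `multinomial`, the `keep_ratio`, and the ±1 reward-penalty form are illustrative assumptions, not the paper's actual implementation; the authors' code is in the linked GROVE repository.

```python
import torch
import torch.nn as nn


class PatchSelectionPlugin(nn.Module):
    """Illustrative stand-in for a lightweight Visual Patch Selection Plugin:
    scores each visual token and stochastically keeps k discriminative ones."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one selection logit per visual token
        self.keep_ratio = keep_ratio

    def forward(self, visual_tokens: torch.Tensor):
        # visual_tokens: (batch, num_tokens, dim) from the visual encoder
        logits = self.scorer(visual_tokens).squeeze(-1)            # (batch, num_tokens)
        k = max(1, int(self.keep_ratio * visual_tokens.size(1)))
        probs = torch.softmax(logits, dim=-1)
        keep_idx = torch.multinomial(probs, k, replacement=False)  # sampled token indices
        kept = torch.gather(
            visual_tokens, 1,
            keep_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1)),
        )
        log_probs = torch.log_softmax(logits, dim=-1)              # for the policy gradient
        return kept, keep_idx, log_probs


def group_relative_loss(log_probs: torch.Tensor,
                        keep_idx: torch.Tensor,
                        rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative (GRPO-style) policy-gradient loss over patch selections.

    log_probs: (group, num_tokens) selection log-probabilities per rollout
    keep_idx:  (group, k)          token indices sampled in each rollout
    rewards:   (group,)            reward-penalty signal per rollout, e.g. +1 if
                                   the language decoder answers correctly and -1
                                   otherwise (an assumed form of the signal)
    """
    # Normalize rewards within the group, so no learned value baseline is needed.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Approximate log-probability of each sampled selection (sum over kept tokens).
    rollout_lp = torch.gather(log_probs, 1, keep_idx).sum(dim=-1)
    return -(advantages * rollout_lp).mean()


if __name__ == "__main__":
    # Usage sketch: one image, a group of 4 stochastic patch selections.
    plugin = PatchSelectionPlugin(dim=1024)
    tokens = torch.randn(1, 576, 1024).expand(4, -1, -1)  # same image repeated per rollout
    kept, idx, lp = plugin(tokens)
    # Reward-penalty signal from the model's own language feedback (assumed +1/-1 form).
    rewards = torch.tensor([1.0, -1.0, 1.0, -1.0])
    loss = group_relative_loss(lp, idx, rewards)
    loss.backward()  # updates only the plugin's scorer
```

The group normalization is what makes the update "group-relative": each sampled selection is scored only against the other selections for the same input, so the visual side can be optimized from the language decoder's feedback without a learned critic or any external supervision.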
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| State | Accepted/In press - 2026 |
| Externally published | Yes |
Keywords
- Multimodal Large Language Models
- Reinforcement Learning
- Visual Recognition