Abstract
Visual grounding, which aims to ground a visual region via natural language, is a task that relies heavily on cross-modal alignment. Existing works use uni-modal pre-trained models to transfer visual or linguistic knowledge separately, ignoring multimodal correspondence information. Motivated by recent advances in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. To address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge addresses the inconsistency between the visual features produced by pre-training and those required for grounding, and establishes a connection between multi-level visual and text features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase its significant grounding capabilities as well as promising energy efficiency advantages. Project page: https://github.com/linhuixiao/HiVG.
| Original language | English |
|---|---|
| Title of host publication | MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 5460-5469 |
| Number of pages | 10 |
| ISBN (Electronic) | 9798400706868 |
| State | Published - 28 Oct 2024 |
| Externally published | Yes |
| Event | 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia. Duration: 28 Oct 2024 → 1 Nov 2024 |
Publication series
| Name | MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia |
|---|---|
Conference
| Conference | 32nd ACM International Conference on Multimedia, MM 2024 |
|---|---|
| Country/Territory | Australia |
| City | Melbourne |
| Period | 28/10/24 → 1/11/24 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 7 Affordable and Clean Energy
Keywords
- hierarchical
- low-rank adaptation
- multimodality
- referring expression comprehension
- visual grounding
Fingerprint
Dive into the research topics of 'HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding'. Together they form a unique fingerprint.