Abstract
As a typical representative of Vision-Language foundation models, the Contrastive Language-Image Pre-training (CLIP) framework has garnered extensive attention due to its cross-modal understanding capabilities. Current methods predominantly improve structured information understanding by adding image/text branches and incorporating consistency labels, thereby establishing fine-grained structural associations within or across modalities. However, this approach increases the number of model parameters, introduces consistency errors, and restricts the range of entity types that foundation models can recognize, ultimately limiting subsequent data scalability. To address these challenges, and inspired by multi-modal knowledge graph alignment, we propose MSG-CLIP, a novel framework that achieves efficient local Vision-Language fine-grained structured feature alignment through Multi-modal Scene Graph Alignment (MSGA), without relying on text-image consistency labels. Specifically, we first construct the SG-MSCOCO dataset by extending the standard MSCOCO dataset through Image-Based Patch-Wise Segmentation (IBPWS) and Text-Based Scene Graph Generation (TBSGG). Subsequently, we design an MSGA loss function with dual optimization objectives: Entity-level Modality Alignment (EMA) and Triplet-level Relational Alignment (TRA). Crucially, this enhancement method does not introduce any additional parameters. MSG-CLIP outperforms the baseline model on the VG-Attribution and VG-Relation benchmarks by significant margins of 11.2% and 2.5%, respectively. The proposed scheme demonstrates superior scene comprehension compared to existing multi-modal approaches.
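The abstract does not give the form of the MSGA loss, so the following is only a minimal sketch of how a two-term, parameter-free alignment objective could be composed. It assumes that both EMA and TRA are symmetric CLIP-style contrastive (InfoNCE) terms, that triplets arrive as single pooled feature vectors, and that the two terms are combined by a weighted sum. The function names, the temperature value, and the weights `w_entity` and `w_triplet` are all hypothetical and not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    anchors = [normalize(a) for a in anchors]
    targets = [normalize(t) for t in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def symmetric_contrastive(img_feats, txt_feats, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE terms."""
    return 0.5 * (
        info_nce(img_feats, txt_feats, temperature)
        + info_nce(txt_feats, img_feats, temperature)
    )


def msga_loss_sketch(patch_entities, text_entities,
                     visual_triplets, text_triplets,
                     w_entity=1.0, w_triplet=1.0):
    """Hypothetical two-term objective: entity-level plus triplet-level alignment.

    Assumption: row i of each visual input corresponds to row i of its text input.
    The loss adds no trainable parameters; it only consumes encoder features.
    """
    ema = symmetric_contrastive(patch_entities, text_entities)
    tra = symmetric_contrastive(visual_triplets, text_triplets)
    return w_entity * ema + w_triplet * tra


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0]]   # toy patch-level entity features
    words = [[0.9, 0.1], [0.1, 0.9]]     # toy text entity features
    print(msga_loss_sketch(patches, words, patches, words))
```

Under these assumptions the loss is close to zero when corresponding visual and textual features coincide, and it grows when the pairing is shuffled. Alignment is driven only by the paired scene-graph elements, with no separate consistency label.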
| Original language | English |
|---|---|
| Article number | 112794 |
| Journal | Pattern Recognition |
| Volume | 173 |
| State | Published - May 2026 |
Keywords
- Fine-grained structural associations
- Multi-modal scene graph alignment
- Patch-wise segmentation
- Vision-language foundation model
Fingerprint
Dive into the research topics of 'MSG-CLIP: Enhancing CLIP's ability to learn fine-grained structural associations through multi-modal scene graph alignment'. Together they form a unique fingerprint.