
MSG-CLIP: Enhancing CLIP's ability to learn fine-grained structural associations through multi-modal scene graph alignment

  • Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

As a typical representative of Vision-Language foundation models, the Contrastive Language-Image Pre-training (CLIP) framework has garnered extensive attention for its cross-modal understanding capabilities. Current methods predominantly enhance structured-information understanding by adding extra image/text branches and incorporating consistency labels, thereby establishing fine-grained structural associations within or across modalities. However, this approach increases the number of model parameters, introduces consistency errors, and restricts the range of entity types the foundation model can recognize, ultimately limiting data scalability. To address these challenges, and inspired by multi-modal knowledge-graph alignment, we propose MSG-CLIP, a novel framework that achieves efficient local Vision-Language fine-grained structured feature alignment through Multi-modal Scene Graph Alignment (MSGA), without relying on text-image consistency labels. Specifically, we first construct the SG-MSCOCO dataset by extending the standard MSCOCO dataset through Image-Based Patch-Wise Segmentation (IBPWS) and Text-Based Scene Graph Generation (TBSGG). We then design an MSGA loss function with dual optimization objectives: Entity-level Modality Alignment (EMA) and Triplet-level Relational Alignment (TRA). Crucially, this enhancement introduces no additional parameters. MSG-CLIP outperforms the baseline model on the VG-Attribution and VG-Relation benchmarks by significant margins of 11.2% and 2.5%, respectively, and demonstrates superior scene comprehension compared to existing multi-modal approaches.
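The abstract does not give the exact formulation of the MSGA loss, so the following is only a minimal sketch of one plausible reading: both objectives are treated as symmetric InfoNCE contrastive terms, with EMA aligning image-patch embeddings to scene-graph entity embeddings and TRA aligning pooled (subject, relation, object) triplet embeddings across modalities. All function names, the weighting terms `lambda_ema`/`lambda_tra`, and the temperature value are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a dual-objective MSGA-style loss (not the paper's code).
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings [B, D]."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    # Average both matching directions (anchor->positive and positive->anchor).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def msga_loss(patch_emb, entity_emb, img_triplet_emb, txt_triplet_emb,
              lambda_ema: float = 1.0, lambda_tra: float = 1.0):
    """Assumed dual-objective alignment loss.

    EMA: align image patches with scene-graph entity phrases.
    TRA: align pooled (subject, relation, object) triplet embeddings
         derived from the image and text scene graphs.
    """
    loss_ema = info_nce(patch_emb, entity_emb)             # entity-level modality alignment
    loss_tra = info_nce(img_triplet_emb, txt_triplet_emb)  # triplet-level relational alignment
    return lambda_ema * loss_ema + lambda_tra * loss_tra
```

Note that a loss of this form operates purely on embeddings produced by the existing CLIP encoders, which is consistent with the abstract's claim that the enhancement introduces no additional parameters.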

Original language: English
Article number: 112794
Journal: Pattern Recognition
Volume: 173
DOIs
State: Published - May 2026

Keywords

  • Fine-grained structural associations
  • Multi-modal scene graph alignment
  • Patch-wise segmentation
  • Vision-language foundation model

