Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation

  • Harbin Institute of Technology
  • Pengcheng Laboratory
  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Accurately detecting objects and their interrelationships for Video Scene Graph Generation (VidSGG) confronts two primary challenges. The first is identifying the active objects that interact with humans among the numerous background objects; the second is the long-tailed distribution of predicate classes. To tackle these challenges, we propose STABILE, a novel framework with a spatial-temporal saliency-guided contrastive learning scheme. For the first challenge, STABILE features an active object retriever that includes an object saliency fusion block, which enhances object embeddings with motion cues, alongside an object temporal encoder that captures temporal dependencies. For the second challenge, STABILE introduces an unbiased relationship representation learning module with an Unbiased Multi-Label (UML) contrastive loss to mitigate the effect of the long-tailed distribution. With enhancements in both aspects, STABILE substantially boosts the accuracy of scene graph generation. Extensive experiments demonstrate the superiority of STABILE, setting new benchmarks in the field with enhanced accuracy and unbiased scene graph generation.
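The abstract does not give the UML contrastive loss in closed form, so the following is only a minimal, hypothetical sketch of one common way to de-bias a multi-label contrastive objective under a long-tailed predicate distribution: a supervised contrastive loss in which each relation embedding is reweighted by the inverse training frequency of its predicate labels. All names here (`z`, `y`, `class_counts`, the function itself) are illustrative assumptions, not the paper's implementation.

```python
# Sketch only, NOT the paper's UML loss: the exact formulation, the saliency
# fusion block, and the temporal encoder are not specified in the abstract.
# Assumed inputs: relation embeddings z (N, D), multi-hot predicate labels
# y (N, C), and training-set predicate counts class_counts (C,).
import torch
import torch.nn.functional as F

def weighted_multilabel_contrastive_loss(z, y, class_counts, temperature=0.1):
    z = F.normalize(z, dim=1)                       # unit-norm embeddings
    sim = z @ z.t() / temperature                   # (N, N) scaled similarities
    y = y.float()
    pos_mask = (y @ y.t()) > 0                      # share >= 1 predicate label
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask &= ~eye                                # drop self-pairs
    # Inverse-frequency sample weights: relations whose predicates sit in the
    # tail of the distribution contribute more, de-biasing the head classes.
    inv_freq = 1.0 / class_counts.float().clamp(min=1)
    w = (y * inv_freq).sum(1) / y.sum(1).clamp(min=1)
    # InfoNCE-style log-probability over all other samples in the batch.
    logits = sim.masked_fill(eye, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss_i = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    has_pos = pos_mask.any(1).float()               # skip samples w/o positives
    return (w * loss_i * has_pos).sum() / has_pos.sum().clamp(min=1)
```

In this sketch, samples sharing at least one predicate label act as positives, and the inverse-frequency weight keeps head predicates from dominating the learned relation embedding space.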

Original language: English
Pages (from-to): 3092-3104
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 27
DOIs
State: Published - 2025
Externally published: Yes

Keywords

  • Contrastive learning
  • long-tailed learning
  • video scene graph generation
