
Semantic-Aware Contrastive Learning With Proposal Suppression for Video Semantic Role Grounding

  • Meng Liu
  • Di Zhou
  • Jie Guo
  • Xin Luo
  • Zan Gao
  • Liqiang Nie*
  • *Corresponding author for this work
  • Shandong Jianzhu University
  • Shandong University
  • Qilu University of Technology
  • School of Computer Science and Technology, Harbin Institute of Technology

Research output: Contribution to journal, Article, peer-review

Abstract

Video semantic role grounding has gained substantial interest from both the academic and industrial communities. While existing methods have demonstrated considerable performance improvements, the influence of noisy and intra-object proposals, referring to proposals with the same object label, has yet to be explored in video semantic role grounding. In this study, we propose a semantic-aware contrastive learning network with proposal suppression to enhance the accuracy of grounding referenced objects. To fully exploit the semantic information in each semantic role, we introduce a novel semantic role encoding module that allows for precise representations of each semantic role. We also design a semantic-aware proposal suppression network to reduce the impact of noisy proposals on object representation learning. Additionally, we propose a proposal contrastive loss to improve cross-modal alignment and reduce the effect of irrelevant intra-object proposals. Extensive experiments on four datasets demonstrate that our model achieves significant improvements over state-of-the-art methods.

Original language: English
Pages (from-to): 3003-3016
Number of pages: 14
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 4
State: Published - 1 Apr 2024
Externally published: Yes

Keywords

  • Video semantic role grounding
  • cross-modal retrieval
  • proposal contrastive learning
