
EFIN: A Novel Enhanced Feature Interaction Network for Temporal Sentence Grounding in Videos

  • Chongxu Hu
  • Xianbin Wen
  • Yibo Zhao
  • Chunjie Ma
  • Weili Guan
  • Riwei Wang
  • Zan Gao*
  • *Corresponding author for this work
  • Tianjin University of Technology
  • Qilu University of Technology
  • Harbin Institute of Technology Shenzhen
  • Wenzhou University of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Temporal sentence grounding in videos (TSGV) is a challenging task that aims to match text queries with semantically relevant segments in untrimmed videos. However, existing methods face limitations in modeling modality features, which constrains the expressive power of candidate moment features. To address this challenge, we propose a novel Enhanced Feature Interaction Network (EFIN) that effectively captures semantic information within each modality and aligns relationships between modalities. Additionally, EFIN enhances the fusion of information between candidate moments and modality features. Specifically, our model begins by extracting modality features to generate candidate moments as priors. Building upon these modality features, we introduce an enhanced feature encoder to extract semantic information within each modality, thereby improving intra-modality feature representation. Simultaneously, the encoder captures alignment relationships between modalities to optimize cross-modality feature representation, enhancing the overall modeling capacity of modality features. Moreover, we design an information fusion module to enrich the comprehension of modality information for candidate moments. Extensive experiments on four benchmark datasets demonstrate the superiority of our proposed EFIN model. Notably, EFIN achieves maximum performance improvements of approximately 1.67% and 1.91% across different evaluation metrics on the TACoS dataset.
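
For readers new to the task, the following is a minimal sketch of how candidate moments and a grounding score are commonly handled in TSGV work. It is not the EFIN implementation: the abstract does not specify how candidates are generated or which metrics are reported, so the span-enumeration scheme, the temporal IoU measure, the recall-at-n check, and all function names below are illustrative assumptions.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair. Hypothetical representation.
Moment = Tuple[float, float]


def enumerate_candidates(num_clips: int, clip_len: float) -> List[Moment]:
    """Enumerate every contiguous span of clips as a candidate moment.

    Assumption: the video is split into `num_clips` equal clips of
    `clip_len` seconds. The abstract does not describe EFIN's scheme.
    """
    return [
        (i * clip_len, (j + 1) * clip_len)
        for i in range(num_clips)
        for j in range(i, num_clips)
    ]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked: List[Moment], gt: Moment, n: int, iou_thresh: float) -> bool:
    """True if any of the top-n ranked moments overlaps the ground truth
    with temporal IoU at or above `iou_thresh`."""
    return any(temporal_iou(m, gt) >= iou_thresh for m in ranked[:n])


if __name__ == "__main__":
    candidates = enumerate_candidates(num_clips=4, clip_len=2.0)
    ground_truth = (2.0, 6.0)
    # Rank by IoU with the ground truth purely for demonstration; a real
    # model would rank candidates by a learned query-moment matching score.
    ranked = sorted(candidates, key=lambda m: temporal_iou(m, ground_truth), reverse=True)
    print(len(candidates), ranked[0], recall_at_n(ranked, ground_truth, 1, 0.5))
```

In a trained model the ranking would come from the fused moment and query features rather than from the ground truth; the oracle ranking here only exercises the helper functions.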

Original language: English
Journal: IEEE Transactions on Multimedia
State: Accepted/In press - 2026
Externally published: Yes

Keywords

  • Enhanced Feature Encoder
  • Enhanced Feature Interaction Network
  • Information Fusion Module
  • Temporal Sentence Grounding in Videos
