Abstract
Temporal sentence grounding (TSG) aims to localize the temporal moment in an untrimmed video that semantically corresponds to a given natural language query. Great efforts have been made to solve the problem in both fully supervised and weakly supervised settings. However, fully supervised methods rely heavily on manually annotated start and end timestamps, which are laborious to obtain, while weakly supervised methods suffer from limited performance due to the lack of supervision. In this paper, we propose to address temporal sentence grounding by learning from external data. Specifically, we design an Adversarial Temporal Sentence Grounding (ATSG) framework, comprising a proposal generator and a semantic discriminator that is first pre-trained on external data. Benefiting from the pre-training, the semantic discriminator can distinguish cross-modal semantic similarities and encourages the proposal generator to produce more accurate candidates. In addition, we use an adversarial training process in the joint optimization stage, where the proposal generator and the semantic discriminator are optimized alternately in competition with each other, ultimately leading to improved TSG performance. We conduct extensive experiments on two public benchmarks, i.e., ActivityNet Captions and Charades-STA, and the results demonstrate that the proposed ATSG framework achieves state-of-the-art performance.
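The propose-then-score idea behind the task can be illustrated with a minimal toy sketch. It is not the ATSG architecture or its adversarial training: the sliding-window generator, mean pooling, and cosine-similarity scorer are stand-in assumptions, and every function name (`sliding_window_proposals`, `mean_pool`, `cosine`, `ground`) is hypothetical. It only shows how candidate segments might be enumerated and ranked against a query embedding.

```python
import math


def sliding_window_proposals(num_clips, window_sizes):
    """Toy stand-in for a proposal generator (assumption, not the paper's).

    Returns (start, end) clip-index pairs, with `end` exclusive.
    """
    proposals = []
    for width in window_sizes:
        for start in range(0, num_clips - width + 1):
            proposals.append((start, start + width))
    return proposals


def mean_pool(features, start, end):
    """Average the per-clip feature vectors in [start, end)."""
    dim = len(features[0])
    count = end - start
    return [sum(features[i][d] for i in range(start, end)) / count
            for d in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground(features, query, window_sizes):
    """Toy stand-in for a semantic scorer: pick the best-matching segment."""
    proposals = sliding_window_proposals(len(features), window_sizes)
    return max(proposals,
               key=lambda p: cosine(mean_pool(features, p[0], p[1]), query))


# Made-up clip features: clips 2 and 3 together match the query direction.
clip_features = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 1.0, 0.0]

best_segment = ground(clip_features, query_embedding, window_sizes=(1, 2, 3))
print(best_segment)
```

In this toy data the segment covering clips 2 and 3 pools to a vector parallel to the query, so it ranks highest. A learned generator and discriminator would replace both the window enumeration and the fixed similarity function.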
| Original language | English |
|---|---|
| Article number | 111621 |
| Journal | Pattern Recognition |
| Volume | 165 |
| State | Published - Sep 2025 |
| Externally published | Yes |
Keywords
- Adversarial training
- Cross-modal alignment
- External data
- Temporal sentence grounding