Abstract
The datasets in natural language video localization are relabeled from other tasks, leading to severe bias issues that hinder effective model training. Current methods primarily address distributional and modal biases in datasets but lack comprehensive solutions for the two types of annotation biases introduced during dataset labeling. To tackle this problem, we propose a multimodal-guided mixture-of-experts bias removal strategy. This method simulates diverse query statements by introducing Gaussian noise, employs multiple general experts to mimic different annotation tendencies, and uses a shared expert to extract common features from the annotation process, thereby addressing uncertainty in target moment annotations. To better balance the contributions of the experts, we introduce auxiliary losses: an importance loss, a load loss, and a KL divergence loss. Extensive experiments on two widely used datasets, Charades-STA and ActivityNet Captions, with implementations on four backbone networks, demonstrate the effectiveness of our approach.
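The pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the function names, the linear experts, the softmax gate, the two-value output standing in for start and end scores, and the noise scale are all hypothetical. The importance loss is written as the squared coefficient of variation of per-expert gate mass, a common formulation in the mixture-of-experts literature; the abstract does not state which form the paper uses, and the load and KL divergence losses are omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product: weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def add_gaussian_noise(query, sigma, rng):
    """Perturb a query embedding to simulate a differently worded query."""
    return [q + rng.gauss(0.0, sigma) for q in query]


def moe_forward(query, experts, shared, gate_w):
    """Shared-expert output plus a gate-weighted sum of general experts."""
    gates = softmax(linear(gate_w, query))
    shared_out = linear(shared, query)
    mixed = [0.0] * len(shared_out)
    for g, expert in zip(gates, experts):
        out = linear(expert, query)
        mixed = [m + g * o for m, o in zip(mixed, out)]
    return [s + m for s, m in zip(shared_out, mixed)], gates


def importance_loss(gate_rows):
    """Squared coefficient of variation of per-expert gate mass over a batch.

    Assumed formulation; it is zero when all experts receive equal total weight.
    """
    n_experts = len(gate_rows[0])
    importance = [sum(row[e] for row in gate_rows) for e in range(n_experts)]
    mean = sum(importance) / n_experts
    var = sum((v - mean) ** 2 for v in importance) / n_experts
    return var / (mean ** 2 + 1e-10)


# Hypothetical toy setup: 4-dim query embeddings, 3 general experts,
# and a 2-value output standing in for (start, end) scores.
rng = random.Random(0)
dim, n_experts, out_dim = 4, 3, 2


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


experts = [rand_matrix(out_dim, dim) for _ in range(n_experts)]
shared = rand_matrix(out_dim, dim)
gate_w = rand_matrix(n_experts, dim)
queries = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(8)]

outputs, gate_rows = [], []
for q in queries:
    noisy = add_gaussian_noise(q, sigma=0.1, rng=rng)
    out, gates = moe_forward(noisy, experts, shared, gate_w)
    outputs.append(out)
    gate_rows.append(gates)

aux = importance_loss(gate_rows)
```

In a trained model, a term like `aux` would be added to the task loss with a small weight so that no single expert dominates the gate.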
| Original language | English |
|---|---|
| Article number | 61 |
| Journal | Multimedia Systems |
| Volume | 32 |
| Issue number | 1 |
| DOIs | |
| State | Published - Feb 2026 |
| Externally published | Yes |
Keywords
- Auxiliary Loss
- Mixture-of-experts
- Multimodal-guided
- Natural Language Video Localization
Fingerprint
Dive into the research topics of 'Multimodal-guided mixture-of-experts bias removal strategy for natural language video localization'. Together they form a unique fingerprint.