
Caption Assisted Multimodal Large Language Model for Video Moment Retrieval

  • Harbin Institute of Technology Shenzhen
  • The Chinese University of Hong Kong, Shenzhen

Research output: Contribution to journal / Article / peer-review

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant potential across various multimodal tasks, including retrieval, summarization, and reasoning. However, it remains a substantial challenge for MLLMs to understand and precisely retrieve specific moments from a video, a task that requires fine-grained spatial and temporal understanding. To overcome this, we propose the Caption Assisted MLLM from Coarse to finE (CALCE), a novel two-stage framework designed for enhanced moment retrieval. In the first stage, captions extracted from the audio assist the MLLM, providing a robust foundation for precise moment retrieval. To manage the memory consumed by this additional data, a clustering algorithm is applied to the sparsely sampled video frames, categorizing them into key frames and non-key frames. The second stage focuses on recalling missed moments and refining moment boundaries by adopting a higher sampling rate. In this stage, predictions from the first stage cast votes for their correlated densely sampled frames, thereby filtering out less relevant frames. By repeating the first-stage process on these selected frames, CALCE progressively retrieves video moments from coarse to precise. Experiments on QVHighlights and Charades-STA demonstrate the effectiveness of CALCE, which outperforms existing state-of-the-art methods.
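
The voting step of the second stage can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so everything here is an assumption made for illustration: the function name `select_dense_frames`, the representation of first-stage predictions as `(start, end, score)` tuples in seconds, the temporal `margin` added around each prediction, and the `min_votes` threshold are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def select_dense_frames(
    coarse_moments: List[Tuple[float, float, float]],
    duration: float,
    dense_fps: float,
    margin: float = 2.0,
    min_votes: float = 0.5,
) -> List[int]:
    """Return indices of densely sampled frames that receive enough votes.

    Assumption: each first-stage prediction (start, end, score) adds its
    score as a vote to every dense frame whose timestamp lies inside the
    predicted span, widened by `margin` seconds on both sides.
    """
    num_frames = int(duration * dense_fps)
    votes = [0.0] * num_frames

    for start, end, score in coarse_moments:
        lo = max(0.0, start - margin)
        hi = min(duration, end + margin)
        for i in range(num_frames):
            t = i / dense_fps  # timestamp of dense frame i, in seconds
            if lo <= t <= hi:
                votes[i] += score

    # Keep only frames whose accumulated vote reaches the threshold;
    # the remaining frames are treated as less relevant and dropped.
    return [i for i, v in enumerate(votes) if v >= min_votes]


if __name__ == "__main__":
    kept = select_dense_frames(
        [(2.0, 4.0, 0.9), (8.0, 9.0, 0.3)],
        duration=10.0,
        dense_fps=2.0,
        margin=1.0,
    )
    print(kept)
```

In this sketch a confident coarse prediction keeps the dense frames around it, while a low-scoring prediction on its own does not reach the threshold. A real system would presumably feed the kept frames back through the first-stage pipeline, as the abstract describes.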

Original language: English
Pages (from-to): 6755-6766
Number of pages: 12
Journal: IEEE Transactions on Image Processing
Volume: 34
State: Published - 2025
Externally published: Yes

Keywords

  • Moment retrieval
  • multi-stage training
  • multimodal large language models
