A Survey on Video Temporal Grounding With Multimodal Large Language Model

  • Jianlong Wu
  • Wei Liu
  • Ye Liu
  • Meng Liu*
  • Liqiang Nie*
  • Zhouchen Lin
  • Chang Wen Chen

*Corresponding author for this work

Affiliations:

  • School of Computer Science and Technology, Harbin Institute of Technology
  • Pengcheng Laboratory
  • Shenzhen Loop Area Institute
  • Hong Kong Polytechnic University
  • Shandong Jianzhu University
  • Peking University
  • Guangdong Artificial Intelligence and Digital Economy Laboratory - Guangzhou

Research output: Contribution to journal › Review article › peer-review

Abstract

The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions.

Original language: English
Pages (from-to): 1521-1541
Number of pages: 21
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 48
Issue number: 2
DOIs
State: Published - Feb 2026
Externally published: Yes

Keywords

  • Video-language understanding
  • fine-grained temporal understanding
  • large language model
  • multimodal learning
  • video temporal grounding
  • vision-language model
