LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant

  • Wei Li
  • Bing Hu
  • Rui Shao*
  • Leyang Shen
  • Liqiang Nie
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen

Research output: Contribution to journal, Conference article, peer-review

Abstract

First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "Fast & Slow Video-Language Thinker" as onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1) Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features, and 2) Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. They are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations of online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
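
The abstract does not give the routing criteria or the pooling operator, so the following is only an illustrative sketch of two ideas it names: dropping redundant tokens on the fast path and pooling a keyframe at several granularities on the slow path. Every name here (`drop_redundant_tokens`, `multi_granular_pool`, the `threshold` and `sizes` parameters) is hypothetical, and the cosine-similarity test and plain average pooling are stand-in assumptions, not the method used by LION-FS.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def drop_redundant_tokens(prev_tokens, cur_tokens, threshold=0.95):
    """Assumed fast-path heuristic: keep only the current-frame tokens that
    differ enough from the token at the same position in the previous frame."""
    return [cur for prev, cur in zip(prev_tokens, cur_tokens)
            if cosine(prev, cur) < threshold]


def avg_pool_grid(grid, out_size):
    """Average-pool an H x W grid of floats down to out_size x out_size.
    Assumes H and W are divisible by out_size."""
    h, w = len(grid), len(grid[0])
    bh, bw = h // out_size, w // out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            block = [grid[r][c]
                     for r in range(i * bh, (i + 1) * bh)
                     for c in range(j * bw, (j + 1) * bw)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def multi_granular_pool(grid, sizes=(1, 2, 4)):
    """Assumed slow-path step: pool one keyframe feature map at several
    granularities, from a single global value to a finer grid."""
    return {s: avg_pool_grid(grid, s) for s in sizes}


if __name__ == "__main__":
    prev = [[1.0, 0.0], [0.0, 1.0]]
    cur = [[1.0, 0.0], [1.0, 0.0]]
    print(drop_redundant_tokens(prev, cur))  # first token is unchanged, so it is dropped

    keyframe = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    print(multi_granular_pool(keyframe, sizes=(1, 2)))
```

In this toy version the fast path touches every frame but keeps few tokens, while the more expensive multi-scale pooling runs only on frames selected as keyframes, which mirrors the efficiency and efficacy split the abstract describes.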

Original language: English
Pages (from-to): 3240-3251
Number of pages: 12
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
DOIs
State: Published - 2025
Externally published: Yes
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States
Duration: 11 Jun 2025 to 15 Jun 2025
