Abstract
First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "Fast & Slow Video-Language Thinker" as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1) Fast Path: Routing-Based Response Determination evaluates frame by frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher-frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing the number of tokens, while utilizing Token Dropping Routing to eliminate redundant features. 2) Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond the atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
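The fast/slow split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how routing is computed, so the cosine-similarity criterion, the `threshold` value, and the names `drop_redundant_tokens`, `run_assistant`, `should_respond`, and `generate_response` are all hypothetical stand-ins chosen only to show the control flow (a cheap per-frame decision, with full response generation reserved for frames that need it).

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant_tokens(prev_tokens, cur_tokens, threshold=0.95):
    """Assumed stand-in for token dropping: discard a token when it is nearly
    identical to the token at the same position in the previous frame."""
    if prev_tokens is None:
        return list(cur_tokens)
    return [
        cur for prev, cur in zip(prev_tokens, cur_tokens)
        if cosine(prev, cur) < threshold
    ]


def run_assistant(frames, should_respond, generate_response):
    """Fast path on every frame; slow path only when a response is needed.

    `should_respond` and `generate_response` are hypothetical callables that
    stand in for the response-determination and response-generation models.
    """
    responses = []
    prev = None
    for index, tokens in enumerate(frames):
        kept = drop_redundant_tokens(prev, tokens)  # fast path: prune features
        prev = tokens
        if should_respond(kept):
            # Slow path: use the full keyframe tokens for generation.
            responses.append((index, generate_response(tokens)))
    return responses
```

In this toy setup each frame is a list of token vectors, and a frame that repeats the previous one contributes no tokens to the decision step, so the slow path is skipped for it. A learned router, as the abstract describes, would replace the fixed similarity threshold.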
| Original language | English |
|---|---|
| Pages (from-to) | 3240-3251 |
| Number of pages | 12 |
| Journal | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
| Event | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, United States. Duration: 11 Jun 2025 → 15 Jun 2025 |
Fingerprint
Dive into the research topics of 'LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant'. Together they form a unique fingerprint.