MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking

  • Xinqi Liu
  • Li Zhou
  • Zikun Zhou*
  • Jianqiu Chen
  • Zhenyu He*
  • *Corresponding author for this work
  • Harbin Institute of Technology Shenzhen
  • Pengcheng Laboratory

Research output: Contribution to journal › Conference article › peer-review

Abstract

The vision-language tracking task aims to perform object tracking based on various modality references. Existing Transformer-based vision-language tracking methods have made remarkable progress by leveraging the global modeling ability of self-attention. However, current approaches still face challenges in effectively exploiting the temporal information and dynamically updating reference features during tracking. Recently, the State Space Model (SSM), known as Mamba, has shown astonishing ability in efficient long-sequence modeling. Particularly, its state space evolving process demonstrates promising capabilities in memorizing multimodal temporal information with linear complexity. Witnessing its success, we propose a Mamba-based vision-language tracking model to exploit its state space evolving ability in temporal space for robust multimodal tracking, dubbed MambaVLT. In particular, our approach mainly integrates a time-evolving hybrid state space block and a selective locality enhancement block, to capture contextual information for multimodal modeling and adaptive reference feature update. Besides, we introduce a modality-selection module that dynamically adjusts the weighting between visual and language references, mitigating potential ambiguities from either reference type. Extensive experimental results show that our method performs favorably against state-of-the-art trackers across diverse benchmarks.

Original language: English
Pages (from-to): 8731-8741
Number of pages: 11
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
State: Published - 2025
Externally published: Yes
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States
Duration: 11 Jun 2025 to 15 Jun 2025

Keywords

  • mamba
  • state space model
  • time-evolving
  • vision-language tracking
