I2V-Adapter: Fast adapting image pre-trained models for video correspondence

Hannan Lu, Xinyu Zhang*, Zhi Tian, Xiaohe Wu, Wangmeng Zuo, Jingdong Wang

*Corresponding author for this work

  • Harbin Institute of Technology
  • The University of Auckland
  • ByteDance Ltd.
  • Baidu Inc.

Research output: Contribution to journal › Article › peer-review

Abstract

Vision Transformer (ViT) has demonstrated powerful feature learning capabilities in image pre-training, yet its potential for video correspondence tasks remains unexplored. While it is possible to fully fine-tune a ViT for such tasks, doing so entails high computational costs that are often unnecessary. To overcome this, we introduce the I2V-Adapter, a lightweight module trained with a triplet-based inter-frame consistency loss and a region-wise intra-frame contrastive loss. Designed to rapidly adapt image pre-trained ViTs to video correspondence tasks while keeping the original ViT parameters frozen, our method leverages the inter-frame loss to capture temporal coherence and the intra-frame loss to enhance spatial discrimination within each frame. Extensive experiments demonstrate that the I2V-Adapter outperforms existing methods across various video tasks, including video object segmentation, body part propagation, and human keypoint tracking. Furthermore, the I2V-Adapter is computationally efficient, requiring only approximately 2.6 hours of training (17.3% of the time needed for full fine-tuning) on a single NVIDIA RTX 3090 GPU.
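The abstract names three ingredients: a frozen image pre-trained ViT, a lightweight adapter, and two objectives (triplet-based inter-frame consistency, region-wise intra-frame contrast). The sketch below shows how such a setup could be wired in PyTorch. It is a minimal illustration, not the paper's implementation: the bottleneck Adapter design, all dimensions, the equal loss weighting, and the use of pooled frame features and noise-perturbed "second views" are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Hypothetical lightweight adapter: down-project, GELU, up-project, residual."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's features.
        return x + self.up(F.gelu(self.down(x)))


# Stand-in for an image pre-trained ViT; any backbone producing per-patch
# tokens of shape (batch, tokens, dim) would fit the same pattern.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
for p in backbone.parameters():
    p.requires_grad = False  # backbone stays frozen; only the adapter trains

adapter = Adapter(dim=768)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
triplet = nn.TripletMarginLoss(margin=0.2)


def region_info_nce(anchor, positive, temperature=0.07):
    """InfoNCE over regions of one frame: each region is positive with its own
    second view and negative against every other region in the frame."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature        # (regions, regions) similarities
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)


# Random stand-ins for patch tokens of two consecutive frames
# (8 clips, 14x14 = 196 patches, dim 768).
tokens_t = torch.randn(8, 196, 768)
tokens_t1 = torch.randn(8, 196, 768)

feats_t = adapter(backbone(tokens_t))       # (8, 196, 768)
feats_t1 = adapter(backbone(tokens_t1))

# Inter-frame consistency: pooled frame features; the positive is the next
# frame of the same clip, the negative a frame from another clip.
anchor = feats_t.mean(dim=1)
positive = feats_t1.mean(dim=1)
negative = positive.roll(shifts=1, dims=0)
loss_inter = triplet(anchor, positive, negative)

# Intra-frame contrast: regions of one frame versus a crudely perturbed second
# view (a real pipeline would use matched regions from an augmented frame).
regions = feats_t[0, :32]
second_view = regions + 0.01 * torch.randn_like(regions)
loss_intra = region_info_nce(regions, second_view)

(loss_inter + loss_intra).backward()
optimizer.step()
```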

Original language: English
Article number: 113228
Journal: Pattern Recognition
Volume: 177
DOIs
State: Published - Sep 2026

Keywords

  • Fast adaptation
  • Image pre-trained
  • Self-supervised
  • Video correspondence learning
