Abstract
Vision Transformer (ViT) has demonstrated powerful feature learning capabilities in image pre-training, yet its potential in video correspondence tasks remains unexplored. While it is possible to fully fine-tune a ViT for such tasks, doing so entails high computational costs that are often unnecessary. To overcome this, we introduce the I2V-Adapter, a lightweight module trained with a triplet-based inter-frame consistency loss and a region-wise intra-frame contrastive loss. Designed to rapidly adapt image pre-trained ViTs to video correspondence tasks while keeping the original ViT parameters frozen, our method leverages the inter-frame loss to capture temporal coherence and the intra-frame loss to enhance spatial discrimination within each frame. Extensive experiments demonstrate that the I2V-Adapter outperforms existing methods across various video tasks, including video object segmentation, body part propagation, and human keypoint tracking. Furthermore, the I2V-Adapter is computationally efficient, requiring only approximately 2.6 hours of training on a single NVIDIA RTX 3090 GPU (17.3% of the time needed for full fine-tuning).
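As a rough illustration of the training recipe the abstract describes (a frozen image pre-trained ViT, a small trainable adapter, an inter-frame triplet loss, and an intra-frame region-wise contrastive loss), here is a minimal PyTorch sketch. All module names, dimensions, and loss details (margin, temperature, region pooling) are assumptions for illustration only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class I2VAdapterSketch(nn.Module):
    """Hypothetical sketch: frozen image pre-trained ViT + trainable adapter.

    `vit` is assumed to map a batch of images to patch tokens of shape
    (B, N, dim); the bottleneck adapter is the only part that trains.
    """

    def __init__(self, vit: nn.Module, dim: int = 768, hidden: int = 128):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():      # keep the original ViT frozen
            p.requires_grad = False
        self.adapter = nn.Sequential(        # lightweight bottleneck adapter
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) -> per-frame patch tokens (B, T, N, dim)
        B, T = frames.shape[:2]
        tokens = self.vit(frames.flatten(0, 1))    # (B*T, N, dim)
        tokens = tokens + self.adapter(tokens)     # residual adaptation
        tokens = F.normalize(tokens, dim=-1)
        return tokens.view(B, T, *tokens.shape[1:])


def interframe_triplet_loss(anchor, positive, negative, margin: float = 0.5):
    """Temporal coherence: a token in frame t should lie closer to its
    match in frame t+1 (positive) than to a mismatched token (negative)."""
    d_pos = (anchor - positive).pow(2).sum(-1)
    d_neg = (anchor - negative).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()


def intraframe_region_contrastive_loss(tokens, region_ids, tau: float = 0.07):
    """Spatial discrimination: contrast each token against region prototypes
    (the mean token of each region) within the same frame, InfoNCE-style."""
    # tokens: (N, dim), L2-normalized; region_ids: (N,) long region labels
    num_regions = int(region_ids.max()) + 1
    onehot = F.one_hot(region_ids, num_regions).float()    # (N, R)
    prototypes = F.normalize(onehot.t() @ tokens, dim=-1)  # (R, dim)
    logits = tokens @ prototypes.t() / tau                 # (N, R)
    return F.cross_entropy(logits, region_ids)
```

A training step under these assumptions would sum the two losses (e.g. `loss = l_inter + lam * l_intra`, with `lam` a placeholder weight) and step an optimizer over `model.adapter.parameters()` only, which is what makes the adaptation fast relative to full fine-tuning.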
| Original language | English |
|---|---|
| Article number | 113228 |
| Journal | Pattern Recognition |
| Volume | 177 |
| DOIs | |
| State | Published - Sep 2026 |
Keywords
- Fast adaptation
- Image pre-trained
- Self-supervised
- Video correspondence learning