Abstract
Pedestrian Attribute Recognition (PAR) models based on static images struggle with occlusion and motion blur, and recently proposed video-PAR models have not fully exploited the potential of larger models, resulting in sub-optimal performance. In this work, we propose a video-PAR framework that leverages temporal information by efficiently fine-tuning a multi-modal foundation model. Specifically, we cast video-based PAR as a vision-language fusion task, using CLIP for visual feature extraction and prompt engineering to convert attributes into sentences for text embedding. We introduce a spatiotemporal side-tuning strategy for parameter-efficient optimization and fuse the visual and textual tokens via a Transformer for interactive learning. The enhanced tokens are then used for final attribute prediction. Experiments on two video-PAR datasets validate the effectiveness of our method. The source code of this paper is available at https://github.com/Event-AHU/OpenPAR.
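The prompt-engineering step mentioned in the abstract, which converts attribute names into sentences before text embedding, could look roughly like the sketch below. The attribute names, the template wording, and the function name `attributes_to_prompts` are assumptions made for illustration; they are not taken from the paper or from the OpenPAR repository.

```python
# Hypothetical attribute names and prompt template, for illustration only;
# the paper's actual attribute list and wording may differ.
ATTRIBUTES = ["backpack", "hat", "long hair"]
TEMPLATE = "the pedestrian in the video has the attribute: {}"


def attributes_to_prompts(attributes, template=TEMPLATE):
    """Turn each attribute name into one natural-language sentence."""
    return [template.format(attr.strip().lower()) for attr in attributes]


prompts = attributes_to_prompts(ATTRIBUTES)
for prompt in prompts:
    print(prompt)
```

In a pipeline of the kind the abstract describes, each resulting sentence would be passed to a text encoder, giving one text token per attribute to fuse with the visual tokens.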
| Original language | English |
|---|---|
| Article number | 104588 |
| Journal | Computer Vision and Image Understanding |
| Volume | 263 |
| State | Published - Jan 2026 |
| Externally published | Yes |
Keywords
- Multi-modal fusion
- Side tuning
- Video-based pedestrian attribute recognition
- Vision-language
Title
Spatio-temporal side tuning pre-trained foundation models for video-based pedestrian attribute recognition