Spatio-temporal side tuning pre-trained foundation models for video-based pedestrian attribute recognition

  • Xiao Wang
  • Qian Zhu
  • Jiandong Jin
  • Jun Zhu
  • Futian Wang*
  • Bo Jiang
  • Yaowei Wang
  • Yonghong Tian
  • *Corresponding author for this work
  • School of Computer Science and Technology, Anhui University
  • School of Artificial Intelligence, Anhui University
  • Peng Cheng Laboratory
  • Harbin Institute of Technology Shenzhen
  • Peking University

Research output: Contribution to journal, Article, peer-review

Abstract

Pedestrian Attribute Recognition (PAR) models based on static images struggle to handle issues such as occlusion and motion blur, and recently proposed video-PAR models have not fully utilized the potential of larger models, resulting in sub-optimal performance. In this work, we propose a video-PAR framework that leverages temporal information by efficiently fine-tuning a multi-modal foundation model. Specifically, we cast video-based PAR as a vision-language fusion task, using CLIP for visual feature extraction and prompt engineering to convert attributes into sentences for text embedding. We introduce a spatiotemporal side-tuning strategy for parameter-efficient optimization and fuse visual and textual tokens via a Transformer for interactive learning. The enhanced tokens are used for final attribute prediction. Experiments on two video-PAR datasets validate the effectiveness of our method. The source code of this paper is available at https://github.com/Event-AHU/OpenPAR .
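
The abstract's step of converting attributes into sentences for text embedding can be illustrated with a minimal sketch. The attribute names and the prompt template below are hypothetical placeholders, not taken from the paper or the OpenPAR repository; the sketch only shows the general idea of turning a label list into sentences that a text encoder such as CLIP's could then embed.

```python
from typing import List

# Hypothetical attribute labels; the real label sets come from the video-PAR datasets.
ATTRIBUTES = ["backpack", "long hair", "short sleeves"]

# Hypothetical template; the paper's actual prompt wording is not stated in the abstract.
TEMPLATE = "a photo of a pedestrian with {attribute}"


def attributes_to_prompts(attributes: List[str], template: str = TEMPLATE) -> List[str]:
    """Turn each attribute label into one sentence for a text encoder."""
    return [template.format(attribute=a.strip().lower()) for a in attributes]


prompts = attributes_to_prompts(ATTRIBUTES)
for prompt in prompts:
    print(prompt)
```

Each resulting sentence would be embedded once and paired with the visual tokens in the fusion stage; how the authors actually word and encode their prompts should be checked against the released source code.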

Original language: English
Article number: 104588
Journal: Computer Vision and Image Understanding
Volume: 263
State: Published - Jan 2026
Externally published: Yes

Keywords

  • Multi-modal fusion
  • Side tuning
  • Video-based pedestrian attribute recognition
  • Vision-language
