Abstract
In recent years, large-scale pre-trained vision-language models have attracted significant attention and shown promising results in video question answering. However, the growing size of these models has made full fine-tuning impractical, creating a need for parameter-efficient transfer learning on downstream tasks. To address this challenge, we introduce a novel parameter-efficient transfer learning technique based on a temporal reasoning adapter for video question answering. Our approach captures temporal relationships within videos, equipping the model with visual reasoning ability while drawing on the knowledge of language models. Extensive experiments on four video question answering datasets show that our method matches or even outperforms full fine-tuning and state-of-the-art models while retaining the advantage of parameter efficiency.
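As background for the adapter-based technique the abstract describes, the sketch below shows the generic bottleneck adapter that parameter-efficient transfer learning builds on: a small down-project/ReLU/up-project module with a residual connection, inserted into a frozen backbone so only the adapter weights are trained. This is a minimal illustration in pure Python; the dimensions, names, and zero-initialisation choice are assumptions, and the paper's TR-Adapter adds temporal reasoning on top of this pattern, which is not shown here.

```python
# Minimal sketch of a residual bottleneck adapter, the generic building
# block of parameter-efficient transfer learning. Names and sizes are
# illustrative; the paper's TR-Adapter extends this with temporal reasoning.
import random

random.seed(0)

def linear(x, w, b):
    """y = x @ w + b for a single input vector x (pure-Python matmul)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

class Adapter:
    """Bottleneck d -> r -> d with ReLU; only these small matrices are
    trained while the pre-trained backbone stays frozen."""
    def __init__(self, d, r):
        self.w_down = [[random.gauss(0, 0.02) for _ in range(r)]
                       for _ in range(d)]
        self.b_down = [0.0] * r
        # Up-projection starts at zero so the adapter is initially the
        # identity map and does not disturb the pre-trained features.
        self.w_up = [[0.0] * d for _ in range(r)]
        self.b_up = [0.0] * d

    def __call__(self, h):
        z = [max(0.0, v) for v in linear(h, self.w_down, self.b_down)]
        out = linear(z, self.w_up, self.b_up)
        return [hi + oi for hi, oi in zip(h, out)]  # residual connection

adapter = Adapter(d=8, r=2)
h = [1.0] * 8
print(adapter(h))  # zero-initialised up-projection: output equals input
```

With hidden size d and bottleneck r (r much smaller than d), the adapter trains only 2·d·r + d + r parameters per layer instead of updating the full backbone, which is what makes the approach parameter-efficient.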
| Original language | English |
|---|---|
| Pages (from-to) | 2232-2242 |
| Number of pages | 11 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 27 |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
Keywords
- Video question answering
- adapter
- parameter-efficient transfer learning
Title
TR-Adapter: Parameter-Efficient Transfer Learning for Video Question Answering