
Recovering Generalization via Pre-Training-Like Knowledge Distillation for Out-of-Distribution Visual Question Answering

  • Yaguang Song
  • Xiaoshan Yang
  • Yaowei Wang
  • Changsheng Xu*
  • *Corresponding author for this work
  • CAS - Institute of Automation
  • University of Chinese Academy of Sciences
  • Peng Cheng Laboratory

Research output: Contribution to journal › Article › peer-review

Abstract

With the emergence of large-scale multi-modal foundation models, significant improvements have been made in Visual Question Answering (VQA) in recent years via the 'Pre-training and Fine-tuning' paradigm. However, the fine-tuned VQA model, which is more specialized for the downstream training data, may fail to generalize well when there is a distribution shift between the training and test data, which is known as the Out-of-Distribution (OOD) problem. An intuitive way to solve this problem is to transfer the common knowledge from the foundation model to the fine-tuned VQA model via knowledge distillation for better generalization. However, the generality of knowledge distilled using only the task-specific training data is questionable because of the bias between the training and test data. Ideally, the pre-training data would be used to distill the common knowledge shared by the training and OOD test samples, but this is impractical because of the huge size of the pre-training data. Based on these considerations, in this article we propose a method, named Pre-training-like Knowledge Distillation (PKD), that imitates the pre-training feature distribution and leverages it to distill the common knowledge, which can improve the generalization performance of the fine-tuned model for OOD VQA. Specifically, we first leverage the in-domain VQA data as guidance and adopt two cross-modal feature prediction networks, which are learned under the supervision of an image-text matching loss and a feature divergence loss, to estimate pre-training-like vision and text features. Next, we conduct feature-level distillation by explicitly integrating the downstream VQA input features with the predicted pre-training-like features through a memory mechanism. Meanwhile, we also conduct model-level distillation by constraining the image-text matching output of the downstream VQA model and the output of the foundation model for the pre-training-like image and text features. Extensive experiments on the VQA-CP v2 and VQA v2 datasets demonstrate the effectiveness of our method.
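
As a rough illustration of the model-level distillation idea mentioned in the abstract, the sketch below compares the image-text matching (ITM) scores of a teacher (foundation model) and a student (fine-tuned VQA model) on the same image-text pair. It is not the authors' implementation: the abstract does not name the constraint that is used, so the KL divergence, the two-class `[no-match, match]` logit layout, the `temperature` parameter, and all function names here are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def itm_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Hypothetical model-level distillation term.

    Both arguments are assumed to be [no-match, match] ITM logits produced
    for the same image-text pair. The loss is zero when the student's ITM
    distribution equals the teacher's and grows as they diverge.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)


# Made-up logits, for illustration only.
agree = itm_distillation_loss([0.2, 2.0], [0.2, 2.0])
disagree = itm_distillation_loss([0.2, 2.0], [2.0, 0.2])
```

In this sketch `agree` is zero because the two ITM distributions coincide, while `disagree` is positive because the student ranks the pair opposite to the teacher. In a training loop such a term would typically be added to the VQA task loss with a weighting coefficient, which is again an assumption rather than something stated in the abstract.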

Original language: English
Pages (from-to): 837-851
Number of pages: 15
Journal: IEEE Transactions on Multimedia
Volume: 26
State: Published - 2024
Externally published: Yes

Keywords

  • Knowledge Distillation
  • Multi-modal Foundation Model
  • Out-of-Distribution Generalization
  • Visual Question Answering
