
Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering

  • Zhou Yu
  • Zitian Jin
  • Jun Yu
  • Mingliang Xu
  • Hongbo Wang*
  • Jianping Fan

*Corresponding author for this work

  • Hangzhou Dianzi University
  • Zhengzhou University
  • Lenovo

Research output: Contribution to journal › Article › peer-review

Abstract

Recent Transformer architectures (Vaswani et al., 2017) have brought remarkable improvements to visual question answering (VQA). Nevertheless, Transformer-based VQA models are usually deep and wide to guarantee good performance, so they can only run on powerful GPU servers and cannot run on capacity-restricted platforms such as mobile phones. Therefore, it is desirable to learn an elastic VQA model that supports adaptive pruning at runtime to meet the efficiency constraints of different platforms. To this end, we present the bilaterally slimmable Transformer (BST), a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models to train a single model once and obtain various slimmed submodels of different widths and depths. To verify the effectiveness and generality of this method, we integrate the proposed BST framework with three typical Transformer-based VQA approaches, namely MCAN (Yu et al., 2019), UNITER (Chen et al., 2020), and CLIP-ViL (Shen et al., 2021), and conduct extensive experiments on two commonly used benchmark datasets. In particular, one slimmed MCAN_BST submodel achieves comparable accuracy on VQA-v2, while being 0.38× smaller in model size and having 0.27× fewer FLOPs than the reference MCAN model. The smallest MCAN_BST submodel has only 9 M parameters and 0.16 G FLOPs during inference, making it possible to deploy it on a mobile device with less than 60 ms latency.
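
The idea of taking slimmed submodels of different widths and depths from one set of shared weights can be illustrated with a toy sketch. This is not the BST implementation or its training strategy: it assumes a hypothetical stack of square ReLU layers rather than a Transformer, and the helper names (`make_layer`, `slim_forward`, `count_params`) and the rule of keeping the first channels and first layers are illustrative assumptions only.

```python
import random

def make_layer(d_in, d_out, rng):
    """Full-size weight matrix stored as d_out rows of d_in values (toy stand-in)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]

def sub_shape(layers, width_ratio, depth_ratio):
    """Width and depth of the submodel selected by the two ratios (assumed rule)."""
    width = max(1, int(len(layers[0]) * width_ratio))
    depth = max(1, int(len(layers) * depth_ratio))
    return width, depth

def slim_forward(layers, x, width_ratio, depth_ratio):
    """Run a submodel that reuses the first channels and first layers of the full model."""
    width, depth = sub_shape(layers, width_ratio, depth_ratio)
    h = x[:width]
    for w in layers[:depth]:
        # Slice the shared weight: only the top-left width x width block is used.
        h = [max(0.0, sum(w[i][j] * h[j] for j in range(width))) for i in range(width)]
    return h

def count_params(layers, width_ratio, depth_ratio):
    """Number of weights the selected submodel actually touches."""
    width, depth = sub_shape(layers, width_ratio, depth_ratio)
    return depth * width * width

rng = random.Random(0)
layers = [make_layer(8, 8, rng) for _ in range(6)]   # full model: width 8, depth 6
x = [1.0] * 8

full = slim_forward(layers, x, 1.0, 1.0)    # full width, full depth
small = slim_forward(layers, x, 0.5, 0.5)   # half width, half depth, same weights
full_params = count_params(layers, 1.0, 1.0)
small_params = count_params(layers, 0.5, 0.5)
```

In this toy setting, halving both width and depth leaves one eighth of the weights in use, which shows why slimming along both dimensions can reduce model size and FLOPs substantially; the actual ratios reported in the abstract come from the authors' experiments, not from this sketch.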

Original language: English
Pages (from-to): 9543-9556
Number of pages: 14
Journal: IEEE Transactions on Multimedia
Volume: 25
DOIs
State: Published - 2023
Externally published: Yes

Keywords

  • Visual question answering
  • efficient deep learning
  • multimodal learning
  • slimmable network
  • transformer
