Abstract
Recent Transformer architectures (Vaswani et al., 2017) have brought remarkable improvements to visual question answering (VQA). Nevertheless, Transformer-based VQA models are usually deep and wide to guarantee good performance, so they can only run on powerful GPU servers and cannot run on capacity-restricted platforms such as mobile phones. It is therefore desirable to learn an elastic VQA model that supports adaptive pruning at runtime to meet the efficiency constraints of different platforms. To this end, we present the bilaterally slimmable Transformer (BST), a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models, so that a single model is trained once and yields various slimmed submodels of different widths and depths. To verify the effectiveness and generality of this method, we integrate the proposed BST framework with three typical Transformer-based VQA approaches, namely MCAN (Yu et al., 2019), UNITER (Chen et al., 2020), and CLIP-ViL (Shen et al., 2021), and conduct extensive experiments on two commonly used benchmark datasets. In particular, one slimmed MCAN_BST submodel achieves comparable accuracy on VQA-v2, while being 0.38× smaller in model size and having 0.27× fewer FLOPs than the reference MCAN model. The smallest MCAN_BST submodel has only 9 M parameters and 0.16 G FLOPs during inference, making it possible to deploy it on a mobile device with less than 60 ms latency.
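To make the idea of one trained model yielding submodels of different widths and depths more concrete, the following is a minimal sketch under stated assumptions, not the BST implementation. The names `SlimmableLinear` and `SlimmableStack` are hypothetical, the layers are plain linear maps with ReLU rather than Transformer blocks, a narrower submodel is assumed to reuse the top-left slice of each shared weight matrix, and a shallower submodel is assumed to keep the first layers (the paper's own layer-selection strategy may differ).

```python
import math
import random


class SlimmableLinear:
    """Hypothetical linear layer: sub-widths reuse the top-left slice of one shared weight matrix."""

    def __init__(self, in_features, out_features, seed=0):
        rng = random.Random(seed)
        self.in_features = in_features
        self.out_features = out_features
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_features)]
                       for _ in range(out_features)]

    def forward(self, x, out_ratio=1.0):
        # The input width is inferred from x, so layers chain at any chosen width.
        n_in = len(x)
        n_out = max(1, math.ceil(self.out_features * out_ratio))
        return [sum(self.weight[i][j] * x[j] for j in range(n_in))
                for i in range(n_out)]


class SlimmableStack:
    """Hypothetical stack of layers that can be slimmed in width and in depth."""

    def __init__(self, dim, depth, seed=0):
        self.dim = dim
        self.layers = [SlimmableLinear(dim, dim, seed=seed + k) for k in range(depth)]

    def _sizes(self, width_ratio, depth_ratio):
        width = max(1, math.ceil(self.dim * width_ratio))
        n_layers = max(1, math.ceil(len(self.layers) * depth_ratio))
        return width, n_layers

    def forward(self, x, width_ratio=1.0, depth_ratio=1.0):
        width, n_layers = self._sizes(width_ratio, depth_ratio)
        h = x[:width]  # slim the input features to the chosen width
        # Assumption: a shallower submodel keeps the first n_layers layers.
        for layer in self.layers[:n_layers]:
            h = [max(0.0, v) for v in layer.forward(h, out_ratio=width_ratio)]  # ReLU
        return h

    def num_params(self, width_ratio=1.0, depth_ratio=1.0):
        # Weights actually used by the selected submodel (biases omitted).
        width, n_layers = self._sizes(width_ratio, depth_ratio)
        return n_layers * width * width


model = SlimmableStack(dim=8, depth=6)
x = [1.0] * 8
full = model.forward(x)                                      # full model
slim = model.forward(x, width_ratio=0.5, depth_ratio=0.5)    # half width, half depth
```

In this toy setting the full model uses 6 × 8 × 8 = 384 weights, while the half-width, half-depth submodel uses 3 × 4 × 4 = 48, with both drawn from the same stored parameters. This illustrates why slimming in both dimensions can shrink model size faster than slimming in only one.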
| Original language | English |
|---|---|
| Pages (from-to) | 9543-9556 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 25 |
| State | Published - 2023 |
| Externally published | Yes |
Keywords
- Visual question answering
- efficient deep learning
- multimodal learning
- slimmable network
- transformer
Title: Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering