Skip to main navigation Skip to search Skip to main content

VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

  • Hangbo Bao
  • , Wenhui Wang
  • , Li Dong
  • , Qiang Liu
  • , Owais Khan Mohammed
  • , Kriti Aggarwal
  • , Subhojit Som
  • , Songhao Piao
  • , Furu Wei
  • Harbin Institute of Technology
  • Microsoft USA

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

We present a unified Vision-Language pretrained Model (VLMO) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce Multiway Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of Multiway Transformer, pretrained VLMO can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMO achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. The code and pretrained models are available at http://aka.ms/vlmo.

Original languageEnglish
Title of host publicationAdvances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
EditorsS. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh
PublisherNeural information processing systems foundation
ISBN (Electronic)9781713871088
StatePublished - 2022
Event36th Conference on Neural Information Processing Systems, NeurIPS 2022 - New Orleans, United States
Duration: 28 Nov 20229 Dec 2022

Publication series

NameAdvances in Neural Information Processing Systems
Volume35
ISSN (Print)1049-5258

Conference

Conference36th Conference on Neural Information Processing Systems, NeurIPS 2022
Country/TerritoryUnited States
CityNew Orleans
Period28/11/229/12/22

Fingerprint

Dive into the research topics of 'VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts'. Together they form a unique fingerprint.

Cite this