Skip to main navigation Skip to search Skip to main content

Distilled Dual-Encoder Model for Vision-Language Understanding

  • Zekun Wang
  • , Wenhui Wang
  • , Haichao Zhu
  • , Ming Liu*
  • , Bing Qin
  • , Furu Wei
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

On vision-language understanding (VLU) tasks, fusion-encoder vision-language models achieve superior results but sacrifice efficiency because of the simultaneous encoding of images and text. On the contrary, the dual-encoder model that separately encodes images and text has the advantage in efficiency, while failing on VLU tasks due to the lack of deep cross-modal interactions. To get the best of both worlds, we propose DIDE, a framework that distills the knowledge of the fusion-encoder teacher model into the dual-encoder student model. Since the cross-modal interaction is the key to the superior performance of teacher model but is absent in the student model, we encourage the student not only to mimic the predictions of teacher, but also to calculate the cross-modal attention distributions and align with the teacher. Experimental results demonstrate that DIDE is competitive with the fusion-encoder teacher model in performance (only a 1% drop) while enjoying 4× faster inference. Further analyses reveal that the proposed cross-modal attention distillation is crucial to the success of our framework.

Original languageEnglish
Title of host publicationProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
EditorsYoav Goldberg, Zornitsa Kozareva, Yue Zhang
PublisherAssociation for Computational Linguistics (ACL)
Pages8901-8913
Number of pages13
ISBN (Electronic)9781959429401
DOIs
StatePublished - 2022
Event2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Hybrid, Abu Dhabi, United Arab Emirates
Duration: 7 Dec 202211 Dec 2022

Publication series

NameProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022

Conference

Conference2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country/TerritoryUnited Arab Emirates
CityHybrid, Abu Dhabi
Period7/12/2211/12/22

Fingerprint

Dive into the research topics of 'Distilled Dual-Encoder Model for Vision-Language Understanding'. Together they form a unique fingerprint.

Cite this