TY - GEN
T1 - Distilled Dual-Encoder Model for Vision-Language Understanding
AU - Wang, Zekun
AU - Wang, Wenhui
AU - Zhu, Haichao
AU - Liu, Ming
AU - Qin, Bing
AU - Wei, Furu
N1 - Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
AB - On vision-language understanding (VLU) tasks, fusion-encoder vision-language models achieve superior results but sacrifice efficiency because they encode images and text simultaneously. In contrast, dual-encoder models, which encode images and text separately, are more efficient but underperform on VLU tasks due to the lack of deep cross-modal interaction. To get the best of both worlds, we propose DIDE, a framework that distills the knowledge of a fusion-encoder teacher model into a dual-encoder student model. Since cross-modal interaction is the key to the teacher's superior performance but is absent in the student, we encourage the student not only to mimic the teacher's predictions but also to compute cross-modal attention distributions and align them with the teacher's. Experimental results demonstrate that DIDE is competitive with the fusion-encoder teacher in performance (only a 1% drop) while enjoying 4× faster inference. Further analyses reveal that the proposed cross-modal attention distillation is crucial to the success of our framework.
UR - https://www.scopus.com/pages/publications/85149441327
DO - 10.18653/v1/2022.emnlp-main.608
M3 - Conference contribution
AN - SCOPUS:85149441327
T3 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
SP - 8901
EP - 8913
BT - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
A2 - Goldberg, Yoav
A2 - Kozareva, Zornitsa
A2 - Zhang, Yue
PB - Association for Computational Linguistics (ACL)
T2 - 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Y2 - 7 December 2022 through 11 December 2022
ER -