TY - GEN
T1 - Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation
AU - Chen, Andong
AU - Song, Yuchen
AU - Chen, Kehai
AU - Bai, Xuefeng
AU - Yang, Muyun
AU - Nie, Liqiang
AU - Liu, Jie
AU - Zhao, Tiejun
AU - Zhang, Min
N1 - Publisher Copyright: © 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
AB - Visual information has been introduced for enhancing machine translation (MT), and its effectiveness heavily relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we propose a stable diffusion-based imagination network integrated into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing multimodal MT. Particularly, we build heuristic feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence without the supervision of visual information, which breaks the high-cost bottleneck of image annotation in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT, especially achieving an average improvement of more than 12 BLEU points on Multi30K and MSCOCO multimodal MT benchmarks.
UR - https://www.scopus.com/pages/publications/105021048207
U2 - 10.18653/v1/2025.acl-long.1289
DO - 10.18653/v1/2025.acl-long.1289
M3 - Conference contribution
AN - SCOPUS:105021048207
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 26567
EP - 26583
BT - Long Papers
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Y2 - 27 July 2025 through 1 August 2025
ER -