Skip to main navigation Skip to search Skip to main content

Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation

  • Harbin Institute of Technology
  • Harbin Institute of Technology Shenzhen

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Visual information has been introduced for enhancing machine translation (MT), and its effectiveness heavily relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we propose a stable diffusion-based imagination network integrated into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing multimodal MT. Particularly, we build heuristic feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence without the supervision of visual information, which breaks the high-cost bottleneck of image annotation in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT, especially achieving an average improvement of more than 12 BLEU points on Multi30K and MSCOCO multimodal MT benchmarks.

Original languageEnglish
Title of host publicationLong Papers
EditorsWanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
PublisherAssociation for Computational Linguistics (ACL)
Pages26567-26583
Number of pages17
ISBN (Electronic)9798891762510
DOIs
StatePublished - 2025
Event63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 - Vienna, Austria
Duration: 27 Jul 20251 Aug 2025

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
Volume1
ISSN (Print)0736-587X

Conference

Conference63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Country/TerritoryAustria
CityVienna
Period27/07/251/08/25

Fingerprint

Dive into the research topics of 'Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation'. Together they form a unique fingerprint.

Cite this