Text to Image Generation with Bidirectional Multiway Transformers

  • Hangbo Bao
  • Li Dong
  • Songhao Piao*
  • Furu Wei
  • *Corresponding author for this work
  • Harbin Institute of Technology
  • Microsoft USA

Research output: Contribution to journal, Article, peer-review

Abstract

In this study, we explore the potential of Multiway Transformers for text-to-image generation to achieve performance improvements through a concise and efficient decoupled model design and the inference efficiency provided by bidirectional encoding. We propose a method for improving the image tokenizer using pretrained Vision Transformers. Next, we employ bidirectional Multiway Transformers to restore the masked visual tokens combined with the unmasked text tokens. On the MS-COCO benchmark, our Multiway Transformers outperform vanilla Transformers, achieving superior FID scores and confirming the efficacy of the modality-specific parameter computation design. Ablation studies reveal that the fusion of visual and text tokens in bidirectional encoding contributes to improved model performance. Additionally, our proposed tokenizer outperforms VQGAN in image reconstruction quality and enhances the text-to-image generation results. By incorporating the additional CC-3M dataset for intermediate finetuning on our model with 688M parameters, we achieve competitive results with a finetuned FID score of 4.98 on MS-COCO.

Original language: English
Pages (from-to): 405-422
Number of pages: 18
Journal: Computational Visual Media
Volume: 11
Issue number: 2
State: Published - 2025

Keywords

  • Transformer
  • VQ-VAE
  • generative models
  • text to image generation
