
Toward Multi-Modal Conditioned Fashion Image Translation

  • Xiaoling Gu
  • Jun Yu*
  • Yongkang Wong
  • Mohan S. Kankanhalli

  *Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Having the capability to synthesize photo-realistic fashion product images conditioned on multiple attributes or modalities would enable many exciting new applications. In this work, we propose an end-to-end network architecture, built upon a new generative adversarial network, for automatically synthesizing photo-realistic images of fashion products under multiple conditions. Given an input pose image consisting of a 2D skeleton pose, together with a sentence description of the products, our model synthesizes a fashion image that preserves the same pose and wears the fashion products described by the text. Specifically, the generator G tries to generate realistic-looking fashion images conditioned on a ⟨pose, text⟩ pair to fool the discriminator. An attention network enhances the generator by predicting a probability map that indicates which parts of the image need to be attended to for translation. In contrast, the discriminator D distinguishes real images from translated ones based on the input pose image and text description; it is divided into two multi-scale sub-discriminators to improve this distinguishing task. Quantitative and qualitative analysis demonstrates that our method synthesizes realistic images that retain the poses of the given images while matching the semantics of the provided sentence descriptions.
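The components the abstract names — a generator conditioned on a ⟨pose, text⟩ pair, an attention network predicting a probability map over regions to translate, and a discriminator split into two multi-scale sub-discriminators — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all layer widths, the attention gating formula, and the omission of text conditioning in the discriminator are assumptions for brevity.

```python
# Hypothetical PyTorch sketch of the <pose, text>-conditioned generator with
# an attention map, plus a two-scale discriminator. Channel sizes and depths
# are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class AttnGenerator(nn.Module):
    """Maps a pose image plus a text embedding to an output image, and
    predicts an attention map A in (0, 1) gating which regions are
    translated (a common formulation: out = A*raw + (1-A)*input)."""

    def __init__(self, pose_ch=3, text_dim=128, hidden=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(pose_ch + text_dim, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_rgb = nn.Conv2d(hidden, 3, 3, padding=1)   # raw translation
        self.to_attn = nn.Conv2d(hidden, 1, 3, padding=1)  # attention logits

    def forward(self, pose, text_emb):
        # Broadcast the text embedding over all spatial positions, concat.
        b, _, h, w = pose.shape
        t = text_emb[:, :, None, None].expand(b, -1, h, w)
        feats = self.encode(torch.cat([pose, t], dim=1))
        raw = torch.tanh(self.to_rgb(feats))
        attn = torch.sigmoid(self.to_attn(feats))  # probability map
        # Translate attended regions; pass the rest of the input through.
        out = attn * raw + (1 - attn) * pose
        return out, attn


class MultiScaleDiscriminator(nn.Module):
    """Two sub-discriminators at full and half resolution, each scoring
    real vs. translated conditioned on the (image, pose) pair."""

    def __init__(self, in_ch=6, hidden=64):
        super().__init__()

        def sub():
            return nn.Sequential(
                nn.Conv2d(in_ch, hidden, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(hidden, 1, 4, stride=2, padding=1),  # patch scores
            )

        self.d_full = sub()
        self.d_half = sub()
        self.down = nn.AvgPool2d(2)

    def forward(self, image, pose):
        x = torch.cat([image, pose], dim=1)
        return self.d_full(x), self.d_half(self.down(x))


# Smoke test with random tensors standing in for pose maps and text embeddings.
pose = torch.randn(2, 3, 64, 64)
text = torch.randn(2, 128)
G = AttnGenerator()
D = MultiScaleDiscriminator()
fake, attn = G(pose, text)
scores_full, scores_half = D(fake, pose)
```

In this gated formulation the attention map lets the generator leave unattended regions of the input untouched, which matches the abstract's description of a probability map indicating "which part of the image needs to be attended for translation".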

Original language: English
Article number: 9141513
Pages (from-to): 2361-2371
Number of pages: 11
Journal: IEEE Transactions on Multimedia
Volume: 23
DOIs
State: Published - 2021
Externally published: Yes

Keywords

  • Generative adversarial network
  • fashion image synthesis
  • image-to-image translation
