
Comprehensive Linguistic-Visual Composition Network for Image Retrieval

  • Haokun Wen
  • Xuemeng Song
  • Xin Yang
  • Yibing Zhan
  • Liqiang Nie
  • Shandong University
  • JD Explore Academy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Composing text and image for image retrieval (CTI-IR) is a new yet challenging task in which the input query is not a conventional image or text but a composition, i.e., a reference image and its corresponding modification text. The key to CTI-IR lies in how to properly compose the multi-modal query to retrieve the target image. Pioneering studies mainly focus on composing the text with either the local visual descriptor or the global feature of the reference image. However, they overlook the fact that text modifications are diverse, ranging from concrete attribute changes, like "change it to long sleeves", to abstract visual property adjustments, e.g., "change the style to professional". Thus, simply emphasizing either the local or the global feature of the reference image for the query composition is insufficient. In light of the above analysis, we propose a Comprehensive Linguistic-Visual Composition Network (CLVC-Net) for image retrieval. The core of CLVC-Net is two composition modules, a fine-grained local-wise composition module and a fine-grained global-wise composition module, which together target comprehensive multi-modal compositions. Additionally, a mutual enhancement module is designed to promote the local-wise and global-wise composition processes by forcing them to share knowledge with each other. Extensive experiments conducted on three real-world datasets demonstrate the superiority of our CLVC-Net. We have released the code to benefit other researchers.
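
The abstract does not say how the mutual enhancement module makes the two branches share knowledge. As a minimal sketch only, the snippet below assumes a mutual-learning style objective: each branch scores the same candidate images, the scores are turned into distributions, and a symmetric KL divergence penalizes disagreement between them. The function names, the temperature parameter, and the example scores are all hypothetical and are not taken from the paper or its released code.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw query-to-candidate similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_enhancement_loss(local_scores, global_scores, temperature=1.0):
    """Symmetric KL between the two branches' ranking distributions.

    Hypothetical stand-in for "sharing knowledge": each branch is pulled
    toward the other's ranking of the candidate images.
    """
    p_local = softmax(local_scores, temperature)
    p_global = softmax(global_scores, temperature)
    return kl_divergence(p_local, p_global) + kl_divergence(p_global, p_local)


# Made-up similarity scores of one composed query against four candidate images,
# as produced by an assumed local-wise branch and an assumed global-wise branch.
local_scores = [2.0, 0.5, 0.1, -1.0]
global_scores = [1.5, 1.2, 0.0, -0.5]
loss = mutual_enhancement_loss(local_scores, global_scores)
```

In such a setup the term would be added to each branch's own retrieval loss, so that the branches stay distinct while still being pushed toward agreement. The loss is zero only when both branches rank the candidates with identical distributions.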

Original language: English
Title of host publication: SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 1369-1378
Number of pages: 10
ISBN (Electronic): 9781450380379
State: Published - 11 Jul 2021
Externally published: Yes
Event: 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021 - Virtual, Online, Canada
Duration: 11 Jul 2021 → 15 Jul 2021

Publication series

Name: SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021
Country/Territory: Canada
City: Virtual, Online
Period: 11/07/21 → 15/07/21

Keywords

  • image retrieval
  • linguistic-visual composition
  • mutual learning
