CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning

  • Bo Wang
  • Zhao Zhang*
  • Suiyi Zhao
  • Haijun Zhang*
  • Richang Hong
  • Meng Wang
  • *Corresponding author for this work
  • Hefei University of Technology
  • Harbin Institute of Technology Shenzhen

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Transformer-based approaches to image captioning have shown great success by exploiting long-term dependency for visual embedding. However, this long-term dependency is coarse: it uses the multi-head self-attention mechanism to capture contextual interactions between visual tokens along the time step and/or the embedded dimension, and so fails to distinguish the fine-grained features of local partitions. As a result, similar features are captured repeatedly, and the resulting feature redundancy degrades performance. To address this issue, this paper proposes a novel image captioner that embeds visual cross-partition dependency, dubbed CropCap. Specifically, the visual sequence generated by the Swin Transformer-based pre-embedding network is fed into the proposed cross-partition dependency module, which models the interactions between partial representations at a finer granularity along both the time step and the embedded dimension. Furthermore, we derive the proposed cross-partition dependency formally and prove its correctness theoretically. Extensive comparisons on the benchmark MS-COCO dataset demonstrate the effectiveness of our method in addressing the information redundancy issue and verify its superior performance.

Original language: English
Title of host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 1750-1758
Number of pages: 9
ISBN (Electronic): 9798400701085
DOIs
State: Published - 27 Oct 2023
Externally published: Yes
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 – 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 – 3/11/23

Keywords

  • embedding cross-partition dependency
  • feature redundancy
  • image captioning
