An Image Captioning Algorithm Based on Combination Attention Mechanism

  • Jinlong Liu*
  • Kangda Cheng
  • Haiyan Jin
  • Zhilu Wu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

As computer vision and natural language processing mature, image captioning has grown more ambitious, aiming in particular at longer, richer, and more accurate sentences as image descriptions. Most existing image captioning models use an encoder-decoder structure, and most of the best-performing models incorporate attention mechanisms into that structure. However, existing methods focus only on visual attention and neglect keyword attention, so the generated sentences are not rich or accurate enough, and errors in visual feature extraction can lead directly to incorrect captions. To fill this gap, we propose a combination attention module, which comprises a visual attention module and a keyword attention module. The visual attention module quickly extracts key local features, and the keyword attention module focuses on keywords that may appear in the generated sentence. The outputs of the two modules can correct each other. We embed the combination attention module into the Transformer framework to construct a new image captioning model, CAT (Combination Attention Transformer), which generates more accurate and richer captions. Extensive experiments on the MSCOCO dataset demonstrate the effectiveness of our method and its superiority over many state-of-the-art methods.
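
The idea of attending over image regions and over candidate keywords, then merging the two results, can be illustrated with a small sketch. This is not the authors' CAT implementation: the abstract does not say how the two modules are combined, so the function names (`attend`, `combination_attention`) and the scalar `gate` used for fusion are assumptions made purely for illustration, and standard scaled dot-product attention stands in for both modules.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def combination_attention(query, region_feats, keyword_embs, gate=0.5):
    """Hypothetical fusion: a fixed convex mix of the two context vectors.

    ASSUMPTION: the source does not specify the fusion rule; `gate` is
    an illustrative stand-in, not part of the published model.
    """
    visual_ctx, visual_w = attend(query, region_feats, region_feats)
    keyword_ctx, keyword_w = attend(query, keyword_embs, keyword_embs)
    fused = [gate * v + (1.0 - gate) * k for v, k in zip(visual_ctx, keyword_ctx)]
    return fused, visual_w, keyword_w
```

Calling `combination_attention(query, region_feats, keyword_embs)` with a decoder query vector, a list of region feature vectors, and a list of keyword embedding vectors of the same dimension returns the fused context along with both sets of attention weights. In a real model the fixed `gate` would more plausibly be replaced by learned parameters.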

Original language: English
Article number: 1397
Journal: Electronics (Switzerland)
Volume: 11
Issue number: 9
State: Published - 1 May 2022
Externally published: Yes

Keywords

  • attention mechanism
  • image caption
  • multimodal processing
