Skip to main navigation Skip to search Skip to main content

Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

  • Harbin Institute of Technology Shenzhen
  • Guangdong University of Technology
  • Minjiang University

Research output: Contribution to journalConference articlepeer-review

Abstract

Large-scale pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities in image recognition tasks. Recent approaches typically employ supervised fine-tuning methods to adapt CLIP for zero-shot multi-label image recognition tasks. However, obtaining sufficient multi-label annotated image data for training is challenging and not scalable. In this paper, we propose a new language-driven framework for zero-shot multi-label recognition that eliminates the need for annotated images during training. Leveraging the aligned CLIP multi-modal embedding space, our method utilizes language data generated by LLMs to train a cross-modal classifier, which is subsequently transferred to the visual modality. During inference, directly applying the classifier to visual inputs may limit performance due to the modality gap. To address this issue, we introduce a cross-modal mapping method that maps image embeddings to the language modality while retaining crucial visual information. Comprehensive experiments demonstrate that our method outperforms other zero-shot multi-label recognition methods and achieves competitive results compared to few-shot methods.

Original languageEnglish
Pages (from-to)32173-32183
Number of pages11
JournalProceedings of Machine Learning Research
Volume235
StatePublished - 2024
Externally publishedYes
Event41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 202427 Jul 2024

Fingerprint

Dive into the research topics of 'Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition'. Together they form a unique fingerprint.

Cite this