
TPTE: Text-Guided Patch Token Exploitation for Unsupervised Fine-Grained Representation Learning

  • Shunan Mao
  • Hao Chen
  • Yaowei Wang
  • Wei Zeng
  • Shiliang Zhang*
  • *Corresponding author for this work
  • Peking University
  • Peng Cheng Laboratory

Research output: Contribution to journal, Article, peer-review

Abstract

Recent advances in pre-trained vision-language models have boosted the performance of unsupervised image representations in many vision tasks. Most existing works focus on learning global visual features with Transformers and neglect detailed local cues, leading to suboptimal performance in fine-grained vision tasks. In this article, we propose a text-guided patch token exploitation framework that enhances the discriminative power of unsupervised representations by exploiting more detailed local features. Our text-guided decoder extracts local features under the guidance of texts or learned prompts describing discriminative object parts. We also introduce a local-global relation distillation loss to promote the joint optimization of local and global features. The proposed method allows flexible extraction of either global or global-local features as the image representation. It significantly outperforms previous methods in fine-grained image retrieval and base-to-new fine-grained classification tasks. For instance, its Recall@1 surpasses that of the recent unsupervised retrieval method STML by 6.0% on the SOP dataset. The code is publicly available at https://github.com/maosnhehe/TPTE.
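
For readers unfamiliar with the Recall@1 figure quoted in the abstract, the following is a minimal sketch of how Recall@K is commonly computed for image retrieval. It is not taken from the TPTE code base: the function name `recall_at_k`, the use of cosine similarity, and the toy embeddings are illustrative assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def recall_at_k(embeddings, labels, k=1):
    """Fraction of queries with at least one same-label item among their k nearest neighbours.

    Hypothetical helper for illustration only. Assumes every embedding is
    non-zero and that every image serves as a query against all other images.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    # L2-normalise so that the dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    # A query must not retrieve itself.
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k most similar items for each query.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (labels[topk] == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example: two classes, two images each (made-up 2-D embeddings).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
print(recall_at_k(toy_embeddings, toy_labels, k=1))
```

In this toy case every image's nearest neighbour shares its label, so the score is 1.0. A reported gain such as 6.0% refers to the difference between two such scores on the same benchmark.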

Original language: English
Article number: 352
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 20
Issue number: 11
State: Published - 13 Nov 2024
Externally published: Yes

Keywords

  • Cross modal
  • Fine-grained
  • Image Retrieval
