
CKDH: CLIP-Based Knowledge Distillation Hashing for Cross-Modal Retrieval

  • Jiaxing Li
  • Wai Keung Wong*
  • Lin Jiang
  • Xiaozhao Fang
  • Shengli Xie
  • Yong Xu

*Corresponding author for this work

  • Guangzhou University
  • Hong Kong Polytechnic University
  • Laboratory for Artificial Intelligence in Design
  • Guangdong University of Technology
  • Ministry of Education of the People's Republic of China
  • Harbin Institute of Technology Shenzhen
  • Pengcheng Laboratory

Research output: Contribution to journal › Article › peer-review

Abstract

Recently, deep hashing-based cross-modal retrieval has attracted much attention from researchers due to its advantages of fast retrieval and low storage overhead. However, existing deep hashing-based cross-modal retrieval methods typically 1) inadequately capture the semantic relevance and coexistent information of cross-modal data, which may result in sub-optimal retrieval performance, 2) require a more comprehensive similarity measurement for cross-modal features to ensure high retrieval accuracy, and 3) lack scalability for lightweight deployment. To address these issues, we propose CLIP-based knowledge distillation hashing (CKDH) for cross-modal retrieval, following the research trend of combining traditional methods with modern neural architectures to design lightweight networks based on large language models. Specifically, to effectively capture the semantic relevance and coexistent information, CLIP is fine-tuned to extract visual features, while a graph attention network enhances the textual features extracted by a bag-of-words model in the teacher model. Then, to better supervise the training of the student model, a more comprehensive similarity measurement represents the distilled knowledge by jointly preserving the log-likelihood and the intra- and inter-modality similarities. Finally, the student model extracts deep features with a lightweight network and generates hash codes under the supervision of the similarity matrix produced by the teacher model. Experimental results on three widely used datasets demonstrate that CKDH consistently outperforms state-of-the-art methods.
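The teacher-to-student supervision described in the abstract — a teacher model producing a similarity matrix that guides a lightweight student's hash codes — can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual formulation: the fusion of modalities, the mean-squared distillation loss, and all function names here are assumptions for exposition.

```python
import numpy as np

def _l2norm(x):
    """Row-wise L2 normalization."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def teacher_similarity(img_feat, txt_feat):
    """Hypothetical teacher similarity matrix: cosine similarity of
    fused (image + text) features, one row per sample."""
    fused = _l2norm(_l2norm(img_feat) + _l2norm(txt_feat))
    return fused @ fused.T  # n x n matrix, values in [-1, 1]

def distillation_loss(student_codes, S, k):
    """Align scaled inner products of the student's relaxed hash codes
    (e.g., tanh outputs in [-1, 1]) with the teacher's similarities S.
    k is the hash code length; a simple MSE stands in for the paper's
    more comprehensive similarity measurement."""
    pred = (student_codes @ student_codes.T) / k
    return np.mean((pred - S) ** 2)
```

In a training loop, `S` would be computed once (or periodically) by the frozen teacher, while `distillation_loss` is minimized over the student's parameters; binarizing with `np.sign(student_codes)` at inference yields the final hash codes.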

Original language: English
Pages (from-to): 6530-6541
Number of pages: 12
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 7
DOIs
State: Published - 2024
Externally published: Yes

Keywords

  • Cross-modal retrieval
  • contrastive language-image pre-training
  • deep hashing
  • knowledge distillation
