Abstract
Pretrained cross-modal models, with CLIP as the representative example, have recently driven a boom in cross-modal zero-shot learning owing to their strong generalization abilities. However, we experimentally discovered that CLIP suffers from text-to-image retrieval hallucination, which limits its capabilities under zero-shot learning. Specifically, in retrieval tasks, CLIP often assigns the highest score to an incorrect image, even when it correctly understands that image's semantic content in classification tasks. Accordingly, we propose the Balanced Score with Auxiliary Prompts (BSAP) method to address this problem. BSAP introduces auxiliary prompts that provide multiple reference outcomes for each image in a retrieval task; these outcomes, derived from each image and the target text, are normalized to compute a final similarity score, thereby reducing hallucination. We further combine the original results with BSAP's to produce a more robust hybrid outcome, termed BSAP-H. Extensive experiments on Referring Expression Comprehension (REC) and Referring Image Segmentation (RIS) tasks demonstrate that BSAP significantly improves the performance of CLIP and of state-of-the-art vision-language models (VLMs). Code is available at https://github.com/WangHanyao/BSAP.
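The abstract outlines the core scoring step: similarities between a candidate image and a set of texts (the target text plus auxiliary prompts) are normalized into a balanced score, so the auxiliary prompts act as references that calibrate the target's similarity. Below is a minimal sketch of that idea using the public `clip` package; the placeholder auxiliary prompts, the softmax normalization, and the `bsap_scores` helper are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the BSAP scoring idea described in the abstract.
# Assumptions: auxiliary prompts are given as plain strings, and the
# per-image normalization is a softmax over all prompts; the paper's
# actual prompt construction and normalization may differ.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def bsap_scores(images, target_text, auxiliary_prompts):
    """Return one balanced score per candidate image.

    images: batch of preprocessed image tensors, shape (N, 3, H, W).
    target_text: the query caption for retrieval.
    auxiliary_prompts: reference texts that contextualize the target.
    """
    texts = clip.tokenize([target_text] + auxiliary_prompts).to(device)
    with torch.no_grad():
        image_feats = model.encode_image(images.to(device))
        text_feats = model.encode_text(texts)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # Scaled cosine similarities: one row per image, one column per text.
    sims = model.logit_scale.exp() * image_feats @ text_feats.T
    # Normalize each image's similarities across all prompts, so the
    # auxiliary prompts serve as reference outcomes for the target score.
    probs = sims.softmax(dim=-1)  # shape (N, 1 + num_aux)
    return probs[:, 0]  # calibrated score of the target text per image

# The retrieved image is the one with the highest balanced score, e.g.:
# best = bsap_scores(images, "the red umbrella on the left", aux).argmax()
```

Per the abstract, BSAP-H would further blend this calibrated score with the original CLIP similarity; a simple weighted combination of the two scores is one plausible (assumed) way to form that hybrid.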
| Field | Value |
|---|---|
| Original language | English |
| Article number | 132640 |
| Journal | Neurocomputing |
| Volume | 671 |
| DOIs | |
| State | Published - 28 Mar 2026 |
| Externally published | Yes |
Keywords
- CLIP
- Pretrained cross-modal models
- Text-to-image retrieval hallucination
- Zero-shot learning