Abstract
The remarkable success of Contrastive Language-Image Pretraining (CLIP) in zero-shot 2D vision classification has inspired researchers to explore its application to zero-shot 3D classification. One line of work projects 3D point clouds into 2D images from multiple views so that CLIP can be applied directly. However, our experiments reveal that this approach suffers from two critical drawbacks: (1) noisy views in the multi-view depth maps, which carry little information and can mislead classification; and (2) covariance inconsistencies between sample views, which can cause misclassification when cosine similarity is used. To address these issues, we propose a training-free MultiView Fusion PointCLIP (MVF-PointCLIP), which comprises two modules of our design: a Spatial and Frequency Attention (SFA) module and a Mahalanobis Distance module. The SFA module automatically assigns importance weights to views, effectively filtering out noisy information, while the Mahalanobis Distance module models the distribution of views to resolve covariance inconsistencies. Experimental results on ModelNet10, ModelNet40, and ScanObjectNN verify that MVF-PointCLIP outperforms state-of-the-art models in zero-shot classification.
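To make the covariance argument concrete, the sketch below is a minimal NumPy illustration of the general idea, not the authors' implementation: the function name `mahalanobis_scores`, the feature shapes, and the regularization term `eps` are all our assumptions. Unlike cosine similarity, which weights every feature dimension equally, a Mahalanobis score accounts for the covariance of an object's view features before comparing them with class text embeddings.

```python
import numpy as np

def mahalanobis_scores(view_feats, class_feats, eps=1e-5):
    """Score each class by Mahalanobis distance from the view distribution.

    view_feats:  (V, D) image features of V depth-map views of one object.
    class_feats: (C, D) text features, one per candidate class.
    Returns (C,) negative squared distances (higher = better match).
    """
    mu = view_feats.mean(axis=0)                    # mean of the view distribution
    cov = np.cov(view_feats, rowvar=False)          # (D, D) view covariance
    # With few views, cov is rank-deficient; regularize before inverting.
    cov_inv = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    diff = class_feats - mu                         # (C, D)
    # Squared Mahalanobis distance per class: diag(diff @ cov_inv @ diff.T)
    d2 = np.einsum('cd,de,ce->c', diff, cov_inv, diff)
    return -d2                                      # negate so larger = closer

# Hypothetical usage: 10 views, 512-dim features, 40 ModelNet40 classes.
rng = np.random.default_rng(0)
views = rng.normal(size=(10, 512))
texts = rng.normal(size=(40, 512))
pred = int(np.argmax(mahalanobis_scores(views, texts)))
```

Under this formulation, two objects whose view features have different covariance structures are scored against a class on comparable terms, which is the inconsistency that plain cosine similarity leaves unaddressed.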
| Original language | English |
|---|---|
| Article number | 131188 |
| Journal | Neurocomputing |
| Volume | 653 |
| DOIs | |
| State | Published - 7 Nov 2025 |
| Externally published | Yes |
Keywords
- CLIP
- Multi-view fusion
- Point cloud
- Training-free
- Zero-shot