
MVF-PointCLIP: Training-free multi-view fusion PointCLIP for zero-shot 3D classification

Jiuqian Dai, Zhenyan Ji*, Zechang Xiong, Guiping Zhu, Hui Liu, Shen Yin, Jose Enrique Armendariz-Inigo

*Corresponding author for this work

Beijing Jiaotong University; Norwegian University of Science and Technology; Public University of Navarre

Research output: Contribution to journal › Article › peer-review

Abstract

The remarkable success of Contrastive Language-Image Pretraining (CLIP) in zero-shot 2D vision classification has inspired researchers to explore its potential for zero-shot 3D classification. Some researchers project 3D point clouds into 2D images from multiple views to leverage CLIP. However, through experiments we find that this approach suffers from two critical drawbacks: (1) noisy views in the multi-view depth maps, which provide limited information and may mislead classification; and (2) covariance inconsistencies between sample views, which can cause misclassification when cosine similarity is used. To address these issues, we propose a training-free Multi-View Fusion PointCLIP (MVF-PointCLIP). It comprises two modules of our design: a Spatial and Frequency Attention (SFA) module and a Mahalanobis Distance module. The SFA module automatically assigns importance weights to views, effectively filtering out noisy information. The Mahalanobis Distance module models the distribution of views to tackle covariance inconsistencies. Experimental results verify the superiority of MVF-PointCLIP over state-of-the-art models in zero-shot classification on ModelNet10, ModelNet40, and ScanObjectNN.
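The abstract does not give implementation details of the Mahalanobis Distance module, but the intuition behind replacing cosine similarity can be illustrated with a minimal sketch (all names and the toy data below are hypothetical, not taken from the paper): Mahalanobis distance normalizes a feature's deviation by a class's covariance, so the same point can be near one class distribution and far from another even when both have the same mean, a distinction cosine similarity to a mean vector cannot make.

```python
import numpy as np

def mahalanobis_distance(x, mean, cov):
    # Distance of feature x from a class distribution N(mean, cov);
    # unlike cosine similarity, it accounts for per-class covariance.
    diff = x - mean
    # Small ridge term keeps the inverse numerically stable.
    inv_cov = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return float(np.sqrt(diff @ inv_cov @ diff))

# Two toy classes with identical means but different covariances:
mean = np.zeros(2)
cov_tight = np.array([[0.1, 0.0], [0.0, 0.1]])  # tightly clustered class
cov_wide = np.array([[2.0, 0.0], [0.0, 2.0]])   # widely spread class

x = np.array([0.5, 0.5])
d_tight = mahalanobis_distance(x, mean, cov_tight)
d_wide = mahalanobis_distance(x, mean, cov_wide)
# The same point is far from the tight class but close to the wide one,
# even though its cosine similarity to both class means is identical.
```

This covariance-aware behavior is what lets a distribution-based distance resolve the "covariance inconsistency between sample views" that the abstract identifies as a failure mode of plain cosine similarity.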

Original language: English
Article number: 131188
Journal: Neurocomputing
Volume: 653
State: Published - 7 Nov 2025
Externally published: Yes

Keywords

  • CLIP
  • Multi-view fusion
  • Point cloud
  • Training-free
  • Zero-shot
