Recognition: no theorem link
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
Pith reviewed 2026-05-15 20:02 UTC · model grok-4.3
The pith
CLIPoint3D adapts a frozen CLIP model to few-shot 3D point cloud domain adaptation via depth-map projections and alignment losses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLIPoint3D is the first framework for few-shot unsupervised 3D point cloud domain adaptation built on CLIP; it projects points to multiple depth maps, applies knowledge-driven prompt tuning and entropy-guided sampling, then uses optimal transport alignment plus uncertainty-aware prototype alignment to close synthetic-to-real gaps, yielding 3-16% accuracy gains on PointDA-10 and GraspNetPC-10.
What carries the argument
Projection of 3D point clouds into multiple depth maps processed by a frozen CLIP backbone, refined by knowledge-driven prompt tuning that mixes language priors with cues from a lightweight 3D encoder, plus entropy-guided view sampling and optimal transport alignment losses.
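As a rough illustration of the first step in this pipeline, a minimal orthographic depth-map projection might look like the following sketch. The camera model, map resolution, and number of views are assumptions for illustration, not the paper's exact setup:

```python
import numpy as np

def point_cloud_to_depth_map(points, view_axis=2, resolution=32):
    """Orthographically project a point cloud onto a single depth map.

    points: (N, 3) array with coordinates in [-1, 1]^3; view_axis is the
    depth direction. Returns a (resolution, resolution) map holding, per
    pixel, the depth of the closest projected point (0 where empty).
    """
    plane_axes = [a for a in range(3) if a != view_axis]
    # Map the two in-plane coordinates from [-1, 1] to pixel indices.
    uv = ((points[:, plane_axes] + 1.0) / 2.0 * (resolution - 1)).astype(int)
    uv = np.clip(uv, 0, resolution - 1)
    # Flip so that near points get large values; keep the closest per pixel.
    depth = 1.0 - (points[:, view_axis] + 1.0) / 2.0
    depth_map = np.zeros((resolution, resolution))
    for (u, v), d in zip(uv, depth):
        depth_map[v, u] = max(depth_map[v, u], d)
    return depth_map

# Three axis-aligned "views" that a 2D backbone such as CLIP can consume.
rng = np.random.default_rng(0)
cloud = rng.uniform(-1, 1, size=(1024, 3))
views = [point_cloud_to_depth_map(cloud, axis) for axis in range(3)]
```

CLIPoint3D projects each sample to multiple such maps; the actual framework presumably uses more views and a higher resolution than this toy version.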
If this is right
- Domain adaptation for 3D data becomes possible with far less computation than retraining full encoders.
- Language priors from VLMs can substitute for some 3D-specific training signals during adaptation.
- Class boundaries remain intact when alignment losses are combined with prototype methods.
- The approach scales to real-world settings like robotics where synthetic data is abundant but real labels are scarce.
Where Pith is reading between the lines
- Similar depth map projections could be tested on other 3D representations such as meshes or voxels to check if the same gains appear.
- The framework hints that vision-language priors might reduce reliance on purely geometric 3D training for cross-domain tasks.
- A direct extension would measure performance when the number of shots drops to one or zero to see how far language guidance can stretch.
- This opens a route to apply the same idea to 3D segmentation or object detection without new large-scale 3D pre-training.
Load-bearing premise
Projecting 3D points to depth maps together with entropy-guided sampling and optimal transport alignment will reliably close the domain gap while keeping classes separable even when the CLIP backbone stays largely frozen.
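The optimal-transport half of this premise can be illustrated with a minimal Sinkhorn sketch. The uniform marginals, squared-Euclidean cost, and `reg`/`n_iters` values are illustrative assumptions; the paper's exact OT loss is not reproduced here:

```python
import numpy as np

def sinkhorn_transport(source, target, reg=0.1, n_iters=200):
    """Entropic optimal transport plan between two feature batches.

    Standard Sinkhorn iteration, shown only to illustrate the
    alignment mechanism between source and target distributions.
    """
    cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    cost = cost / cost.max()                     # normalize for stability
    K = np.exp(-cost / reg)                      # Gibbs kernel
    a = np.full(len(source), 1.0 / len(source))  # uniform source marginal
    b = np.full(len(target), 1.0 / len(target))  # uniform target marginal
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return plan, float((plan * cost).sum())      # plan and OT cost

rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, size=(16, 8))   # "synthetic-domain" features
tgt = rng.normal(0.5, 1.0, size=(16, 8))   # shifted "real-domain" features
plan, ot_cost = sinkhorn_transport(src, tgt)
```

Minimizing such a cost pulls the two feature distributions together; the open question flagged above is whether doing so preserves class separability when the backbone stays frozen.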
What would settle it
A benchmark test on scenes where depth map projections lose critical geometric detail, such as heavy occlusion, showing no accuracy gain or outright loss compared with standard baselines.
Original abstract
Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Project page: https://sarthakm320.github.io/CLIPoint3D.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built on CLIP. It projects point clouds to multiple depth maps, applies knowledge-driven prompt tuning that combines language priors with cues from a lightweight 3D encoder, uses entropy-guided view sampling, parameter-efficient fine-tuning, and combines optimal-transport alignment with uncertainty-aware prototype alignment to bridge source-target gaps while preserving class separability. Experiments on PointDA-10 and GraspNetPC-10 report consistent 3-16% accuracy gains over CLIP-based and conventional encoder-based baselines.
Significance. If the reported gains prove robust, the work would be significant as the first demonstration that a largely frozen CLIP backbone, augmented only by prompt tuning, depth-map projection, and standard alignment losses, can close synthetic-to-real 3D domain gaps in a few-shot unsupervised setting. This offers a parameter-efficient alternative to training heavy 3D encoders from scratch and opens a path for language-grounded 3D adaptation.
major comments (2)
- [Abstract and §4] The headline claim of 3-16% accuracy gains is presented without error bars, ablation tables, or statistical significance tests, making it impossible to judge whether the improvements are stable across random seeds or sensitive to post-hoc hyper-parameter choices such as the entropy threshold.
- [§3.2] The entropy-guided view sampling criterion is described only at a high level; it is unclear how the entropy threshold interacts with the number of projected views and whether it was tuned on the target domain, which would undermine the unsupervised claim.
minor comments (2)
- [§3.1] The precise architecture of the lightweight 3D encoder and the exact prompt-tuning parameters should be stated explicitly (e.g., the number of learnable tokens and which CLIP layers are updated).
- [Tables 1-2] Add a row or column reporting the number of trainable parameters and inference-time FLOPs relative to the baselines to substantiate the claimed efficiency advantage.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the experimental reporting and methodological details.
Point-by-point responses
-
Referee: [Abstract and §4] The headline claim of 3-16% accuracy gains is presented without error bars, ablation tables, or statistical significance tests, making it impossible to judge whether the improvements are stable across random seeds or sensitive to post-hoc hyper-parameter choices such as the entropy threshold.
Authors: We agree that the current presentation would benefit from greater statistical rigor. In the revised manuscript we will report all main results as means and standard deviations over five random seeds, include expanded ablation tables that vary the entropy threshold and number of views, and add paired t-tests (or equivalent) against the strongest baselines to establish significance. These additions will directly address concerns about stability and hyper-parameter sensitivity. revision: yes
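The reporting protocol the authors promise can be sketched in a few lines; the per-seed accuracies below are purely hypothetical numbers, not results from the paper:

```python
import statistics
from math import sqrt

# Hypothetical per-seed accuracies (illustrative only, not paper results).
ours     = [71.2, 70.8, 71.9, 70.5, 71.4]
baseline = [68.1, 68.9, 67.5, 68.4, 68.0]

mean_ours, sd_ours = statistics.mean(ours), statistics.stdev(ours)
mean_base, sd_base = statistics.mean(baseline), statistics.stdev(baseline)

# Paired t statistic over seeds: each seed fixes the same few-shot split
# for both methods, so the per-seed differences are naturally paired.
diffs = [o - b for o, b in zip(ours, baseline)]
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / sqrt(len(diffs)))

print(f"ours {mean_ours:.2f}±{sd_ours:.2f} vs "
      f"baseline {mean_base:.2f}±{sd_base:.2f}, paired t={t_stat:.2f}")
```

With only five seeds, the paired test keeps the comparison honest even when per-seed variance is large relative to the gap.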
-
Referee: [§3.2] The entropy-guided view sampling criterion is described only at a high level; it is unclear how the entropy threshold interacts with the number of projected views and whether it was tuned on the target domain, which would undermine the unsupervised claim.
Authors: We will expand §3.2 with the precise entropy formula, the fixed projection count (eight depth maps per sample), and the source-only procedure used to set the threshold. The threshold is chosen on source validation data to retain a minimum number of low-entropy views and is never adjusted using target samples or labels. Revised text will include pseudocode and an explicit statement confirming that no target-domain information influences the sampling, thereby preserving the unsupervised protocol. revision: yes
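One plausible reading of the sampling step, sketched under the assumption that each projected view yields class logits and that the k lowest-entropy (most confident) views are retained. The fixed-k rule here stands in for the paper's source-validated threshold:

```python
import numpy as np

def view_entropy(logits):
    """Shannon entropy of the softmax class distribution for one view."""
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def select_confident_views(view_logits, keep=4):
    """Keep the `keep` lowest-entropy projections of a sample.

    Hypothetical instantiation of entropy-guided sampling; the paper
    instead sets an entropy threshold on source validation data.
    """
    entropies = [view_entropy(l) for l in view_logits]
    order = np.argsort(entropies)      # ascending: most confident first
    return sorted(order[:keep].tolist())

rng = np.random.default_rng(2)
# Eight views x ten classes; larger logit scale -> sharper, lower-entropy
# predictions, mimicking views where the object is clearly visible.
logits = [rng.normal(scale=s, size=10) for s in (0.1, 5, 0.2, 4, 0.3, 6, 0.1, 3)]
chosen = select_confident_views(logits, keep=4)
```

Because selection depends only on the model's own predictive entropy, no target labels enter the procedure, which is what preserves the unsupervised protocol.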
Circularity Check
No significant circularity; minor self-citation not load-bearing
full rationale
The derivation chain relies on standard components (depth-map projection of point clouds, entropy-guided view sampling, prompt tuning of frozen CLIP, optimal transport alignment, and uncertainty-aware prototype alignment) whose loss formulations and motivations are independent of the final reported accuracy gains. No equation reduces a prediction to a fitted input by construction, no uniqueness theorem is imported from the same authors to force the architecture, and no ansatz is smuggled via self-citation. The 3-16% gains are presented as empirical outcomes on standard benchmarks rather than tautological re-statements of the alignment objectives. A score of 2 accounts for routine self-citation of prior CLIP adaptation work that is not load-bearing for the central claim.
Axiom & Free-Parameter Ledger
free parameters (2)
- prompt tuning parameters
- entropy threshold for view sampling
axioms (2)
- domain assumption: Depth-map projections preserve sufficient geometric information for CLIP to reason about 3D objects.
- domain assumption: Optimal transport and uncertainty-aware prototype alignment close distribution gaps without destroying class separability.
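The prototype half of the second assumption admits a simple sketch; the entropy-based weighting below is a hypothetical instantiation, not the paper's exact formulation:

```python
import numpy as np

def uncertainty_weighted_prototypes(features, probs):
    """Class prototypes as confidence-weighted feature means.

    features: (N, D) embeddings; probs: (N, C) predicted class
    probabilities. Each sample contributes to its predicted class,
    weighted by 1 - normalized entropy, so confident samples count more.
    """
    n, c = probs.shape
    entropy = -(probs * np.log(probs + 1e-12)).sum(1) / np.log(c)
    weight = 1.0 - entropy                      # in [0, 1]
    labels = probs.argmax(1)
    protos = np.zeros((c, features.shape[1]))
    for k in range(c):
        mask = labels == k
        if mask.any():
            w = weight[mask]
            protos[k] = (w[:, None] * features[mask]).sum(0) / (w.sum() + 1e-12)
    return protos

rng = np.random.default_rng(3)
feats = rng.normal(size=(20, 8))
raw = rng.random((20, 3))
probs = raw / raw.sum(1, keepdims=True)         # toy predicted probabilities
protos = uncertainty_weighted_prototypes(feats, probs)
```

Down-weighting uncertain target samples when forming prototypes is one way alignment can avoid dragging class centroids toward noisy pseudo-labels, which is exactly the separability concern the axiom encodes.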
Forward citations
Cited by 1 Pith paper
-
BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs
BioVLM achieves state-of-the-art cross-modality generalization on biomedical VLMs by learning a prompt bank and routing inputs to the most discriminative prompts via low-entropy selection plus LLM distillation.