pith. machine review for the scientific record.

arxiv: 2602.20409 · v2 · submitted 2026-02-23 · 💻 cs.CV · cs.LG

Recognition: no theorem link

CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:02 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords 3D point cloud domain adaptation · few-shot unsupervised learning · CLIP vision-language model · depth map projection · optimal transport alignment · prompt tuning · synthetic to real transfer

The pith

CLIPoint3D adapts a frozen CLIP model to few-shot 3D point cloud domain adaptation via depth map projections and alignment losses

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents CLIPoint3D as a way to adapt vision-language models like CLIP for 3D point cloud tasks when data comes from different domains and only a few labels are available. It converts each 3D sample into several 2D depth map views that the pre-trained CLIP image encoder can process directly. Lightweight prompt tuning adds language knowledge and geometric cues, while entropy-guided sampling picks the most useful views. Optimal transport and prototype alignment then match source and target distributions without losing the ability to separate classes. The result is steady accuracy gains over prior methods on standard benchmarks, which matters for practical 3D applications that cannot afford to train large new encoders from scratch.

Core claim

CLIPoint3D is the first framework for few-shot unsupervised 3D point cloud domain adaptation built on CLIP; it projects points to multiple depth maps, applies knowledge-driven prompt tuning and entropy-guided sampling, then uses optimal transport alignment plus uncertainty-aware prototype alignment to close synthetic-to-real gaps, yielding 3-16 percent accuracy gains on PointDA-10 and GraspNetPC-10.

What carries the argument

Projection of 3D point clouds into multiple depth maps processed by a frozen CLIP backbone, refined by knowledge-driven prompt tuning that mixes language priors with cues from a lightweight 3D encoder, plus entropy-guided view sampling and optimal transport alignment losses.
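
To make the projection step concrete, here is a minimal sketch of rendering a point cloud into multi-view orthographic depth maps. The paper's actual renderer, view poses, resolution, and normalization are not specified here, so every detail below is an assumption.

```python
import numpy as np

def depth_maps(points: np.ndarray, num_views: int = 10, res: int = 224) -> np.ndarray:
    """Render a point cloud of shape (N, 3) into `num_views` depth maps.

    A sketch of the idea only: evenly spaced azimuthal views, orthographic
    projection, and a max-scatter as a crude z-buffer.
    """
    # Normalize the cloud into [-1, 1]^3 so pixel coordinates are stable.
    points = points - points.mean(axis=0)
    points = points / (np.abs(points).max() + 1e-8)

    maps = []
    for k in range(num_views):
        theta = 2 * np.pi * k / num_views  # evenly spaced azimuth angles
        rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                        [0.0, 1.0, 0.0],
                        [-np.sin(theta), 0.0, np.cos(theta)]])
        p = points @ rot.T  # rotate the cloud into the view frame
        # Orthographic projection: (x, y) -> pixel grid, z -> depth in [0, 1].
        u = ((p[:, 0] + 1) / 2 * (res - 1)).astype(int)
        v = ((p[:, 1] + 1) / 2 * (res - 1)).astype(int)
        depth = (p[:, 2] + 1) / 2
        img = np.zeros((res, res))
        np.maximum.at(img, (v, u), depth)  # keep one depth per pixel (max scatter)
        maps.append(img)
    return np.stack(maps)  # shape (num_views, res, res)
```

Each map can then be replicated to three channels and fed to the frozen CLIP image encoder like any RGB image.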

If this is right

  • Domain adaptation for 3D data becomes possible with far less computation than retraining full encoders.
  • Language priors from VLMs can substitute for some 3D-specific training signals during adaptation.
  • Class boundaries remain intact when alignment losses are combined with prototype methods.
  • The approach scales to real-world settings like robotics where synthetic data is abundant but real labels are scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar depth map projections could be tested on other 3D representations such as meshes or voxels to check if the same gains appear.
  • The framework hints that vision-language priors might reduce reliance on purely geometric 3D training for cross-domain tasks.
  • A direct extension would measure performance when the number of shots drops to one or zero to see how far language guidance can stretch.
  • This opens a route to apply the same idea to 3D segmentation or object detection without new large-scale 3D pre-training.

Load-bearing premise

Projecting 3D points to depth maps together with entropy-guided sampling and optimal transport alignment will reliably close the domain gap while keeping classes separable even when the CLIP backbone stays largely frozen.
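
The optimal transport half of that premise can be illustrated with a generic entropic-regularized (Sinkhorn) alignment loss between batches of source and target features; the paper's exact cost function and regularization strength are not given here, so treat this as a textbook sketch rather than the authors' loss.

```python
import torch

def sinkhorn_ot_loss(fs: torch.Tensor, ft: torch.Tensor,
                     eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropic OT cost between source features fs (n, d) and target ft (m, d).

    A generic Sinkhorn sketch with uniform marginals, not necessarily the
    paper's exact alignment objective.
    """
    cost = torch.cdist(fs, ft) ** 2          # (n, m) squared-Euclidean cost
    cost = cost / (cost.max() + 1e-12)       # normalize for numerical stability
    K = torch.exp(-cost / eps)               # Gibbs kernel
    a = torch.full((fs.size(0),), 1.0 / fs.size(0), device=fs.device)
    b = torch.full((ft.size(0),), 1.0 / ft.size(0), device=ft.device)
    u = torch.ones_like(a)
    for _ in range(iters):                   # Sinkhorn fixed-point updates
        v = b / (K.t() @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    plan = u[:, None] * K * v[None, :]       # approximate transport plan (n, m)
    return (plan * cost).sum()               # transport cost under the plan
```

Minimizing such a loss pulls the two feature distributions together; the premise is that doing so, alongside the prototype loss, does not collapse class boundaries.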

What would settle it

A benchmark test on scenes where depth map projections lose critical geometric detail, such as heavy occlusion, showing no accuracy gain or outright loss compared with standard baselines.

Figures

Figures reproduced from arXiv: 2602.20409 by Biplab Banerjee, Elisa Ricci, Mainak Singha, Paolo Casari, Sarthak Mehrotra, Subhasis Chaudhuri.

Figure 1
Figure 1: Comparison of CLIPoint3D with SOTA methods on GraspNetPC-10. Encoder-based 3D UDA methods (e.g., PointDAN [45], GAST [75], MLSP [36]) are accurate but computationally expensive, while CLIP-based extensions fail to bridge the synthetic-real gap. CLIPoint3D achieves +16.4% improvement with minimal overhead.
Figure 2
Figure 2: Overview of CLIPoint3D, the first CLIP-based unsupervised 3D point cloud domain adaptation framework, comprising four key modules: (1) knowledge-driven prompt tuning generates LLM-guided textual and 3D-aware visual prompts; (2) parameter-efficient fine-tuning (PEFT) jointly optimizes these prompts and the encoder, while (3) entropy-based view selection filters unreliable projections; (4) dual objectives, unc…
Figure 3
Figure 3: (a) Effect of the number of labeled samples in Ds during training. (b) Effect of projected views: accuracy variation with projection count M. As shown in Figure 3b, increasing the number of 2D projections M enhances performance by enriching multi-view cues, with accuracy peaking at M=10, where additional views add redundancy with minimal gain.
original abstract

Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Project page: https://sarthakm320.github.io/CLIPoint3D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built on CLIP. It projects point clouds to multiple depth maps, applies knowledge-driven prompt tuning that combines language priors with cues from a lightweight 3D encoder, uses entropy-guided view sampling, parameter-efficient fine-tuning, and combines optimal-transport alignment with uncertainty-aware prototype alignment to bridge source-target gaps while preserving class separability. Experiments on PointDA-10 and GraspNetPC-10 report consistent 3-16% accuracy gains over CLIP-based and conventional encoder-based baselines.

Significance. If the reported gains prove robust, the work would be significant as the first demonstration that a largely frozen CLIP backbone, augmented only by prompt tuning, depth-map projection, and standard alignment losses, can close synthetic-to-real 3D domain gaps in a few-shot unsupervised setting. This offers a parameter-efficient alternative to training heavy 3D encoders from scratch and opens a path for language-grounded 3D adaptation.

major comments (2)
  1. [Abstract and §4] The headline claim of 3-16% accuracy gains is presented without error bars, ablation tables, or statistical significance tests. This absence makes it impossible to judge whether the improvements are stable across random seeds or sensitive to post-hoc hyper-parameter choices such as the entropy threshold.
  2. [§3.2] The entropy-guided view sampling criterion is described only at a high level; it is unclear how the entropy threshold interacts with the number of projected views and whether it was tuned on the target domain, which would undermine the unsupervised claim.
minor comments (2)
  1. [§3.1] The precise architecture of the lightweight 3D encoder and the exact prompt-tuning parameters should be stated explicitly (e.g., number of learnable tokens and which CLIP layers are updated).
  2. [Tables 1-2] Table 1 and Table 2 should add a row or column reporting the number of trainable parameters and inference-time FLOPs relative to the baselines to substantiate the efficiency advantage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the experimental reporting and methodological details.

point-by-point responses
  1. Referee: [Abstract and §4] The headline claim of 3-16% accuracy gains is presented without error bars, ablation tables, or statistical significance tests. This absence makes it impossible to judge whether the improvements are stable across random seeds or sensitive to post-hoc hyper-parameter choices such as the entropy threshold.

    Authors: We agree that the current presentation would benefit from greater statistical rigor. In the revised manuscript we will report all main results as means and standard deviations over five random seeds, include expanded ablation tables that vary the entropy threshold and number of views, and add paired t-tests (or equivalent) against the strongest baselines to establish significance. These additions will directly address concerns about stability and hyper-parameter sensitivity. revision: yes

  2. Referee: [§3.2] The entropy-guided view sampling criterion is described only at a high level; it is unclear how the entropy threshold interacts with the number of projected views and whether it was tuned on the target domain, which would undermine the unsupervised claim.

    Authors: We will expand §3.2 with the precise entropy formula, the fixed projection count (eight depth maps per sample), and the source-only procedure used to set the threshold. The threshold is chosen on source validation data to retain a minimum number of low-entropy views and is never adjusted using target samples or labels. Revised text will include pseudocode and an explicit statement confirming that no target-domain information influences the sampling, thereby preserving the unsupervised protocol. revision: yes
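
As a concrete reading of the criterion described above, a minimal entropy-guided view filter could look like the sketch below; the exact entropy formula, the threshold protocol, and the fallback behavior are assumptions, not the authors' published procedure.

```python
import torch
import torch.nn.functional as F

def select_confident_views(view_logits: torch.Tensor, threshold: float):
    """Keep projections whose prediction entropy falls below `threshold`.

    `view_logits` holds (M, C) class logits for the M depth-map views of
    one sample. An assumed sketch of the selection rule only.
    """
    probs = F.softmax(view_logits, dim=-1)                      # (M, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # (M,)
    keep = entropy < threshold
    if not keep.any():                 # fallback: retain the single best view
        keep = entropy == entropy.min()
    return view_logits[keep], keep
```

Under the rebuttal's protocol, `threshold` would be fixed on source validation data only, so this boolean mask never sees target labels.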

Circularity Check

0 steps flagged

No significant circularity; minor self-citation not load-bearing

full rationale

The derivation chain relies on standard components (depth-map projection of point clouds, entropy-guided view sampling, prompt tuning of frozen CLIP, optimal transport alignment, and uncertainty-aware prototype alignment) whose loss formulations and motivations are independent of the final reported accuracy gains. No equation reduces a prediction to a fitted input by construction, no uniqueness theorem is imported from the same authors to force the architecture, and no ansatz is smuggled via self-citation. The 3-16% gains are presented as empirical outcomes on standard benchmarks rather than tautological re-statements of the alignment objectives. A score of 2 accounts for routine self-citation of prior CLIP adaptation work that is not load-bearing for the central claim.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that CLIP's image encoder can be steered via prompt tuning on depth maps and that standard alignment losses suffice for 3D domain gaps; no new physical entities are introduced.

free parameters (2)
  • prompt tuning parameters
    Knowledge-driven prompt tuning integrates high-level language priors; exact number and initialization not specified in abstract.
  • entropy threshold for view sampling
    Entropy-guided view sampling selects confident projections; threshold value is a tunable hyper-parameter.
axioms (2)
  • domain assumption Depth-map projections preserve sufficient geometric information for CLIP to reason about 3D objects
    Central to the pipeline; invoked when 3D samples are turned into multiple depth maps for the frozen CLIP backbone.
  • domain assumption Optimal transport and uncertainty-aware prototype alignment close distribution gaps without destroying class separability
    Invoked in the description of the two collaborative losses.
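
A minimal sketch of the prototype half of the second assumption: per-class source prototypes pulled toward confidence-weighted target prototypes built from pseudo-labels. The specific weighting below is an illustrative assumption; the abstract does not detail the paper's uncertainty model.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(fs, ys, ft, target_logits, num_classes: int):
    """Align per-class feature prototypes across domains.

    fs (n, d): source features, ys (n,): source labels,
    ft (m, d): target features, target_logits (m, C): target predictions.
    A sketch of the idea; confidence weighting here stands in for the
    paper's uncertainty-aware scheme.
    """
    probs = F.softmax(target_logits, dim=-1)
    conf, pseudo = probs.max(dim=-1)        # confidence and pseudo-labels
    loss, used = fs.new_zeros(()), 0
    for c in range(num_classes):
        src_mask, tgt_mask = ys == c, pseudo == c
        if src_mask.any() and tgt_mask.any():
            proto_s = fs[src_mask].mean(0)                  # source prototype
            w = conf[tgt_mask] / conf[tgt_mask].sum()       # confidence weights
            proto_t = (w[:, None] * ft[tgt_mask]).sum(0)    # weighted target prototype
            loss = loss + (proto_s - proto_t).pow(2).sum()
            used += 1
    return loss / max(used, 1)
```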

pith-pipeline@v0.9.0 · 5560 in / 1469 out tokens · 41617 ms · 2026-05-15T20:02:37.911634+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    BioVLM achieves state-of-the-art cross-modality generalization on biomedical VLMs by learning a prompt bank and routing inputs to the most discriminative prompts via low-entropy selection plus LLM distillation.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

  2. [2]

    Self-supervised learning for domain adaptation on point clouds

    Idan Achituve, Haggai Maron, and Gal Chechik. Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 123–133, 2021.

  3. [3]

    Pre-train or annotate? Domain adaptation with a constrained budget

    Fan Bai, Alan Ritter, and Wei Xu. Pre-train or annotate? Domain adaptation with a constrained budget. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

  4. [4]

    Analysis of representations for domain adaptation

    Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19, 2006.

  5. [5]

    Challenges in fusion of heterogeneous point clouds

    Fabio Bracci, Martin Drauschke, Stefan Kühne, and Z-C Márton. Challenges in fusion of heterogeneous point clouds. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 42:155–162, 2018.

  6. [6]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

  7. [7]

    Canonical shape projection is all you need for 3D few-shot class incremental learning

    Ali Cheraghian, Zeeshan Hayder, Sameera Ramasinghe, Shafin Rahman, Javad Jafaryahya, Lars Petersson, and Mehrtash Harandi. Canonical shape projection is all you need for 3D few-shot class incremental learning. In European Conference on Computer Vision, pages 36–53. Springer, 2024.

  8. [8]

    ScanNet: Richly-annotated 3D reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

  9. [9]

    MVF-PointCLIP: Training-free multi-view fusion PointCLIP for zero-shot 3D classification

    Jiuqian Dai, Zhenyan Ji, Zechang Xiong, Guiping Zhu, Hui Liu, Shen Yin, and Jose Enrique Armendariz-Inigo. MVF-PointCLIP: Training-free multi-view fusion PointCLIP for zero-shot 3D classification. Neurocomputing, 653:131188, 2025.

  10. [10]

    Domain-agnostic mutual prompting for unsupervised domain adaptation

    Zhekai Du, Xinyao Li, Fengling Li, Ke Lu, Lei Zhu, and Jingjing Li. Domain-agnostic mutual prompting for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23375–23384, 2024.

  11. [11]

    GraspNet-1Billion: A large-scale benchmark for general object grasping

    Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. GraspNet-1Billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11444–11453, 2020.

  12. [12]

    Rethinking few-shot adaptation of vision-language models in two stages

    Matteo Farina, Massimiliano Mancini, Giovanni Iacca, and Elisa Ricci. Rethinking few-shot adaptation of vision-language models in two stages. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29989–29998, 2025.

  13. [13]

    Unsupervised domain adaptation by backpropagation

    Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.

  14. [14]

    Domain-adversarial training of neural networks

    Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

  15. [15]

    CLIP-Adapter: Better vision-language models with feature adapters

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.

  16. [16]

    Domain adaptation via prompt learning

    Chunjiang Ge, Rui Huang, Mixue Xie, Zihang Lai, Shiji Song, Shuang Li, and Gao Huang. Domain adaptation via prompt learning. IEEE Transactions on Neural Networks and Learning Systems, 2023.

  17. [17]

    Revisiting point cloud shape classification with a simple and effective baseline

    Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, and Jia Deng. Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, pages 3809–3820. PMLR, 2021.

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  19. [19]

    Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?

    Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.

  20. [20]

    Unsupervised multi-task feature learning on point clouds

    Kaveh Hassani and Mike Haley. Unsupervised multi-task feature learning on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8160–8171, 2019.

  21. [21]

    Progressive distribution bridging: Unsupervised adaptation for large-scale pre-trained models via adaptive auxiliary data

    Weinan He, Yixin Zhang, and Zilei Wang. Progressive distribution bridging: Unsupervised adaptation for large-scale pre-trained models via adaptive auxiliary data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3280–3292, 2025.

  22. [22]

    CLIP goes 3D: Leveraging prompt tuning for language grounded 3D recognition

    Deepti Hegde, Jeya Maria Jose Valanarasu, and Vishal Patel. CLIP goes 3D: Leveraging prompt tuning for language grounded 3D recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2028–2038, 2023.

  23. [23]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  24. [24]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  25. [25]

    CLIP2Point: Transfer CLIP to point cloud classification with image-depth pre-training

    Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. CLIP2Point: Transfer CLIP to point cloud classification with image-depth pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22157–22167, 2023.

  26. [26]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.

  27. [27]

    MaPLe: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023.

  28. [28]

    How to adapt your large-scale vision-and-language model

    Konwoo Kim, Michael Laskin, Igor Mordatch, and Deepak Pathak. How to adapt your large-scale vision-and-language model. 2021.

  29. [29]

    Resource efficient 3D convolutional neural networks

    Okan Kopuklu, Neslihan Kose, Ahmet Gunduz, and Gerhard Rigoll. Resource efficient 3D convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

  30. [30]

    PADCLIP: Pseudo-labeling with adaptive debiasing in CLIP for unsupervised domain adaptation

    Zhengfeng Lai, Noranart Vesdapunt, Ning Zhou, Jun Wu, Cong Phuoc Huynh, Xuelu Li, Kah Kuen Fu, and Chen-Nee Chuah. PADCLIP: Pseudo-labeling with adaptive debiasing in CLIP for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16155–16165, 2023.

  31. [31]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

  32. [32]

    Semantic concentration for domain adaptation

    Shuang Li, Mixue Xie, Fangrui Lv, Chi Harold Liu, Jian Liang, Chen Qin, and Wei Li. Semantic concentration for domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9102–9111, 2021.

  33. [33]

    Split to merge: Unifying separated modalities for unsupervised domain adaptation

    Xinyao Li, Yuke Li, Zhekai Du, Fengling Li, Ke Lu, and Jingjing Li. Split to merge: Unifying separated modalities for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23364–23374, 2024.

  34. [34]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021.

  35. [35]

    Scaling down to scale up: A guide to parameter-efficient fine-tuning

    Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647, 2023.

  36. [36]

    Point cloud domain adaptation via masked local 3D structure prediction

    Hanxue Liang, Hehe Fan, Zhiwen Fan, Yi Wang, Tianlong Chen, Yu Cheng, and Zhangyang Wang. Point cloud domain adaptation via masked local 3D structure prediction. In European Conference on Computer Vision, pages 156–172. Springer, 2022.

  37. [37]

    Learning transferable features with deep adaptation networks

    Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR, 2015.

  38. [38]

    Conditional adversarial domain adaptation

    Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. Advances in Neural Information Processing Systems, 31, 2018.

  39. [39]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

  40. [40]

    Visual classification via description from large language models

    Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In International Conference on Learning Representations, 2023.

  41. [41]

    COSMo: CLIP talks on open-set multi-target domain adaptation

    Munish Monga, Sachin Kumar Giroh, Ankit Jha, Mainak Singha, Biplab Banerjee, and Jocelyn Chanussot. COSMo: CLIP talks on open-set multi-target domain adaptation. arXiv preprint arXiv:2409.00397, 2024.

  42. [42]

    Introducing GPT-5

    OpenAI. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, August 7, 2025.

  43. [43]

    Georeferenced point clouds: A survey of features and point cloud management

    Johannes Otepka, Sajid Ghuffar, Christoph Waldhauser, Ronald Hochreiter, and Norbert Pfeifer. Georeferenced point clouds: A survey of features and point cloud management. ISPRS International Journal of Geo-Information, 2(4):1038–1065, 2013.

  44. [44]

    PointNet: Deep learning on point sets for 3D classification and segmentation

    Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

  45. [45]

    PointDAN: A multi-scale 3D domain adaption network for point cloud representation

    Can Qin, Haoxuan You, Lichen Wang, C-C Jay Kuo, and Yun Fu. PointDAN: A multi-scale 3D domain adaption network for point cloud representation. Advances in Neural Information Processing Systems, 32, 2019.

  46. [46]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. https://www.mikecaptain.com/resources/pdf/GPT-1.pdf, 2018.

  47. [47]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  48. [48]

    Domain generalization for semantic segmentation: a survey

    Taki Hasan Rafi, Ratul Mahjabin, Emon Ghosh, Young-Woong Ko, and Jeong-Gun Lee. Domain generalization for semantic segmentation: a survey. Artificial Intelligence Review, 57(9):247, 2024.

  49. [49]

    Theoretical analysis of domain adaptation with optimal transport

    Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737–753. Springer, 2017.

  50. [50]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

  51. [51]

    Self-supervised deep learning on point clouds by reconstructing space

    Jonathan Sauder and Bjarne Sievers. Self-supervised deep learning on point clouds by reconstructing space. Advances in Neural Information Processing Systems, 32, 2019.

  52. [52]

    DiffCLIP: Leveraging Stable Diffusion for language grounded 3D classification

    Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, and Xinxiao Wu. DiffCLIP: Leveraging Stable Diffusion for language grounded 3D classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3596–3605, 2024.

  53. [53]

    Domain adaptation on point clouds via geometry-aware implicits

    Yuefan Shen, Yanchao Yang, Mi Yan, He Wang, Youyi Zheng, and Leonidas J Guibas. Domain adaptation on point clouds via geometry-aware implicits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7223–7232, 2022.

  54. [54]

    Test-time prompt tuning for zero-shot generalization in vision-language models

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022.

  55. [55]

    AD-CLIP: Adapting domains in prompt space using CLIP

    Mainak Singha, Harsh Pal, Ankit Jha, and Biplab Banerjee. AD-CLIP: Adapting domains in prompt space using CLIP. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4355–4364, 2023.

  56. [56]

    FedMVP: Federated multi-modal visual prompt tuning for vision-language models

    Mainak Singha, Subhankar Roy, Sarthak Mehrotra, Ankit Jha, Moloud Abdar, Biplab Banerjee, and Elisa Ricci. FedMVP: Federated multi-modal visual prompt tuning for vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

  57. [57]

    Multi-view convolutional neural networks for 3D shape recognition

    Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

  58. [58]

    Adversarial discriminative domain adaptation

    Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.

  59. [59]

    Learning discriminative features by covering local geometric space for point cloud analysis

    Changshuo Wang, Xin Ning, Linjun Sun, Liping Zhang, Weijun Li, and Xiao Bai. Learning discriminative features by covering local geometric space for point cloud analysis. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15, 2022.

  60. [60]

    3D-CenterNet: 3D object detection network for point clouds with center estimation priority

    Qi Wang, Jian Chen, Jianqiang Deng, and Xinfang Zhang. 3D-CenterNet: 3D object detection network for point clouds with center estimation priority. Pattern Recognition, 115:107884, 2021.

  61. [61]

    Improving point cloud classification and segmentation via parametric veronese mapping

    Ruibin Wang, Xianghua Ying, Bowei Xing, Xin Tong, Taiyan Chen, Jinfa Yang, and Yongjie Shi. Improving point cloud classification and segmentation via parametric veronese mapping. Pattern Recognition, 144:109784, 2023.

  62. [62]

    Dynamic graph CNN for learning on point clouds

    Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.

  63. [63]

    3D ShapeNets: A deep representation for volumetric shapes

    Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

  64. [64]

    Seeing 3D through 2D lenses: 3D few-shot class-incremental learning via cross-modal geometric rectification

    Tuo Xiang, Xuemiao Xu, Bangzhen Liu, Jinyi Li, Yong Li, and Shengfeng He. Seeing 3D through 2D lenses: 3D few-shot class-incremental learning via cross-modal geometric rectification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6761–6771, 2025.

  65. [65]

    FILP-3D: Enhancing 3D few-shot class-incremental learning with pre-trained vision-language models

    Wan Xu, Tianyu Huang, Tianyuan Qu, Guanglei Yang, Yiwen Guo, and Wangmeng Zuo. FILP-3D: Enhancing 3D few-shot class-incremental learning with pre-trained vision-language models. Pattern Recognition, 165:111558, 2025.

  66. [66]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  67. [67]

    BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2021.

  68. [68]

    Low-rank few-shot adaptation of vision-language models

    Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024.

  69. [69]

    CLIP2: Contrastive language-image-point pretraining from real-world point cloud data

    Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, and Hang Xu. CLIP2: Contrastive language-image-point pretraining from real-world point cloud data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15244–15253, 2023.

  70. [70]

    Deformation depth decoupling network for point cloud domain adaptation

    Huang Zhang, Xin Ning, Changshuo Wang, Enhao Ning, and Lusi Li. Deformation depth decoupling network for point cloud domain adaptation. Neural Networks, 180:106626, 2024.

  71. [71]

    PointCLIP: Point cloud understanding by CLIP

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. PointCLIP: Point cloud understanding by CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022.

  72. [72]

    Factual probing is [MASK]: Learning vs. learning to recall

    Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [MASK]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.

  73. [73]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.

  74. [74]

    PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning

    Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2639–2650, 2023.

  75. [75]

    Geometry-aware self-training for unsupervised domain adaptation on object point clouds

    Longkun Zou, Hui Tang, Ke Chen, and Kui Jia. Geometry-aware self-training for unsupervised domain adaptation on object point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6403–6412, 2021.
