pith. machine review for the scientific record.

arxiv: 2602.20409 · v2 · submitted 2026-02-23 · 💻 cs.CV · cs.LG

Recognition: no theorem link

CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:02 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords 3D point cloud domain adaptation · few-shot unsupervised learning · CLIP vision-language model · depth map projection · optimal transport alignment · prompt tuning · synthetic to real transfer

The pith

CLIPoint3D adapts a frozen CLIP model to few-shot 3D point cloud domain adaptation via depth map projections and alignment losses

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents CLIPoint3D as a way to adapt vision-language models like CLIP for 3D point cloud tasks when data comes from different domains and only a few labels are available. It converts each 3D sample into several 2D depth map views that the pre-trained CLIP image encoder can process directly. Lightweight prompt tuning adds language knowledge and geometric cues, while entropy-guided sampling picks the most useful views. Optimal transport and prototype alignment then match source and target distributions without losing the ability to separate classes. The result is steady accuracy gains over prior methods on standard benchmarks, which matters for practical 3D applications that cannot afford to train large new encoders from scratch.

Core claim

CLIPoint3D is the first framework for few-shot unsupervised 3D point cloud domain adaptation built on CLIP; it projects points to multiple depth maps, applies knowledge-driven prompt tuning and entropy-guided sampling, then uses optimal transport alignment plus uncertainty-aware prototype alignment to close synthetic-to-real gaps, yielding 3-16 percent accuracy gains on PointDA-10 and GraspNetPC-10.

What carries the argument

Projection of 3D point clouds into multiple depth maps processed by a frozen CLIP backbone, refined by knowledge-driven prompt tuning that mixes language priors with cues from a lightweight 3D encoder, plus entropy-guided view sampling and optimal transport alignment losses.
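
To make the projection step concrete, here is a minimal sketch of rendering a point cloud into multi-view orthographic depth maps. The paper's actual renderer, view poses, resolution, and normalization are not specified here, so every detail below is an assumption.

```python
import numpy as np

def depth_maps(points: np.ndarray, num_views: int = 10, res: int = 224) -> np.ndarray:
    """Render a point cloud of shape (N, 3) into `num_views` depth maps.

    A sketch of the idea only: evenly spaced azimuthal views, orthographic
    projection, and a max-scatter as a crude z-buffer.
    """
    # Normalize the cloud into [-1, 1]^3 so pixel coordinates are stable.
    points = points - points.mean(axis=0)
    points = points / (np.abs(points).max() + 1e-8)

    maps = []
    for k in range(num_views):
        theta = 2 * np.pi * k / num_views  # evenly spaced azimuth angles
        rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                        [0.0, 1.0, 0.0],
                        [-np.sin(theta), 0.0, np.cos(theta)]])
        p = points @ rot.T  # rotate the cloud into the view frame
        # Orthographic projection: (x, y) -> pixel grid, z -> depth in [0, 1].
        u = ((p[:, 0] + 1) / 2 * (res - 1)).astype(int)
        v = ((p[:, 1] + 1) / 2 * (res - 1)).astype(int)
        depth = (p[:, 2] + 1) / 2
        img = np.zeros((res, res))
        np.maximum.at(img, (v, u), depth)  # keep one depth per pixel (max scatter)
        maps.append(img)
    return np.stack(maps)  # shape (num_views, res, res)
```

Each map can then be replicated to three channels and fed to the frozen CLIP image encoder like any RGB image.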

If this is right

  • Domain adaptation for 3D data becomes possible with far less computation than retraining full encoders.
  • Language priors from VLMs can substitute for some 3D-specific training signals during adaptation.
  • Class boundaries remain intact when alignment losses are combined with prototype methods.
  • The approach scales to real-world settings like robotics where synthetic data is abundant but real labels are scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar depth map projections could be tested on other 3D representations such as meshes or voxels to check if the same gains appear.
  • The framework hints that vision-language priors might reduce reliance on purely geometric 3D training for cross-domain tasks.
  • A direct extension would measure performance when the number of shots drops to one or zero to see how far language guidance can stretch.
  • This opens a route to apply the same idea to 3D segmentation or object detection without new large-scale 3D pre-training.

Load-bearing premise

Projecting 3D points to depth maps together with entropy-guided sampling and optimal transport alignment will reliably close the domain gap while keeping classes separable even when the CLIP backbone stays largely frozen.
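
The optimal transport half of that premise can be illustrated with a generic entropic-regularized (Sinkhorn) alignment loss between batches of source and target features; the paper's exact cost function and regularization strength are not given here, so treat this as a textbook sketch rather than the authors' loss.

```python
import torch

def sinkhorn_ot_loss(fs: torch.Tensor, ft: torch.Tensor,
                     eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropic OT cost between source features fs (n, d) and target ft (m, d).

    A generic Sinkhorn sketch with uniform marginals, not necessarily the
    paper's exact alignment objective.
    """
    cost = torch.cdist(fs, ft) ** 2          # (n, m) squared-Euclidean cost
    cost = cost / (cost.max() + 1e-12)       # normalize for numerical stability
    K = torch.exp(-cost / eps)               # Gibbs kernel
    a = torch.full((fs.size(0),), 1.0 / fs.size(0), device=fs.device)
    b = torch.full((ft.size(0),), 1.0 / ft.size(0), device=ft.device)
    u = torch.ones_like(a)
    for _ in range(iters):                   # Sinkhorn fixed-point updates
        v = b / (K.t() @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    plan = u[:, None] * K * v[None, :]       # approximate transport plan (n, m)
    return (plan * cost).sum()               # transport cost under the plan
```

Minimizing such a loss pulls the two feature distributions together; the premise is that doing so, alongside the prototype loss, does not collapse class boundaries.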

What would settle it

A benchmark test on scenes where depth map projections lose critical geometric detail, such as heavy occlusion, showing no accuracy gain or outright loss compared with standard baselines.

Figures

Figures reproduced from arXiv: 2602.20409 by Biplab Banerjee, Elisa Ricci, Mainak Singha, Paolo Casari, Sarthak Mehrotra, Subhasis Chaudhuri.

Figure 1
Figure 1: Comparison of CLIPoint3D with SOTA methods on GraspNetPC-10. Encoder-based 3D UDA methods (e.g., PointDAN [45], GAST [75], MLSP [36]) are accurate but computationally expensive, while CLIP-based extensions fail to bridge the synthetic-real gap. CLIPoint3D achieves +16.4% improvement with minimal overhead.
Figure 2
Figure 2: Overview of CLIPoint3D, the first CLIP-based unsupervised 3D point cloud domain adaptation framework, comprising four key modules: (1) knowledge-driven prompt tuning generates LLM-guided textual and 3D-aware visual prompts; (2) parameter-efficient fine-tuning (PEFT) jointly optimizes these prompts and the encoder, while (3) entropy-based view selection filters unreliable projections; (4) dual objectives, unc…
Figure 3
Figure 3: (a) Effect of the number of labeled samples in Ds during training. (b) Effect of projected views: accuracy variation with projection count M. As shown in Figure 3b, increasing the number of 2D projections M enhances performance by enriching multi-view cues, with accuracy peaking at M=10, where additional views add redundancy with minimal gain.
original abstract

Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Project page: https://sarthakm320.github.io/CLIPoint3D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built on CLIP. It projects point clouds to multiple depth maps, applies knowledge-driven prompt tuning that combines language priors with cues from a lightweight 3D encoder, uses entropy-guided view sampling, parameter-efficient fine-tuning, and combines optimal-transport alignment with uncertainty-aware prototype alignment to bridge source-target gaps while preserving class separability. Experiments on PointDA-10 and GraspNetPC-10 report consistent 3-16% accuracy gains over CLIP-based and conventional encoder-based baselines.

Significance. If the reported gains prove robust, the work would be significant as the first demonstration that a largely frozen CLIP backbone, augmented only by prompt tuning, depth-map projection, and standard alignment losses, can close synthetic-to-real 3D domain gaps in a few-shot unsupervised setting. This offers a parameter-efficient alternative to training heavy 3D encoders from scratch and opens a path for language-grounded 3D adaptation.

major comments (2)
  1. [Abstract and §4] The headline claim of 3-16% accuracy gains is presented without error bars, ablation tables, or statistical significance tests. This absence makes it impossible to judge whether the improvements are stable across random seeds or sensitive to post-hoc hyper-parameter choices such as the entropy threshold.
  2. [§3.2] The entropy-guided view sampling criterion is described only at a high level; it is unclear how the entropy threshold interacts with the number of projected views and whether it was tuned on the target domain, which would undermine the unsupervised claim.
minor comments (2)
  1. [§3.1] The precise architecture of the lightweight 3D encoder and the exact prompt-tuning parameters should be stated explicitly (e.g., number of learnable tokens and which CLIP layers are updated).
  2. [Tables 1-2] Table 1 and Table 2 should add a row or column reporting the number of trainable parameters and inference-time FLOPs relative to the baselines to substantiate the efficiency advantage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the experimental reporting and methodological details.

point-by-point responses
  1. Referee: [Abstract and §4] The headline claim of 3-16% accuracy gains is presented without error bars, ablation tables, or statistical significance tests. This absence makes it impossible to judge whether the improvements are stable across random seeds or sensitive to post-hoc hyper-parameter choices such as the entropy threshold.

    Authors: We agree that the current presentation would benefit from greater statistical rigor. In the revised manuscript we will report all main results as means and standard deviations over five random seeds, include expanded ablation tables that vary the entropy threshold and number of views, and add paired t-tests (or equivalent) against the strongest baselines to establish significance. These additions will directly address concerns about stability and hyper-parameter sensitivity. revision: yes

  2. Referee: [§3.2] The entropy-guided view sampling criterion is described only at a high level; it is unclear how the entropy threshold interacts with the number of projected views and whether it was tuned on the target domain, which would undermine the unsupervised claim.

    Authors: We will expand §3.2 with the precise entropy formula, the fixed projection count (eight depth maps per sample), and the source-only procedure used to set the threshold. The threshold is chosen on source validation data to retain a minimum number of low-entropy views and is never adjusted using target samples or labels. Revised text will include pseudocode and an explicit statement confirming that no target-domain information influences the sampling, thereby preserving the unsupervised protocol. revision: yes
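
As a concrete reading of the criterion described above, a minimal entropy-guided view filter could look like the sketch below; the exact entropy formula, the threshold protocol, and the fallback behavior are assumptions, not the authors' published procedure.

```python
import torch
import torch.nn.functional as F

def select_confident_views(view_logits: torch.Tensor, threshold: float):
    """Keep projections whose prediction entropy falls below `threshold`.

    `view_logits` holds (M, C) class logits for the M depth-map views of
    one sample. An assumed sketch of the selection rule only.
    """
    probs = F.softmax(view_logits, dim=-1)                      # (M, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # (M,)
    keep = entropy < threshold
    if not keep.any():                 # fallback: retain the single best view
        keep = entropy == entropy.min()
    return view_logits[keep], keep
```

Under the rebuttal's protocol, `threshold` would be fixed on source validation data only, so this boolean mask never sees target labels.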

Circularity Check

0 steps flagged

No significant circularity; minor self-citation not load-bearing

full rationale

The derivation chain relies on standard components (depth-map projection of point clouds, entropy-guided view sampling, prompt tuning of frozen CLIP, optimal transport alignment, and uncertainty-aware prototype alignment) whose loss formulations and motivations are independent of the final reported accuracy gains. No equation reduces a prediction to a fitted input by construction, no uniqueness theorem is imported from the same authors to force the architecture, and no ansatz is smuggled via self-citation. The 3-16% gains are presented as empirical outcomes on standard benchmarks rather than tautological re-statements of the alignment objectives. A score of 2 accounts for routine self-citation of prior CLIP adaptation work that is not load-bearing for the central claim.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on the assumption that CLIP's image encoder can be steered via prompt tuning on depth maps and that standard alignment losses suffice for 3D domain gaps; no new physical entities are introduced.

free parameters (2)
  • prompt tuning parameters
    Knowledge-driven prompt tuning integrates high-level language priors; exact number and initialization not specified in abstract.
  • entropy threshold for view sampling
    Entropy-guided view sampling selects confident projections; threshold value is a tunable hyper-parameter.
axioms (2)
  • domain assumption Depth-map projections preserve sufficient geometric information for CLIP to reason about 3D objects
    Central to the pipeline; invoked when 3D samples are turned into multiple depth maps for the frozen CLIP backbone.
  • domain assumption Optimal transport and uncertainty-aware prototype alignment close distribution gaps without destroying class separability
    Invoked in the description of the two collaborative losses.
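
A minimal sketch of the prototype half of the second assumption: per-class source prototypes pulled toward confidence-weighted target prototypes built from pseudo-labels. The specific weighting below is an illustrative assumption; the abstract does not detail the paper's uncertainty model.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(fs, ys, ft, target_logits, num_classes: int):
    """Align per-class feature prototypes across domains.

    fs (n, d): source features, ys (n,): source labels,
    ft (m, d): target features, target_logits (m, C): target predictions.
    A sketch of the idea; confidence weighting here stands in for the
    paper's uncertainty-aware scheme.
    """
    probs = F.softmax(target_logits, dim=-1)
    conf, pseudo = probs.max(dim=-1)        # confidence and pseudo-labels
    loss, used = fs.new_zeros(()), 0
    for c in range(num_classes):
        src_mask, tgt_mask = ys == c, pseudo == c
        if src_mask.any() and tgt_mask.any():
            proto_s = fs[src_mask].mean(0)                  # source prototype
            w = conf[tgt_mask] / conf[tgt_mask].sum()       # confidence weights
            proto_t = (w[:, None] * ft[tgt_mask]).sum(0)    # weighted target prototype
            loss = loss + (proto_s - proto_t).pow(2).sum()
            used += 1
    return loss / max(used, 1)
```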

pith-pipeline@v0.9.0 · 5560 in / 1469 out tokens · 41617 ms · 2026-05-15T20:02:37.911634+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    BioVLM achieves state-of-the-art cross-modality generalization on biomedical VLMs by learning a prompt bank and routing inputs to the most discriminative prompts via low-entropy selection plus LLM distillation.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

  2. [2]

    Self-supervised learning for domain adaptation on point clouds

    Idan Achituve, Haggai Maron, and Gal Chechik. Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 123–133, 2021.

  3. [3]

    Pre-train or annotate? Domain adaptation with a constrained budget

    Fan Bai, Alan Ritter, and Wei Xu. Pre-train or annotate? Domain adaptation with a constrained budget. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

  4. [4]

    Analysis of representations for domain adaptation

    Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19, 2006.

  5. [5]

    Challenges in fusion of heterogeneous point clouds

    Fabio Bracci, Martin Drauschke, Stefan Kühne, and Z-C Márton. Challenges in fusion of heterogeneous point clouds. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 42:155–162, 2018.

  6. [6]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

  7. [7]

    Canonical shape projection is all you need for 3D few-shot class incremental learning

    Ali Cheraghian, Zeeshan Hayder, Sameera Ramasinghe, Shafin Rahman, Javad Jafaryahya, Lars Petersson, and Mehrtash Harandi. Canonical shape projection is all you need for 3D few-shot class incremental learning. In European Conference on Computer Vision, pages 36–53. Springer, 2024.

  8. [8]

    ScanNet: Richly-annotated 3D reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

  9. [9]

    MVF-PointCLIP: Training-free multi-view fusion PointCLIP for zero-shot 3D classification

    Jiuqian Dai, Zhenyan Ji, Zechang Xiong, Guiping Zhu, Hui Liu, Shen Yin, and Jose Enrique Armendariz-Inigo. MVF-PointCLIP: Training-free multi-view fusion PointCLIP for zero-shot 3D classification. Neurocomputing, 653:131188, 2025.

  10. [10]

    Domain-agnostic mutual prompting for unsupervised domain adaptation

    Zhekai Du, Xinyao Li, Fengling Li, Ke Lu, Lei Zhu, and Jingjing Li. Domain-agnostic mutual prompting for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23375–23384, 2024.

  11. [11]

    GraspNet-1Billion: A large-scale benchmark for general object grasping

    Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. GraspNet-1Billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11444–11453, 2020.

  12. [12]

    Rethinking few-shot adaptation of vision-language models in two stages

    Matteo Farina, Massimiliano Mancini, Giovanni Iacca, and Elisa Ricci. Rethinking few-shot adaptation of vision-language models in two stages. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29989–29998, 2025.

  13. [13]

    Unsupervised domain adaptation by backpropagation

    Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.

  14. [14]

    Domain-adversarial training of neural networks

    Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

  15. [15]

    CLIP-Adapter: Better vision-language models with feature adapters

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 132(2):581–595, 2024.

  16. [16]

    Domain adaptation via prompt learning

    Chunjiang Ge, Rui Huang, Mixue Xie, Zihang Lai, Shiji Song, Shuang Li, and Gao Huang. Domain adaptation via prompt learning. IEEE Transactions on Neural Networks and Learning Systems, 2023.

  17. [17]

    Revisiting point cloud shape classification with a simple and effective baseline

    Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, and Jia Deng. Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, pages 3809–3820. PMLR, 2021.

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  19. [19]

    Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?

    Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.

  20. [20]

    Unsupervised multi-task feature learning on point clouds

    Kaveh Hassani and Mike Haley. Unsupervised multi-task feature learning on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8160–8171, 2019.

  21. [21]

    Progressive distribution bridging: Unsupervised adaptation for large-scale pre-trained models via adaptive auxiliary data

    Weinan He, Yixin Zhang, and Zilei Wang. Progressive distribution bridging: Unsupervised adaptation for large-scale pre-trained models via adaptive auxiliary data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3280–3292, 2025.

  22. [22]

    CLIP goes 3D: Leveraging prompt tuning for language grounded 3D recognition

    Deepti Hegde, Jeya Maria Jose Valanarasu, and Vishal Patel. CLIP goes 3D: Leveraging prompt tuning for language grounded 3D recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2028–2038, 2023.

  23. [23]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  24. [24]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  25. [25]

    CLIP2Point: Transfer CLIP to point cloud classification with image-depth pre-training

    Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. CLIP2Point: Transfer CLIP to point cloud classification with image-depth pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22157–22167, 2023.

  26. [26]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.

  27. [27]

    MaPLe: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023.

  28. [28]

    How to adapt your large-scale vision-and-language model

    Konwoo Kim, Michael Laskin, Igor Mordatch, and Deepak Pathak. How to adapt your large-scale vision-and-language model. 2021.

  29. [29]

    Resource efficient 3D convolutional neural networks

    Okan Kopuklu, Neslihan Kose, Ahmet Gunduz, and Gerhard Rigoll. Resource efficient 3D convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.

  30. [30]

    PADCLIP: Pseudo-labeling with adaptive debiasing in CLIP for unsupervised domain adaptation

    Zhengfeng Lai, Noranart Vesdapunt, Ning Zhou, Jun Wu, Cong Phuoc Huynh, Xuelu Li, Kah Kuen Fu, and Chen-Nee Chuah. PADCLIP: Pseudo-labeling with adaptive debiasing in CLIP for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16155–16165, 2023.

  31. [31]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

  32. [32]

    Semantic concentration for domain adaptation

    Shuang Li, Mixue Xie, Fangrui Lv, Chi Harold Liu, Jian Liang, Chen Qin, and Wei Li. Semantic concentration for domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9102–9111, 2021.

  33. [33]

    Split to merge: Unifying separated modalities for unsupervised domain adaptation

    Xinyao Li, Yuke Li, Zhekai Du, Fengling Li, Ke Lu, and Jingjing Li. Split to merge: Unifying separated modalities for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23364–23374, 2024.

  34. [34]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021.

  35. [35]

    Scaling down to scale up: A guide to parameter-efficient fine-tuning

    Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647, 2023.

  36. [36]

    Point cloud domain adaptation via masked local 3D structure prediction

    Hanxue Liang, Hehe Fan, Zhiwen Fan, Yi Wang, Tianlong Chen, Yu Cheng, and Zhangyang Wang. Point cloud domain adaptation via masked local 3D structure prediction. In European Conference on Computer Vision, pages 156–172. Springer, 2022.

  37. [37]

    Learning transferable features with deep adaptation networks

    Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR, 2015.

  38. [38]

    Conditional adversarial domain adaptation

    Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. Advances in Neural Information Processing Systems, 31, 2018.

  39. [39]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

  40. [40]

    Visual classification via description from large language models

    Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In International Conference on Learning Representations, 2023.

  41. [41]

    COSMo: CLIP talks on open-set multi-target domain adaptation

    Munish Monga, Sachin Kumar Giroh, Ankit Jha, Mainak Singha, Biplab Banerjee, and Jocelyn Chanussot. COSMo: CLIP talks on open-set multi-target domain adaptation. arXiv preprint arXiv:2409.00397, 2024.

  42. [42]

    Introducing GPT-5

    OpenAI. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, August 7, 2025.

  43. [43]

    Georeferenced point clouds: A survey of features and point cloud management

    Johannes Otepka, Sajid Ghuffar, Christoph Waldhauser, Ronald Hochreiter, and Norbert Pfeifer. Georeferenced point clouds: A survey of features and point cloud management. ISPRS International Journal of Geo-Information, 2(4):1038–1065, 2013.

  44. [44]

    PointNet: Deep learning on point sets for 3D classification and segmentation

    Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

  45. [45]

    PointDAN: A multi-scale 3D domain adaption network for point cloud representation

    Can Qin, Haoxuan You, Lichen Wang, C-C Jay Kuo, and Yun Fu. PointDAN: A multi-scale 3D domain adaption network for point cloud representation. Advances in Neural Information Processing Systems, 32, 2019.

  46. [46]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. https://www.mikecaptain.com/resources/pdf/GPT-1.pdf, 2018.

  47. [47]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  48. [48]

    Domain generalization for semantic segmentation: a survey

    Taki Hasan Rafi, Ratul Mahjabin, Emon Ghosh, Young-Woong Ko, and Jeong-Gun Lee. Domain generalization for semantic segmentation: a survey. Artificial Intelligence Review, 57(9):247, 2024.

  49. [49]

    Theoretical analysis of domain adaptation with optimal transport

    Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737–753. Springer, 2017.

  50. [50]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

  51. [51]

    Self-supervised deep learning on point clouds by reconstructing space

    Jonathan Sauder and Bjarne Sievers. Self-supervised deep learning on point clouds by reconstructing space. Advances in Neural Information Processing Systems, 32, 2019.

  52. [52]

    DiffCLIP: Leveraging Stable Diffusion for language grounded 3D classification

    Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, and Xinxiao Wu. DiffCLIP: Leveraging Stable Diffusion for language grounded 3D classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3596–3605, 2024.

  53. [53]

    Domain adaptation on point clouds via geometry-aware implicits

    Yuefan Shen, Yanchao Yang, Mi Yan, He Wang, Youyi Zheng, and Leonidas J Guibas. Domain adaptation on point clouds via geometry-aware implicits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7223–7232, 2022.

  54. [54]

    Test-time prompt tuning for zero-shot generalization in vision-language models

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35:14274–14289, 2022.

  55. [55]

    AD-CLIP: Adapting domains in prompt space using CLIP

    Mainak Singha, Harsh Pal, Ankit Jha, and Biplab Banerjee. AD-CLIP: Adapting domains in prompt space using CLIP. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4355–4364, 2023.

  56. [56]

    FedMVP: Federated multi-modal visual prompt tuning for vision-language models

    Mainak Singha, Subhankar Roy, Sarthak Mehrotra, Ankit Jha, Moloud Abdar, Biplab Banerjee, and Elisa Ricci. FedMVP: Federated multi-modal visual prompt tuning for vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

  57. [57]

    Multi-view convolutional neural networks for 3D shape recognition

    Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

  58. [58]

    Adversarial discriminative domain adaptation

    Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.

  59. [59]

    Learning discriminative features by covering local geometric space for point cloud analysis

    Changshuo Wang, Xin Ning, Linjun Sun, Liping Zhang, Weijun Li, and Xiao Bai. Learning discriminative features by covering local geometric space for point cloud analysis. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15, 2022.

  60. [60]

    3D-CenterNet: 3D object detection network for point clouds with center estimation priority

    Qi Wang, Jian Chen, Jianqiang Deng, and Xinfang Zhang. 3D-CenterNet: 3D object detection network for point clouds with center estimation priority. Pattern Recognition, 115:107884, 2021.

  61. [61]

    Improving point cloud classification and segmentation via parametric veronese mapping

    Ruibin Wang, Xianghua Ying, Bowei Xing, Xin Tong, Taiyan Chen, Jinfa Yang, and Yongjie Shi. Improving point cloud classification and segmentation via parametric veronese mapping. Pattern Recognition, 144:109784, 2023.

  62. [62]

    Dynamic graph CNN for learning on point clouds

    Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.

  63. [63]

    3D ShapeNets: A deep representation for volumetric shapes

    Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

  64. [64]

    Seeing 3D through 2D lenses: 3D few-shot class-incremental learning via cross-modal geometric rectification

    Tuo Xiang, Xuemiao Xu, Bangzhen Liu, Jinyi Li, Yong Li, and Shengfeng He. Seeing 3D through 2D lenses: 3D few-shot class-incremental learning via cross-modal geometric rectification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6761–6771, 2025.

  65. [65]

    FILP-3D: Enhancing 3D few-shot class-incremental learning with pre-trained vision-language models

    Wan Xu, Tianyu Huang, Tianyuan Qu, Guanglei Yang, Yiwen Guo, and Wangmeng Zuo. FILP-3D: Enhancing 3D few-shot class-incremental learning with pre-trained vision-language models. Pattern Recognition, 165:111558, 2025.

  66. [66]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

  67. [67]

    BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2021.

  68. [68]

    Low-rank few-shot adaptation of vision-language models

    Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024.

  69. [69]

    CLIP2: Contrastive language-image-point pretraining from real-world point cloud data

    Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, and Hang Xu. CLIP2: Contrastive language-image-point pretraining from real-world point cloud data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15244–15253, 2023.

  70. [70]

    Deformation depth decoupling network for point cloud domain adaptation

    Huang Zhang, Xin Ning, Changshuo Wang, Enhao Ning, and Lusi Li. Deformation depth decoupling network for point cloud domain adaptation. Neural Networks, 180:106626, 2024.

  71. [71]

    PointCLIP: Point cloud understanding by CLIP

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. PointCLIP: Point cloud understanding by CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022.

  72. [72]

    Factual probing is [MASK]: Learning vs. learning to recall

    Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [MASK]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.

  73. [73]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.

  74. [74]

    PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning

    Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2639–2650, 2023.

  75. [75]

    Geometry-aware self-training for unsupervised domain adaptation on object point clouds

    Longkun Zou, Hui Tang, Ke Chen, and Kui Jia. Geometry-aware self-training for unsupervised domain adaptation on object point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6403–6412, 2021.
