Pith · machine review for the scientific record

arxiv: 2602.23069 · v2 · submitted 2026-02-26 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords: parameter-efficient transfer learning · point cloud video · 4D perception · optimal transport · 3D to 4D adaptation · action recognition · semantic segmentation · temporal modeling

The pith

PointATA transfers pre-trained 3D models to 4D point cloud video tasks by first aligning distributions with optimal transport, then adding lightweight temporal adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that 4D point cloud video data is scarce compared to 3D data, which limits scalable self-supervised models, and that direct transfer from 3D pre-trained models suffers from overfitting and a modality gap. It proposes a two-stage Align then Adapt process where optimal transport first quantifies and reduces the 3D-4D distribution difference via a point align embedder, then a frozen 3D backbone receives a point-video adapter and spatial-context encoder for temporal capacity. This setup allows models without built-in temporal knowledge to handle dynamic scenes at low parameter cost. A sympathetic reader would care because it offers a practical route to strong 4D performance in robotics without requiring massive new 4D datasets or full retraining.

Core claim

PointATA decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory quantifies the distributional discrepancy between 3D and 4D datasets so that a point align embedder can be trained in stage one to close the modality gap. In stage two an efficient point-video adapter and a spatial-context encoder are added to the frozen 3D backbone to supply temporal modeling while avoiding overfitting. With these designs a pre-trained 3D model without temporal knowledge can reason about dynamic video content at lower parameter cost than prior work, matching or exceeding full fine-tuning on action recognition, action segmentation, and semantic segmentation.
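The review does not reproduce the exact Stage 1 objective, so the following is only a minimal sketch of how a Sinkhorn-style optimal-transport distance between batches of 3D and 4D features could be computed and minimized to train an align embedder, in the spirit of Cuturi's entropic OT [40]. Function and tensor names (e.g. `align_embedder`, `feats_3d`) are illustrative assumptions, not the paper's API.

```python
import torch

def sinkhorn_distance(x, y, eps=0.05, n_iters=50):
    """Entropic OT distance between two feature sets (illustrative sketch).

    x: (n, d) source features, e.g. tokens from the frozen 3D backbone
    y: (m, d) target features, e.g. point-align-embedder outputs for 4D clips
    """
    cost = torch.cdist(x, y, p=2) ** 2                      # pairwise squared-Euclidean cost (n, m)
    a = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)   # uniform source marginal
    b = torch.full((y.size(0),), 1.0 / y.size(0), device=y.device)   # uniform target marginal
    K = torch.exp(-cost / eps)                               # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                                 # Sinkhorn iterations
        v = b / (K.t() @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    transport = torch.diag(u) @ K @ torch.diag(v)            # approximate transport plan
    return (transport * cost).sum()                          # transport cost as the modality-gap measure

# Stage 1 (sketch): update only the align embedder to pull 4D features
# toward the frozen 3D feature distribution.
# loss = sinkhorn_distance(feats_3d.detach(), align_embedder(clip_4d))
# loss.backward(); optimizer.step()
```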

What carries the argument

The Align then Adapt paradigm: optimal transport measures the 3D-4D gap to train the point align embedder in stage one, while the point-video adapter and spatial-context encoder supply temporal modeling in stage two on a frozen backbone.
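No implementation details are given in this review for the Stage 2 modules, so the block below is a schematic of the setup the paragraph describes: the 3D backbone stays frozen and a small bottleneck adapter supplies temporal mixing at low parameter cost. `PointVideoAdapter`, its dimensions, and the frame layout are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PointVideoAdapter(nn.Module):
    """Bottleneck adapter sketch: cheap temporal mixing across frames on top of frozen features."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        # Depthwise 1-D conv over the frame axis as a lightweight temporal mixer.
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):                       # x: (B, T, N_tokens, dim)
        B, T, N, D = x.shape
        h = self.down(x)                        # (B, T, N, hidden)
        h = h.permute(0, 2, 3, 1).reshape(B * N, -1, T)
        h = self.temporal(h)                    # mix information across frames
        h = h.reshape(B, N, -1, T).permute(0, 3, 1, 2)
        return x + self.up(torch.relu(h))       # residual adapter output

# Stage 2 (sketch): freeze the pre-trained 3D backbone, train only adapters and task head.
# for p in backbone.parameters():
#     p.requires_grad = False
```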

If this is right

  • A pre-trained 3D model reaches 97.21 percent accuracy on 3D action recognition after the two-stage transfer.
  • The method improves 4D action segmentation by 8.7 percent over earlier parameter-efficient baselines.
  • 4D semantic segmentation accuracy reaches 84.06 percent while training far fewer parameters than full fine-tuning.
  • The frozen backbone plus lightweight adapters reduces overfitting risk relative to unfrozen adaptation on video data.
  • The approach matches or exceeds full fine-tuning performance on 4D tasks at substantially lower parameter cost.
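The parameter-cost claims in the last two bullets are the kind of thing a reader can audit directly once the model is instantiated; a minimal sketch of counting trainable versus total parameters under the frozen-backbone setup (the model name is hypothetical):

```python
def parameter_budget(model):
    """Report trainable vs. total parameters (in millions) for a frozen-backbone model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / 1e6, total / 1e6, 100.0 * trainable / max(total, 1)

# Hypothetical usage after freezing the backbone and adding adapters:
# trainable_m, total_m, pct = parameter_budget(pointata_model)
# print(f"{trainable_m:.2f}M trainable of {total_m:.2f}M ({pct:.1f}%)")
```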

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Optimal transport alignment may extend to other 3D-to-4D transfers such as medical or driving scenes.
  • If the align stage generalizes, the volume of 4D data needed for competitive robotics models could drop.
  • Applying the same two-stage split to other modality gaps, such as image-to-video or static-to-dynamic, would be a direct test.

Load-bearing premise

The distributional discrepancy between 3D and 4D datasets can be effectively quantified by optimal transport and alleviated by the point align embedder without degrading the pre-trained 3D features or introducing new overfitting risks.

What would settle it

An experiment in which the point align embedder increases the measured modality gap or produces lower 4D task accuracy than a direct-adaptation baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.23069 by Chaoyi Lu, Haozhe Cheng, Jihua Zhu, Lin Chen, Yaonan Wang, Yiding Sun, Zhichuan Yang.

Figure 1
Figure 1: Current 4D PETL methods face two limits. Upper panel: adapters are feasible, yet current methods expose models to severe overfitting. Lower panel: cross-modal transfer needs prior alignment; the practice is mature in 2D vision and NLP, but to the authors' knowledge no 4D PETL study measures this gap, and ignoring it hurts downstream performance. …
Figure 2
Figure 2: Comparison of 3D full fine-tuning, 4D full fine-tuning, 4D adapter tuning, and PointATA. Because the embedder required for 4D encoding is often heavier than its 3D counterpart, a 4D dynamic perception task can require updating even more parameters (over 100%) than full 3D fine-tuning. PointATA saves substantial resources and time relative to full fine-tuning and sig…
Figure 3
Figure 3: PointATA employs a two-stage workflow to quickly adapt large 3D pre-trained models to diverse 4D downstream tasks. In Stage 1, it obtains 4D features via the P4D embedder, whose weights are randomly initialized, and learns by minimizing the distribution distance to the 3D source features. In Stage 2, it jointly fine-tunes the P4D embedder and the PVA to minimize the task loss. The 3D backbone remains fr…
Figure 4
Figure 4: Visualization of action segmentation. P4Transformer has a serious over-segmentation problem.
Figure 5
Figure 5: 4D semantic segmentation visualization on the Synthia 4D dataset. Key points are demarcated by red dashed circular bounding boxes.
Figure 6
Figure 6: Ablation studies of PointATA under various experimental settings.
Figure 10
Figure 10: t-SNE visualization showing that aligning static and dynamic clouds with the Point Align Embedder is essential. Smaller indices indicate tighter clusters. After alignment, intra-class distances shrink and inter-class gaps widen.
read the original abstract

Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-stage 'Align then Adapt' (PointATA) paradigm for parameter-efficient transfer of 3D pre-trained models to 4D point cloud video tasks. Stage 1 uses optimal transport to quantify and close the 3D-4D distributional gap via a trainable point align embedder. Stage 2 freezes the backbone and adds a lightweight point-video adapter plus spatial-context encoder to enable temporal reasoning without overfitting. The central claim is that this yields performance matching or exceeding full fine-tuning at lower parameter cost, with reported results of 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.

Significance. If the empirical claims hold under rigorous validation, the work would be significant for robotics and 4D perception by providing a scalable way to leverage abundant 3D pre-trained models on scarce 4D data, explicitly addressing modality gap and overfitting through a principled two-stage design that maintains parameter efficiency.

major comments (2)
  1. Abstract: the concrete performance numbers (97.21% accuracy, +8.7% segmentation) are presented without any description of the baselines, ablation controls, statistical tests, or train/validation/test splits used; these details are load-bearing for the central claim that PointATA matches or exceeds full fine-tuning.
  2. Method section (point align embedder and Stage 1): the claim that optimal transport plus the embedder alleviates the modality gap without degrading frozen 3D features or introducing new overfitting risks is central to the paradigm but lacks targeted ablation evidence showing feature preservation and generalization on held-out 4D data.
minor comments (2)
  1. Ensure all experimental tables report parameter counts, FLOPs, and training time alongside accuracy to substantiate the 'smaller parameter cost' advantage.
  2. Clarify the exact formulation of the optimal transport objective and the training protocol for the point align embedder (e.g., whether it is trained on paired 3D-4D samples or unpaired).
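For context on the second minor comment: Sinkhorn-type methods [40] solve the standard entropic optimal-transport problem, and whether the paper uses exactly this regularized form, and on paired or unpaired samples, is what the comment asks the authors to state. A generic form, not necessarily the paper's exact objective, is:

```latex
% Entropic OT between empirical 3D and 4D feature distributions with marginals a, b
% and cost matrix C (C_{ij} = cost between a 3D feature and a 4D feature).
\min_{P \in U(a,b)} \; \langle P, C \rangle - \varepsilon H(P),
\qquad U(a,b) = \{ P \in \mathbb{R}_{+}^{n \times m} : P\mathbf{1}_m = a,\; P^{\top}\mathbf{1}_n = b \},
\qquad H(P) = -\sum_{ij} P_{ij} (\log P_{ij} - 1).
```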

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: Abstract: the concrete performance numbers (97.21% accuracy, +8.7% segmentation) are presented without any description of the baselines, ablation controls, statistical tests, or train/validation/test splits used; these details are load-bearing for the central claim that PointATA matches or exceeds full fine-tuning.

    Authors: We agree that the abstract would benefit from additional context to support the reported numbers. In the revised manuscript, we have added a brief clause indicating that the results are compared against full fine-tuning and prior parameter-efficient transfer methods on standard 4D benchmarks using established train/test splits. Detailed baselines, ablation studies, and split information remain in Sections 4 and 5. This change improves clarity without exceeding abstract length constraints. revision: yes

  2. Referee: Method section (point align embedder and Stage 1): the claim that optimal transport plus the embedder alleviates the modality gap without degrading frozen 3D features or introducing new overfitting risks is central to the paradigm but lacks targeted ablation evidence showing feature preservation and generalization on held-out 4D data.

    Authors: The current manuscript provides ablations on the point align embedder's contribution to performance gains (see Table 3 and Figure 4). However, we acknowledge that more targeted evidence on feature preservation would strengthen the central claim. In the revision, we will add a new subsection with quantitative analysis, including feature similarity metrics (e.g., MMD and cosine similarity) computed before and after alignment on held-out 4D sequences, along with t-SNE visualizations and generalization results on unseen video clips. This will explicitly demonstrate that 3D features are preserved and overfitting risks are mitigated. revision: yes
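The feature-preservation analysis the rebuttal proposes (MMD and cosine similarity before and after alignment) is straightforward for a reader to reproduce; below is a minimal sketch with an RBF-kernel MMD estimator in the spirit of Gretton et al. [69]. All tensor names are illustrative, and the bandwidth choice is an assumption.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel between feature sets x: (n, d) and y: (m, d)."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b, p=2) ** 2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def mean_cosine_similarity(x, y):
    """Average cosine similarity between row-wise matched feature pairs."""
    return torch.nn.functional.cosine_similarity(x, y, dim=-1).mean()

# Hypothetical usage: compare the modality gap before and after Stage 1 alignment.
# gap_before = rbf_mmd2(feats_3d, feats_4d_raw)
# gap_after  = rbf_mmd2(feats_3d, align_embedder(feats_4d_raw))
```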

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical two-stage method (OT-based alignment via point align embedder in Stage 1, followed by lightweight adapters and spatial-context encoder on a frozen 3D backbone in Stage 2). No equations, derivations, or performance claims reduce to quantities defined solely by fitted parameters or self-citations internal to the paper. Optimal transport is invoked as standard external theory to quantify modality gap, and architectural choices are motivated by stated overfitting and distributional issues without creating self-definitional loops or renaming known results as novel predictions. The reported gains (e.g., 97.21% accuracy) are framed as outcomes of the proposed engineering designs rather than forced by construction from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method introduces new modules (point align embedder, point-video adapter) whose internal hyperparameters or assumptions are not detailed.

pith-pipeline@v0.9.0 · 5588 in / 1155 out tokens · 35040 ms · 2026-05-15T18:42:25.530106+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diffusion Masked Pretraining for Dynamic Point Cloud

    cs.CV 2026-05 unverdicted novelty 7.0

    DiMP applies diffusion modeling to masked pretraining of dynamic point clouds to remove positional leakage and capture motion uncertainty, yielding 11.21% and 13.65% gains on offline and online action segmentation.

  2. Diffusion Masked Pretraining for Dynamic Point Cloud

    cs.CV 2026-05 unverdicted novelty 7.0

    DiMP uses diffusion to infer clean masked positions from visible context and to model full distributions of point displacements rather than means, delivering 11.21% and 13.65% absolute gains on offline and online acti...

  3. Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mantis is the first Mamba-native PEFT framework for 3D point cloud models that injects task signals into state-space updates via State-Aware Adapters and regularizes serialization with Dual-Serialization Consistency D...

  4. Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mantis is the first Mamba-native PEFT framework for 3D point cloud models, using state-aware adapters and dual-serialization distillation to match performance with only 5% trainable parameters.

  5. CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 3 Pith papers · 1 internal anchor

  1. [1] Y. Sun, H. Cheng, C. Lu, Z. Li, M. Wu, H. Lu, and J. Zhu, "Hyper-point: Multimodal 3d foundation model in hyperbolic space," Pattern Recognit., vol. 173, p. 112800, 2026.
  2. [2] Y. Liu, J. Chen, Z. Zhang, J. Huang, and L. Yi, "Leaf: Learning frames for 4d point cloud sequence understanding," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 604–613.
  3. [3] J. Liu, J. Han, L. Liu, A. I. Aviles-Rivero, C. Jiang, Z. Liu, and H. Wang, "Mamba4d: Efficient 4d point cloud video understanding with disentangled spatial-temporal state space models," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025, pp. 17626–17636.
  4. [4] Z. Shen, X. Sheng, L. Wang, Y. Guo, Q. Liu, and X. Zhou, "Pointcmp: Contrastive mask prediction for self-supervised learning on point cloud videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 1212–1222.
  5. [5] Z. Deng, X. Li, X. Li, Y. Tong, S. Zhao, and M. Liu, "Vg4d: Vision-language model goes 4d video recognition," in IEEE Int. Conf. Robot. Autom., IEEE, 2024, pp. 5014–5020.
  6. [6] Z. Shen, X. Sheng, H. Fan, L. Wang, Y. Guo, Q. Liu, H. Wen, and X. Zhou, "Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos," in Proc. IEEE/CVF Int. Conf. Comput. Vis., October 2023, pp. 16580–16589.
  7. [7] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., "Shapenet: An information-rich 3d model repository," arXiv preprint arXiv:1512.03012, 2015.
  8. [8] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, "Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1588–1597.
  9. [9] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3d shapenets: A deep representation for volumetric shapes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1912–1920.
  10. [10] B. Lv, Y. Zha, T. Dai, X. Yuerong, K. Chen, and S.-T. Xia, "Adapting pre-trained 3d models for point cloud video understanding via cross-frame spatio-temporal perception," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025, pp. 12413–12422.
  11. [11] W. Ma, S. Li, L. Cai, and J. Kang, "Learning modality knowledge alignment for cross-modality transfer," in Proc. Int. Conf. Mach. Learn., PMLR, 2024, pp. 33777–33793.
  12. [12] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021.
  13. [13] J. Pan, Z. Lin, X. Zhu, J. Shao, and H. Li, "St-adapter: Parameter-efficient image-to-video transfer learning," Proc. Adv. Neural Inf. Process. Syst., vol. 35, pp. 26462–26477, 2022.
  14. [14] T. Yang, Y. Zhu, Y. Xie, A. Zhang, C. Chen, and M. Li, "Aim: Adapting image models for efficient video action recognition," in Proc. Int. Conf. Learn. Representations, 2023.
  15. [15] D. Alvarez-Melis and N. Fusi, "Geometric dataset distances via optimal transport," Proc. Adv. Neural Inf. Process. Syst., vol. 33, pp. 21428–21439, 2020.
  16. [16] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660.
  17. [17] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3d points," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshop, 2010, pp. 9–14.
  18. [18] Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi, "Hoi4d: A 4d egocentric dataset for category-level human-object interaction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2022, pp. 21013–21022.
  19. [19] C. Choy, J. Gwak, and S. Savarese, "4d spatio-temporal convnets: Minkowski convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3075–3084.
  20. [20] T. Weng, J. Xiao, F. Yan, and H. Jiang, "Context-aware 3d point cloud semantic segmentation with plane guidance," IEEE Trans. Multimedia, vol. 25, pp. 6653–6664, 2023.
  21. [21] C. Sun, Z. Zheng, X. Wang, M. Xu, and Y. Yang, "Self-supervised point cloud representation learning via separating mixed shapes," IEEE Trans. Multimedia, vol. 25, pp. 6207–6218, 2023.
  22. [22] X.-F. Han, Y.-F. Jin, H.-X. Cheng, and G.-Q. Xiao, "Dual transformer for point cloud analysis," IEEE Trans. Multimedia, vol. 25, pp. 5638–5648, 2023.
  23. [23] Y. Wu, J. Liu, M. Gong, Z. Liu, Q. Miao, and W. Ma, "Mpct: Multiscale point cloud transformer with a residual network," IEEE Trans. Multimedia, vol. 26, pp. 3505–3516, 2024.
  24. [24] N. Hu, H. Cheng, Y. Xie, P. Shi, and J. Zhu, "Hyperbolic image-and-pointcloud contrastive learning for 3d classification," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., IEEE, 2024, pp. 4973–4979.
  25. [25] H. Cheng, J. Zhu, J. Lu, and X. Han, "EDGCNet: Joint dynamic hyperbolic graph convolution and dual squeeze-and-attention for 3D point cloud segmentation," Expert Syst. Appl., vol. 237, p. 121551, 2024.
  26. [26] S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany, "Pointcontrast: Unsupervised pre-training for 3d point cloud understanding," in Eur. Conf. Comput. Vis., 2020.
  27. [27] M. Afham, I. Dissanayake, D. Dissanayake, A. Dharmasiri, K. Thilakarathna, and R. Rodrigo, "Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 9902–9912.
  28. [28] Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan, "Masked autoencoders for point cloud self-supervised learning," in Eur. Conf. Comput. Vis., 2022, pp. 604–621.
  29. [29] G. Chen, M. Wang, Y. Yang, K. Yu, L. Yuan, and Y. Yue, "Pointgpt: Auto-regressively generative pre-training from point clouds," Proc. Adv. Neural Inf. Process. Syst., vol. 36, 2024.
  30. [30] X. Liu, M. Yan, and J. Bohg, "Meteornet: Deep learning on dynamic 3d point cloud sequences," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 9245–9254.
  31. [31] H. Fan, Y. Yang, and M. Kankanhalli, "Point 4d transformer networks for spatio-temporal modeling in point cloud videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021.
  32. [32] H. Wen, Y. Liu, J. Huang, B. Duan, and L. Yi, "Point primitive transformer for long-term 4d point cloud video understanding," in Eur. Conf. Comput. Vis., Springer, 2022, pp. 19–35.
  33. [33] Z. Zhang, Y. Dong, Y. Liu, and L. Yi, "Complete-to-partial 4d distillation for self-supervised point cloud sequence representation learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 17661–17670.
  34. [34] Z. Shen, X. Sheng, H. Fan, L. Wang, Y. Guo, Q. Liu, H. Wen, and X. Zhou, "Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 16580–16589.
  35. [35] J. Shen, L. Li, L. M. Dery, C. Staten, M. Khodak, G. Neubig, and A. Talwalkar, "Cross-modal fine-tuning: Align then refine," in Proc. Int. Conf. Mach. Learn., PMLR, 2023, pp. 31030–31056.
  36. [36] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, "Learning to prompt for vision-language models," Int. J. Comput. Vis., vol. 130, no. 9, pp. 2337–2348, 2022.
  37. [37] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
  38. [38] Y. Zha, J. Wang, T. Dai, B. Chen, Z. Wang, and S.-T. Xia, "Instance-aware dynamic prompt tuning for pre-trained point cloud models," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 14161–14170.
  39. [39] X. Zhou, D. Liang, W. Xu, X. Zhu, Y. Xu, Z. Zou, and X. Bai, "Dynamic adapter meets prompt tuning: Parameter-efficient transfer learning for point cloud analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 14707–14717.
  40. [40] M. Cuturi, "Sinkhorn distances: lightspeed computation of optimal transport," in Proc. Adv. Neural Inf. Process. Syst., 2013.
  41. [41] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, "Parameter-efficient transfer learning for NLP," in Int. Conf. Mach. Learn., vol. 97, PMLR, 09–15 Jun 2019, pp. 2790–2799.
  42. [42] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, "Point-bert: Pre-training 3d point cloud transformers with masked point modeling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022.
  43. [43] X. Liu, M. Yan, and J. Bohg, "Meteornet: Deep learning on dynamic 3d point cloud sequences," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019.
  44. [44] H. Fan, X. Yu, Y. Ding, Y. Yang, and M. Kankanhalli, "Pstnet: point spatio-temporal convolution on point cloud sequences," in Proc. Int. Conf. Learn. Representations, 2021.
  45. [45] J.-X. Zhong, K. Zhou, Q. Hu, B. Wang, N. Trigoni, and A. Markham, "No pain, big gain: classify dynamic point cloud sequences with static models by fitting feature-level space-time surfaces," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8510–8520.
  46. [46] H. Fan, Y. Yang, and M. Kankanhalli, "Point spatio-temporal transformer networks for point cloud video modeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2181–2192, 2023.
  47. [47] L. Jing, Y. Xue, X. Yan, C. Zheng, D. Wang, R. Zhang, Z. Wang, H. Fang, B. Zhao, and Z. Li, "X4d-sceneformer: Enhanced scene understanding on 4d point cloud videos through cross-modal knowledge transfer," in Proc. AAAI Conf. Artif. Intell., vol. 38, no. 3, 2024, pp. 2670–2678.
  48. [48] X. Sheng, Z. Shen, and G. Xiao, "Contrastive predictive autoencoders for dynamic point cloud self-supervised learning," in Proc. AAAI Conf. Artif. Intell., vol. 37, no. 8, 2023, pp. 9802–9810.
  49. [49] X. Sheng, Z. Shen, G. Xiao, L. Wang, Y. Guo, and H. Fan, "Point contrastive prediction with semantic clustering for self-supervised learning on point cloud videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 16515–16524.
  50. [50] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "Ntu rgb+d: A large scale dataset for 3d human activity analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1010–1019.
  51. [51] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. Adv. Neural Inf. Process. Syst., 2017.
  52. [52] Y. Wang, Y. Xiao, F. Xiong, W. Jiang, Z. Cao, J. T. Zhou, and J. Yuan, "3dv: 3d dynamic voxel for action recognition in depth video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 508–517.
  53. [53] Y. Liu, C. Chen, Z. Wang, and L. Yi, "Crossvideo: Self-supervised cross-modal contrastive learning for point cloud video understanding," in IEEE Int. Conf. Robot. Autom., 2024, pp. 12436–12442.
  54. [54] Y. Min, Y. Zhang, X. Chai, and X. Chen, "An efficient pointlstm for point clouds based gesture recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5760–5769.
  55. [55] X. Liu, C. R. Qi, and L. J. Guibas, "Flownet3d: Learning scene flow in 3d point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 529–537.
  56. [56] G. Puy, A. Boulch, and R. Marlet, "Flot: Scene flow on point clouds guided by optimal transport," in Eur. Conf. Comput. Vis., Springer, 2020, pp. 527–544.
  57. [57] B. Ouyang and D. Raviv, "Occlusion guided scene flow estimation on 3d point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2805–2814.
  58. [58] H. Wang, J. Pang, M. A. Lodhi, Y. Tian, and D. Tian, "Festa: Flow estimation via spatial-temporal attention for scene point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14173–14182.
  59. [59] G. Wang, Y. Hu, Z. Liu, Y. Zhou, M. Tomizuka, W. Zhan, and H. Wang, "What matters for 3d scene flow network," in Eur. Conf. Comput. Vis., Springer, 2022, pp. 38–55.
  60. [60] W. Cheng and J. H. Ko, "Bi-pointflownet: Bidirectional learning for point cloud based scene flow estimation," in Eur. Conf. Comput. Vis., Springer, 2022, pp. 108–124.
  61. [61] I. Lang, D. Aiger, F. Cole, S. Avidan, and M. Rubinstein, "Scoop: Self-supervised correspondence and optimization-based scene flow," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 5281–5290.
  62. [62] W. Cheng and J. H. Ko, "Multi-scale bidirectional recurrent network with hybrid correlation for point cloud based scene flow estimation," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 10007–10016.
  63. [63] J. Liu, G. Wang, W. Ye, C. Jiang, J. Han, Z. Liu, G. Zhang, D. Du, and H. Wang, "Difflow3d: Toward robust uncertainty-aware scene flow estimation with diffusion model," arXiv preprint arXiv:2311.17456, 2023.
  64. [64] Q. de Smedt, H. Wannous, J.-P. Vandeborre, J. Guerry, B. Le Saux, and D. Filliat, "SHREC'17 Track: 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset," in Proc. 10th Eurographics Workshop on 3D Object Retrieval, Apr. 2017, pp. 1–6.
  65. [65] M. Menze and A. Geiger, "Object scene flow for autonomous vehicles," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3061–3070.
  66. [66] H. Cheng, J. Zhu, N. Hu, J. Chen, and W. Yan, "Ptm: Torus masking for 3d representation learning guided by robust and trusted teachers," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 12, pp. 12158–12170, 2024.
  67. [67] R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. Wang, Y. Qiao, and H. Li, "Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training," in Proc. Adv. Neural Inf. Process. Syst., 2022.
  68. [68] E. D. Dolan and J. J. Moré, "Benchmarking optimization software with performance profiles," Math. Program., vol. 91, no. 2, pp. 201–213, 2002.
  69. [69] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, "A kernel method for the two-sample-problem," Adv. Neural Inf. Process. Syst., vol. 19, 2006.
  70. [70] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, "Similarity of neural network representations revisited," in Proc. Int. Conf. Mach. Learn., PMLR, 2019, pp. 3519–3529.
  71. [71] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, "Visual prompt tuning," in Eur. Conf. Comput. Vis., Springer, 2022, pp. 709–727.
  72. [72] D. Liang, T. Feng, X. Zhou, Y. Zhang, Z. Zou, and X. Bai, "Parameter-efficient fine-tuning in spectral domain for point cloud learning," IEEE Trans. Pattern Anal. Mach. Intell., 2025.