Pith · machine review for the scientific record

arxiv: 2602.23069 · v2 · submitted 2026-02-26 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords: parameter-efficient transfer learning · point cloud video · 4D perception · optimal transport · 3D to 4D adaptation · action recognition · semantic segmentation · temporal modeling

The pith

PointATA transfers pre-trained 3D models to 4D point cloud video tasks by first aligning distributions with optimal transport, then adding lightweight temporal adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that 4D point cloud video data is scarce compared to 3D data, which limits scalable self-supervised models, and that direct transfer from 3D pre-trained models suffers from overfitting and a modality gap. It proposes a two-stage Align then Adapt process where optimal transport first quantifies and reduces the 3D-4D distribution difference via a point align embedder, then a frozen 3D backbone receives a point-video adapter and spatial-context encoder for temporal capacity. This setup allows models without built-in temporal knowledge to handle dynamic scenes at low parameter cost. A sympathetic reader would care because it offers a practical route to strong 4D performance in robotics without requiring massive new 4D datasets or full retraining.

Core claim

PointATA decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory quantifies the distributional discrepancy between 3D and 4D datasets so that a point align embedder can be trained in stage one to close the modality gap. In stage two an efficient point-video adapter and a spatial-context encoder are added to the frozen 3D backbone to supply temporal modeling while avoiding overfitting. With these designs a pre-trained 3D model without temporal knowledge can reason about dynamic video content at lower parameter cost than prior work, matching or exceeding full fine-tuning on action recognition, action segmentation, and semantic segmentation.
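The review does not reproduce the exact Stage 1 objective, so the following is only a minimal sketch of how a Sinkhorn-style optimal-transport distance between batches of 3D and 4D features could be computed and minimized to train an align embedder, in the spirit of Cuturi's entropic OT [40]. Function and tensor names (e.g. `align_embedder`, `feats_3d`) are illustrative assumptions, not the paper's API.

```python
import torch

def sinkhorn_distance(x, y, eps=0.05, n_iters=50):
    """Entropic OT distance between two feature sets (illustrative sketch).

    x: (n, d) source features, e.g. tokens from the frozen 3D backbone
    y: (m, d) target features, e.g. point-align-embedder outputs for 4D clips
    """
    cost = torch.cdist(x, y, p=2) ** 2                      # pairwise squared-Euclidean cost (n, m)
    a = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)   # uniform source marginal
    b = torch.full((y.size(0),), 1.0 / y.size(0), device=y.device)   # uniform target marginal
    K = torch.exp(-cost / eps)                               # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                                 # Sinkhorn iterations
        v = b / (K.t() @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    transport = torch.diag(u) @ K @ torch.diag(v)            # approximate transport plan
    return (transport * cost).sum()                          # transport cost as the modality-gap measure

# Stage 1 (sketch): update only the align embedder to pull 4D features
# toward the frozen 3D feature distribution.
# loss = sinkhorn_distance(feats_3d.detach(), align_embedder(clip_4d))
# loss.backward(); optimizer.step()
```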

What carries the argument

The Align then Adapt paradigm: optimal transport measures the 3D-4D gap to train the point align embedder in stage one, while the point-video adapter and spatial-context encoder supply temporal modeling in stage two on a frozen backbone.
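No implementation details are given in this review for the Stage 2 modules, so the block below is a schematic of the setup the paragraph describes: the 3D backbone stays frozen and a small bottleneck adapter supplies temporal mixing at low parameter cost. `PointVideoAdapter`, its dimensions, and the frame layout are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PointVideoAdapter(nn.Module):
    """Bottleneck adapter sketch: cheap temporal mixing across frames on top of frozen features."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        # Depthwise 1-D conv over the frame axis as a lightweight temporal mixer.
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):                       # x: (B, T, N_tokens, dim)
        B, T, N, D = x.shape
        h = self.down(x)                        # (B, T, N, hidden)
        h = h.permute(0, 2, 3, 1).reshape(B * N, -1, T)
        h = self.temporal(h)                    # mix information across frames
        h = h.reshape(B, N, -1, T).permute(0, 3, 1, 2)
        return x + self.up(torch.relu(h))       # residual adapter output

# Stage 2 (sketch): freeze the pre-trained 3D backbone, train only adapters and task head.
# for p in backbone.parameters():
#     p.requires_grad = False
```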

If this is right

  • A pre-trained 3D model reaches 97.21 percent accuracy on 3D action recognition after the two-stage transfer.
  • The method improves 4D action segmentation by 8.7 percent over earlier parameter-efficient baselines.
  • 4D semantic segmentation accuracy reaches 84.06 percent while training far fewer parameters than full fine-tuning.
  • The frozen backbone plus lightweight adapters reduces overfitting risk relative to unfrozen adaptation on video data.
  • The approach matches or exceeds full fine-tuning performance on 4D tasks at substantially lower parameter cost.
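The parameter-cost claims in the last two bullets are the kind of thing a reader can audit directly once the model is instantiated; a minimal sketch of counting trainable versus total parameters under the frozen-backbone setup (the model name is hypothetical):

```python
def parameter_budget(model):
    """Report trainable vs. total parameters (in millions) for a frozen-backbone model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / 1e6, total / 1e6, 100.0 * trainable / max(total, 1)

# Hypothetical usage after freezing the backbone and adding adapters:
# trainable_m, total_m, pct = parameter_budget(pointata_model)
# print(f"{trainable_m:.2f}M trainable of {total_m:.2f}M ({pct:.1f}%)")
```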

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Optimal transport alignment may extend to other 3D-to-4D transfers such as medical or driving scenes.
  • If the align stage generalizes, the volume of 4D data needed for competitive robotics models could drop.
  • Applying the same two-stage split to other modality gaps, such as image-to-video or static-to-dynamic, would be a direct test.

Load-bearing premise

The distributional discrepancy between 3D and 4D datasets can be effectively quantified by optimal transport and alleviated by the point align embedder without degrading the pre-trained 3D features or introducing new overfitting risks.

What would settle it

An experiment in which the point align embedder increases the measured modality gap or produces lower 4D task accuracy than a direct-adaptation baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.23069 by Chaoyi Lu, Haozhe Cheng, Jihua Zhu, Lin Chen, Yaonan Wang, Yiding Sun, Zhichuan Yang.

Figure 1
Figure 1: Current 4D PETL methods face two limits. Upper panel: adapters are feasible, yet current methods expose models to severe overfitting. Lower panel: cross-modal transfer needs prior alignment; the practice is mature in 2D vision and NLP, but to the authors' knowledge no 4D PETL study measures this gap, and ignoring it hurts downstream performance. …
Figure 2
Figure 2: Comparison of 3D full fine-tuning, 4D full fine-tuning, 4D adapter tuning, and PointATA. Because the embedder required for 4D encoding is often heavier than its 3D counterpart, a 4D dynamic perception task can require updating even more parameters (over 100%) than full 3D fine-tuning. PointATA saves substantial resources and time relative to full fine-tuning and sig…
Figure 3
Figure 3: PointATA employs a two-stage workflow to quickly adapt large 3D pre-trained models to diverse 4D downstream tasks. In Stage 1, it obtains 4D features via the P4D embedder, whose weights are randomly initialized, and learns by minimizing the distribution distance to the 3D source features. In Stage 2, it jointly fine-tunes the P4D embedder and the PVA to minimize the task loss. The 3D backbone remains fr…
Figure 4
Figure 4: Visualization of action segmentation. P4Transformer has a serious over-segmentation problem.
Figure 5
Figure 5: 4D semantic segmentation visualization on the Synthia 4D dataset. Key points are demarcated by red dashed circular bounding boxes.
Figure 6
Figure 6: Ablation studies of PointATA under various experimental settings.
Figure 10
Figure 10: t-SNE visualization showing that aligning static and dynamic clouds with the Point Align Embedder is essential. Smaller indices indicate tighter clusters. After alignment, intra-class distances shrink and inter-class gaps widen.
read the original abstract

Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-stage 'Align then Adapt' (PointATA) paradigm for parameter-efficient transfer of 3D pre-trained models to 4D point cloud video tasks. Stage 1 uses optimal transport to quantify and close the 3D-4D distributional gap via a trainable point align embedder. Stage 2 freezes the backbone and adds a lightweight point-video adapter plus spatial-context encoder to enable temporal reasoning without overfitting. The central claim is that this yields performance matching or exceeding full fine-tuning at lower parameter cost, with reported results of 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.

Significance. If the empirical claims hold under rigorous validation, the work would be significant for robotics and 4D perception by providing a scalable way to leverage abundant 3D pre-trained models on scarce 4D data, explicitly addressing modality gap and overfitting through a principled two-stage design that maintains parameter efficiency.

major comments (2)
  1. Abstract: the concrete performance numbers (97.21% accuracy, +8.7% segmentation) are presented without any description of the baselines, ablation controls, statistical tests, or train/validation/test splits used; these details are load-bearing for the central claim that PointATA matches or exceeds full fine-tuning.
  2. Method section (point align embedder and Stage 1): the claim that optimal transport plus the embedder alleviates the modality gap without degrading frozen 3D features or introducing new overfitting risks is central to the paradigm but lacks targeted ablation evidence showing feature preservation and generalization on held-out 4D data.
minor comments (2)
  1. Ensure all experimental tables report parameter counts, FLOPs, and training time alongside accuracy to substantiate the 'smaller parameter cost' advantage.
  2. Clarify the exact formulation of the optimal transport objective and the training protocol for the point align embedder (e.g., whether it is trained on paired 3D-4D samples or unpaired).
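For context on the second minor comment: Sinkhorn-type methods [40] solve the standard entropic optimal-transport problem, and whether the paper uses exactly this regularized form, and on paired or unpaired samples, is what the comment asks the authors to state. A generic form, not necessarily the paper's exact objective, is:

```latex
% Entropic OT between empirical 3D and 4D feature distributions with marginals a, b
% and cost matrix C (C_{ij} = cost between a 3D feature and a 4D feature).
\min_{P \in U(a,b)} \; \langle P, C \rangle - \varepsilon H(P),
\qquad U(a,b) = \{ P \in \mathbb{R}_{+}^{n \times m} : P\mathbf{1}_m = a,\; P^{\top}\mathbf{1}_n = b \},
\qquad H(P) = -\sum_{ij} P_{ij} (\log P_{ij} - 1).
```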

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: Abstract: the concrete performance numbers (97.21% accuracy, +8.7% segmentation) are presented without any description of the baselines, ablation controls, statistical tests, or train/validation/test splits used; these details are load-bearing for the central claim that PointATA matches or exceeds full fine-tuning.

    Authors: We agree that the abstract would benefit from additional context to support the reported numbers. In the revised manuscript, we have added a brief clause indicating that the results are compared against full fine-tuning and prior parameter-efficient transfer methods on standard 4D benchmarks using established train/test splits. Detailed baselines, ablation studies, and split information remain in Sections 4 and 5. This change improves clarity without exceeding abstract length constraints. revision: yes

  2. Referee: Method section (point align embedder and Stage 1): the claim that optimal transport plus the embedder alleviates the modality gap without degrading frozen 3D features or introducing new overfitting risks is central to the paradigm but lacks targeted ablation evidence showing feature preservation and generalization on held-out 4D data.

    Authors: The current manuscript provides ablations on the point align embedder's contribution to performance gains (see Table 3 and Figure 4). However, we acknowledge that more targeted evidence on feature preservation would strengthen the central claim. In the revision, we will add a new subsection with quantitative analysis, including feature similarity metrics (e.g., MMD and cosine similarity) computed before and after alignment on held-out 4D sequences, along with t-SNE visualizations and generalization results on unseen video clips. This will explicitly demonstrate that 3D features are preserved and overfitting risks are mitigated. revision: yes
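The feature-preservation analysis the rebuttal proposes (MMD and cosine similarity before and after alignment) is straightforward for a reader to reproduce; below is a minimal sketch with an RBF-kernel MMD estimator in the spirit of Gretton et al. [69]. All tensor names are illustrative, and the bandwidth choice is an assumption.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel between feature sets x: (n, d) and y: (m, d)."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b, p=2) ** 2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def mean_cosine_similarity(x, y):
    """Average cosine similarity between row-wise matched feature pairs."""
    return torch.nn.functional.cosine_similarity(x, y, dim=-1).mean()

# Hypothetical usage: compare the modality gap before and after Stage 1 alignment.
# gap_before = rbf_mmd2(feats_3d, feats_4d_raw)
# gap_after  = rbf_mmd2(feats_3d, align_embedder(feats_4d_raw))
```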

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical two-stage method (OT-based alignment via point align embedder in Stage 1, followed by lightweight adapters and spatial-context encoder on a frozen 3D backbone in Stage 2). No equations, derivations, or performance claims reduce to quantities defined solely by fitted parameters or self-citations internal to the paper. Optimal transport is invoked as standard external theory to quantify modality gap, and architectural choices are motivated by stated overfitting and distributional issues without creating self-definitional loops or renaming known results as novel predictions. The reported gains (e.g., 97.21% accuracy) are framed as outcomes of the proposed engineering designs rather than forced by construction from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method introduces new modules (point align embedder, point-video adapter) whose internal hyperparameters or assumptions are not detailed.

pith-pipeline@v0.9.0 · 5588 in / 1155 out tokens · 35040 ms · 2026-05-15T18:42:25.530106+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diffusion Masked Pretraining for Dynamic Point Cloud

    cs.CV 2026-05 unverdicted novelty 7.0

    DiMP applies diffusion modeling to masked pretraining of dynamic point clouds to remove positional leakage and capture motion uncertainty, yielding 11.21% and 13.65% gains on offline and online action segmentation.

  2. Diffusion Masked Pretraining for Dynamic Point Cloud

    cs.CV 2026-05 unverdicted novelty 7.0

    DiMP uses diffusion to infer clean masked positions from visible context and to model full distributions of point displacements rather than means, delivering 11.21% and 13.65% absolute gains on offline and online acti...

  3. Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mantis is the first Mamba-native PEFT framework for 3D point cloud models that injects task signals into state-space updates via State-Aware Adapters and regularizes serialization with Dual-Serialization Consistency D...

  4. Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mantis is the first Mamba-native PEFT framework for 3D point cloud models, using state-aware adapters and dual-serialization distillation to match performance with only 5% trainable parameters.

  5. CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 3 Pith papers · 1 internal anchor

  1. [1] Y. Sun, H. Cheng, C. Lu, Z. Li, M. Wu, H. Lu, and J. Zhu, "Hyper-point: Multimodal 3d foundation model in hyperbolic space," Pattern Recognit., vol. 173, p. 112800, 2026.
  2. [2] Y. Liu, J. Chen, Z. Zhang, J. Huang, and L. Yi, "Leaf: Learning frames for 4d point cloud sequence understanding," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 604–613.
  3. [3] J. Liu, J. Han, L. Liu, A. I. Aviles-Rivero, C. Jiang, Z. Liu, and H. Wang, "Mamba4d: Efficient 4d point cloud video understanding with disentangled spatial-temporal state space models," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025, pp. 17626–17636.
  4. [4] Z. Shen, X. Sheng, L. Wang, Y. Guo, Q. Liu, and X. Zhou, "Pointcmp: Contrastive mask prediction for self-supervised learning on point cloud videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 1212–1222.
  5. [5] Z. Deng, X. Li, X. Li, Y. Tong, S. Zhao, and M. Liu, "Vg4d: Vision-language model goes 4d video recognition," in IEEE Int. Conf. Robot. Autom., IEEE, 2024, pp. 5014–5020.
  6. [6] Z. Shen, X. Sheng, H. Fan, L. Wang, Y. Guo, Q. Liu, H. Wen, and X. Zhou, "Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos," in Proc. IEEE/CVF Int. Conf. Comput. Vis., October 2023, pp. 16580–16589.
  7. [7] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., "Shapenet: An information-rich 3d model repository," arXiv preprint arXiv:1512.03012, 2015.
  8. [8] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, "Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1588–1597.
  9. [9] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3d shapenets: A deep representation for volumetric shapes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1912–1920.
  10. [10] B. Lv, Y. Zha, T. Dai, X. Yuerong, K. Chen, and S.-T. Xia, "Adapting pre-trained 3d models for point cloud video understanding via cross-frame spatio-temporal perception," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2025, pp. 12413–12422.
  11. [11] W. Ma, S. Li, L. Cai, and J. Kang, "Learning modality knowledge alignment for cross-modality transfer," in Proc. Int. Conf. Mach. Learn., PMLR, 2024, pp. 33777–33793.
  12. [12] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021.
  13. [13] J. Pan, Z. Lin, X. Zhu, J. Shao, and H. Li, "St-adapter: Parameter-efficient image-to-video transfer learning," Proc. Adv. Neural Inf. Process. Syst., vol. 35, pp. 26462–26477, 2022.
  14. [14] T. Yang, Y. Zhu, Y. Xie, A. Zhang, C. Chen, and M. Li, "Aim: Adapting image models for efficient video action recognition," in Proc. Int. Conf. Learn. Representations, 2023.
  15. [15] D. Alvarez-Melis and N. Fusi, "Geometric dataset distances via optimal transport," Proc. Adv. Neural Inf. Process. Syst., vol. 33, pp. 21428–21439, 2020.
  16. [16] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660.
  17. [17] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3d points," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshop, 2010, pp. 9–14.
  18. [18] Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi, "Hoi4d: A 4d egocentric dataset for category-level human-object interaction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2022, pp. 21013–21022.
  19. [19] C. Choy, J. Gwak, and S. Savarese, "4d spatio-temporal convnets: Minkowski convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3075–3084.
  20. [20] T. Weng, J. Xiao, F. Yan, and H. Jiang, "Context-aware 3d point cloud semantic segmentation with plane guidance," IEEE Trans. Multimedia, vol. 25, pp. 6653–6664, 2023.
  21. [21] C. Sun, Z. Zheng, X. Wang, M. Xu, and Y. Yang, "Self-supervised point cloud representation learning via separating mixed shapes," IEEE Trans. Multimedia, vol. 25, pp. 6207–6218, 2023.
  22. [22] X.-F. Han, Y.-F. Jin, H.-X. Cheng, and G.-Q. Xiao, "Dual transformer for point cloud analysis," IEEE Trans. Multimedia, vol. 25, pp. 5638–5648, 2023.
  23. [23] Y. Wu, J. Liu, M. Gong, Z. Liu, Q. Miao, and W. Ma, "Mpct: Multiscale point cloud transformer with a residual network," IEEE Trans. Multimedia, vol. 26, pp. 3505–3516, 2024.
  24. [24] N. Hu, H. Cheng, Y. Xie, P. Shi, and J. Zhu, "Hyperbolic image-and-pointcloud contrastive learning for 3d classification," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., IEEE, 2024, pp. 4973–4979.
  25. [25] H. Cheng, J. Zhu, J. Lu, and X. Han, "EDGCNet: Joint dynamic hyperbolic graph convolution and dual squeeze-and-attention for 3D point cloud segmentation," Expert Syst. Appl., vol. 237, p. 121551, 2024.
  26. [26] S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany, "Pointcontrast: Unsupervised pre-training for 3d point cloud understanding," in Eur. Conf. Comput. Vis., 2020.
  27. [27] M. Afham, I. Dissanayake, D. Dissanayake, A. Dharmasiri, K. Thilakarathna, and R. Rodrigo, "Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 9902–9912.
  28. [28] Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan, "Masked autoencoders for point cloud self-supervised learning," in Eur. Conf. Comput. Vis., 2022, pp. 604–621.
  29. [29] G. Chen, M. Wang, Y. Yang, K. Yu, L. Yuan, and Y. Yue, "Pointgpt: Auto-regressively generative pre-training from point clouds," Proc. Adv. Neural Inf. Process. Syst., vol. 36, 2024.
  30. [30] X. Liu, M. Yan, and J. Bohg, "Meteornet: Deep learning on dynamic 3d point cloud sequences," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 9245–9254.
  31. [31] H. Fan, Y. Yang, and M. Kankanhalli, "Point 4d transformer networks for spatio-temporal modeling in point cloud videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021.
  32. [32] H. Wen, Y. Liu, J. Huang, B. Duan, and L. Yi, "Point primitive transformer for long-term 4d point cloud video understanding," in Eur. Conf. Comput. Vis., Springer, 2022, pp. 19–35.
  33. [33] Z. Zhang, Y. Dong, Y. Liu, and L. Yi, "Complete-to-partial 4d distillation for self-supervised point cloud sequence representation learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 17661–17670.
  34. [34] Z. Shen, X. Sheng, H. Fan, L. Wang, Y. Guo, Q. Liu, H. Wen, and X. Zhou, "Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 16580–16589.
  35. [35] J. Shen, L. Li, L. M. Dery, C. Staten, M. Khodak, G. Neubig, and A. Talwalkar, "Cross-modal fine-tuning: Align then refine," in Proc. Int. Conf. Mach. Learn., PMLR, 2023, pp. 31030–31056.
  36. [36] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, "Learning to prompt for vision-language models," Int. J. Comput. Vis., vol. 130, no. 9, pp. 2337–2348, 2022.
  37. [37] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
  38. [38] Y. Zha, J. Wang, T. Dai, B. Chen, Z. Wang, and S.-T. Xia, "Instance-aware dynamic prompt tuning for pre-trained point cloud models," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 14161–14170.
  39. [39] X. Zhou, D. Liang, W. Xu, X. Zhu, Y. Xu, Z. Zou, and X. Bai, "Dynamic adapter meets prompt tuning: Parameter-efficient transfer learning for point cloud analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 14707–14717.
  40. [40] M. Cuturi, "Sinkhorn distances: lightspeed computation of optimal transport," in Proc. Adv. Neural Inf. Process. Syst., 2013.
  41. [41] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, "Parameter-efficient transfer learning for NLP," in Int. Conf. Mach. Learn., vol. 97, PMLR, 09–15 Jun 2019, pp. 2790–2799.
  42. [42] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, "Point-bert: Pre-training 3d point cloud transformers with masked point modeling," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022.
  43. [43] X. Liu, M. Yan, and J. Bohg, "Meteornet: Deep learning on dynamic 3d point cloud sequences," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019.
  44. [44] H. Fan, X. Yu, Y. Ding, Y. Yang, and M. Kankanhalli, "Pstnet: point spatio-temporal convolution on point cloud sequences," in Proc. Int. Conf. Learn. Representations, 2021.
  45. [45] J.-X. Zhong, K. Zhou, Q. Hu, B. Wang, N. Trigoni, and A. Markham, "No pain, big gain: classify dynamic point cloud sequences with static models by fitting feature-level space-time surfaces," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8510–8520.
  46. [46] H. Fan, Y. Yang, and M. Kankanhalli, "Point spatio-temporal transformer networks for point cloud video modeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2181–2192, 2023.
  47. [47] L. Jing, Y. Xue, X. Yan, C. Zheng, D. Wang, R. Zhang, Z. Wang, H. Fang, B. Zhao, and Z. Li, "X4d-sceneformer: Enhanced scene understanding on 4d point cloud videos through cross-modal knowledge transfer," in Proc. AAAI Conf. Artif. Intell., vol. 38, no. 3, 2024, pp. 2670–2678.
  48. [48] X. Sheng, Z. Shen, and G. Xiao, "Contrastive predictive autoencoders for dynamic point cloud self-supervised learning," in Proc. AAAI Conf. Artif. Intell., vol. 37, no. 8, 2023, pp. 9802–9810.
  49. [49] X. Sheng, Z. Shen, G. Xiao, L. Wang, Y. Guo, and H. Fan, "Point contrastive prediction with semantic clustering for self-supervised learning on point cloud videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 16515–16524.
  50. [50] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "Ntu rgb+d: A large scale dataset for 3d human activity analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1010–1019.
  51. [51] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. Adv. Neural Inf. Process. Syst., 2017.
  52. [52] Y. Wang, Y. Xiao, F. Xiong, W. Jiang, Z. Cao, J. T. Zhou, and J. Yuan, "3dv: 3d dynamic voxel for action recognition in depth video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 508–517.
  53. [53] Y. Liu, C. Chen, Z. Wang, and L. Yi, "Crossvideo: Self-supervised cross-modal contrastive learning for point cloud video understanding," in IEEE Int. Conf. Robot. Autom., 2024, pp. 12436–12442.
  54. [54] Y. Min, Y. Zhang, X. Chai, and X. Chen, "An efficient pointlstm for point clouds based gesture recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5760–5769.
  55. [55] X. Liu, C. R. Qi, and L. J. Guibas, "Flownet3d: Learning scene flow in 3d point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 529–537.
  56. [56] G. Puy, A. Boulch, and R. Marlet, "Flot: Scene flow on point clouds guided by optimal transport," in Eur. Conf. Comput. Vis., Springer, 2020, pp. 527–544.
  57. [57] B. Ouyang and D. Raviv, "Occlusion guided scene flow estimation on 3d point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2805–2814.
  58. [58] H. Wang, J. Pang, M. A. Lodhi, Y. Tian, and D. Tian, "Festa: Flow estimation via spatial-temporal attention for scene point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14173–14182.
  59. [59] G. Wang, Y. Hu, Z. Liu, Y. Zhou, M. Tomizuka, W. Zhan, and H. Wang, "What matters for 3d scene flow network," in Eur. Conf. Comput. Vis., Springer, 2022, pp. 38–55.
  60. [60] W. Cheng and J. H. Ko, "Bi-pointflownet: Bidirectional learning for point cloud based scene flow estimation," in Eur. Conf. Comput. Vis., Springer, 2022, pp. 108–124.
  61. [61] I. Lang, D. Aiger, F. Cole, S. Avidan, and M. Rubinstein, "Scoop: Self-supervised correspondence and optimization-based scene flow," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 5281–5290.
  62. [62] W. Cheng and J. H. Ko, "Multi-scale bidirectional recurrent network with hybrid correlation for point cloud based scene flow estimation," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 10007–10016.
  63. [63] J. Liu, G. Wang, W. Ye, C. Jiang, J. Han, Z. Liu, G. Zhang, D. Du, and H. Wang, "Difflow3d: Toward robust uncertainty-aware scene flow estimation with diffusion model," arXiv preprint arXiv:2311.17456, 2023.
  64. [64] Q. de Smedt, H. Wannous, J.-P. Vandeborre, J. Guerry, B. Le Saux, and D. Filliat, "SHREC'17 Track: 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset," in Proc. 10th Eurographics Workshop on 3D Object Retrieval, Apr. 2017, pp. 1–6.
  65. [65] M. Menze and A. Geiger, "Object scene flow for autonomous vehicles," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3061–3070.
  66. [66] H. Cheng, J. Zhu, N. Hu, J. Chen, and W. Yan, "Ptm: Torus masking for 3d representation learning guided by robust and trusted teachers," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 12, pp. 12158–12170, 2024.
  67. [67] R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. Wang, Y. Qiao, and H. Li, "Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training," in Proc. Adv. Neural Inf. Process. Syst., 2022.
  68. [68] E. D. Dolan and J. J. Moré, "Benchmarking optimization software with performance profiles," Math. Program., vol. 91, no. 2, pp. 201–213, 2002.
  69. [69] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, "A kernel method for the two-sample-problem," Adv. Neural Inf. Process. Syst., vol. 19, 2006.
  70. [70] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, "Similarity of neural network representations revisited," in Proc. Int. Conf. Mach. Learn., PMLR, 2019, pp. 3519–3529.
  71. [71] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, "Visual prompt tuning," in Eur. Conf. Comput. Vis., Springer, 2022, pp. 709–727.
  72. [72] D. Liang, T. Feng, X. Zhou, Y. Zhang, Z. Zou, and X. Bai, "Parameter-efficient fine-tuning in spectral domain for point cloud learning," IEEE Trans. Pattern Anal. Mach. Intell., 2025.