Recognition: 2 theorem links
Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception
Pith reviewed 2026-05-15 18:42 UTC · model grok-4.3
The pith
PointATA transfers pre-trained 3D models to 4D point cloud video tasks by first aligning distributions with optimal transport, then adding lightweight temporal adapters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PointATA decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory quantifies the distributional discrepancy between 3D and 4D datasets so that a point align embedder can be trained in stage one to close the modality gap. In stage two, an efficient point-video adapter and a spatial-context encoder are added to the frozen 3D backbone to supply temporal modeling while avoiding overfitting. With these designs, a pre-trained 3D model without temporal knowledge can reason about dynamic video content at lower parameter cost than prior work, matching or exceeding full fine-tuning on action recognition, action segmentation, and semantic segmentation.
What carries the argument
The Align then Adapt paradigm: optimal transport measures the 3D-4D gap to train the point align embedder in stage one, while the point-video adapter and spatial-context encoder supply temporal modeling in stage two on a frozen backbone.
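The paper invokes optimal transport to quantify the 3D-4D discrepancy but does not state its exact objective here. As a sketch of how an entropic (Sinkhorn-style) OT distance can score the gap between two feature sets, here is a minimal pure-Python version; the uniform weights, the eps and iteration settings, and the toy 2-D features are illustrative assumptions, not the paper's formulation.

```python
import math

def sinkhorn_cost(xs, ys, eps=0.1, iters=200):
    """Entropic-regularized OT cost between two point sets with uniform
    weights. Illustrative sketch: eps and iters are arbitrary choices."""
    n, m = len(xs), len(ys)
    # Pairwise squared-Euclidean cost matrix between the two feature sets.
    C = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]
    K = [[math.exp(-c / eps) for c in row] for row in C]
    u = [1.0 / n] * n
    v = [1.0 / m] * m
    # Alternating Sinkhorn scaling updates toward the marginal constraints.
    for _ in range(iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j; return its cost <P, C>.
    return sum(u[i] * K[i][j] * v[j] * C[i][j]
               for i in range(n) for j in range(m))
```

A larger value for mismatched sets than for matched ones is the behavior a gap measure needs; in Stage 1 the embedder would be trained to drive such a value down.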
If this is right
- A pre-trained 3D model reaches 97.21 percent accuracy on 3D action recognition after the two-stage transfer.
- The method improves 4D action segmentation by 8.7 percent over earlier parameter-efficient baselines.
- 4D semantic segmentation accuracy reaches 84.06 percent while training far fewer parameters than full fine-tuning.
- The frozen backbone plus lightweight adapters reduces overfitting risk relative to unfrozen adaptation on video data.
- The approach matches or exceeds full fine-tuning performance on 4D tasks at substantially lower parameter cost.
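A back-of-the-envelope count makes the parameter-cost claim concrete: a bottleneck adapter (down-projection plus up-projection) adds far fewer weights than the transformer block it attaches to. The dimensions below (d = 384, bottleneck 32) are illustrative guesses, not PointATA's actual configuration.

```python
def linear_params(d_in, d_out, bias=True):
    # Weight matrix plus optional bias vector.
    return d_in * d_out + (d_out if bias else 0)

def block_params(d=384, ffn_mult=4):
    # Toy transformer block: Q, K, V and output projections plus a
    # two-layer feed-forward network (layer norms ignored).
    attn = 4 * linear_params(d, d)
    ffn = linear_params(d, ffn_mult * d) + linear_params(ffn_mult * d, d)
    return attn + ffn

def adapter_params(d=384, bottleneck=32):
    # Bottleneck adapter: down-project to a small dimension, then up-project.
    return linear_params(d, bottleneck) + linear_params(bottleneck, d)
```

Under these assumptions the adapter is under 2 percent of one block's weights, which is the rough scale at which "lower parameter cost than full fine-tuning" claims operate.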
Where Pith is reading between the lines
- Optimal transport alignment may extend to other 3D-to-4D transfers such as medical or driving scenes.
- If the align stage generalizes, the volume of 4D data needed for competitive robotics models could drop.
- Applying the same two-stage split to other modality gaps, such as image-to-video or static-to-dynamic, would be a direct test.
Load-bearing premise
The distributional discrepancy between 3D and 4D datasets can be effectively quantified by optimal transport and alleviated by the point align embedder without degrading the pre-trained 3D features or introducing new overfitting risks.
What would settle it
An experiment in which the point align embedder increases the measured modality gap or produces lower 4D task accuracy than a direct-adaptation baseline would falsify the central claim.
Figures
Original abstract
Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage 'Align then Adapt' (PointATA) paradigm for parameter-efficient transfer of 3D pre-trained models to 4D point cloud video tasks. Stage 1 uses optimal transport to quantify and close the 3D-4D distributional gap via a trainable point align embedder. Stage 2 freezes the backbone and adds a lightweight point-video adapter plus spatial-context encoder to enable temporal reasoning without overfitting. The central claim is that this yields performance matching or exceeding full fine-tuning at lower parameter cost, with reported results of 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
Significance. If the empirical claims hold under rigorous validation, the work would be significant for robotics and 4D perception by providing a scalable way to leverage abundant 3D pre-trained models on scarce 4D data, explicitly addressing modality gap and overfitting through a principled two-stage design that maintains parameter efficiency.
major comments (2)
- Abstract: the concrete performance numbers (97.21% accuracy, +8.7% segmentation) are presented without any description of the baselines, ablation controls, statistical tests, or train/validation/test splits used; these details are load-bearing for the central claim that PointATA matches or exceeds full fine-tuning.
- Method section (point align embedder and Stage 1): the claim that optimal transport plus the embedder alleviates the modality gap without degrading frozen 3D features or introducing new overfitting risks is central to the paradigm but lacks targeted ablation evidence showing feature preservation and generalization on held-out 4D data.
minor comments (2)
- Ensure all experimental tables report parameter counts, FLOPs, and training time alongside accuracy to substantiate the 'smaller parameter cost' advantage.
- Clarify the exact formulation of the optimal transport objective and the training protocol for the point align embedder (e.g., whether it is trained on paired 3D-4D samples or unpaired).
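One plausible shape for the Stage 1 protocol this comment asks about, assuming unpaired samples and a scalar gap objective: train the embedder to minimize a measured discrepancy between embedded 4D features and frozen 3D features. The sketch below is a deliberately crude stand-in, a 1-D affine embedder fitted by finite-difference descent on a mean-and-variance mismatch rather than a true OT loss; every name and setting here is hypothetical.

```python
def gap(xs, ys):
    # Crude distribution gap: squared mean difference plus squared
    # standard-deviation difference (a stand-in for an OT or MMD gap).
    mx = sum(xs) / len(xs); my = sum(ys) / len(ys)
    sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
    return (mx - my) ** 2 + (sx - sy) ** 2

def fit_affine_embedder(src, tgt, lr=0.05, steps=500, h=1e-4):
    """Fit y = a*x + b so that gap(embedded src, tgt) is minimized,
    using central finite differences in place of backprop."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        def loss(a_, b_):
            return gap([a_ * x + b_ for x in src], tgt)
        ga = (loss(a + h, b) - loss(a - h, b)) / (2 * h)
        gb = (loss(a, b + h) - loss(a, b - h)) / (2 * h)
        a -= lr * ga
        b -= lr * gb
    return a, b
```

The point of the sketch is the protocol shape, not the metric: the gap is computed on unpaired batches, and only the embedder's parameters move while the "target" features stay fixed, mirroring a frozen backbone.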
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.
Point-by-point responses
-
Referee: Abstract: the concrete performance numbers (97.21% accuracy, +8.7% segmentation) are presented without any description of the baselines, ablation controls, statistical tests, or train/validation/test splits used; these details are load-bearing for the central claim that PointATA matches or exceeds full fine-tuning.
Authors: We agree that the abstract would benefit from additional context to support the reported numbers. In the revised manuscript, we have added a brief clause indicating that the results are compared against full fine-tuning and prior parameter-efficient transfer methods on standard 4D benchmarks using established train/test splits. Detailed baselines, ablation studies, and split information remain in Sections 4 and 5. This change improves clarity without exceeding abstract length constraints. revision: yes
-
Referee: Method section (point align embedder and Stage 1): the claim that optimal transport plus the embedder alleviates the modality gap without degrading frozen 3D features or introducing new overfitting risks is central to the paradigm but lacks targeted ablation evidence showing feature preservation and generalization on held-out 4D data.
Authors: The current manuscript provides ablations on the point align embedder's contribution to performance gains (see Table 3 and Figure 4). However, we acknowledge that more targeted evidence on feature preservation would strengthen the central claim. In the revision, we will add a new subsection with quantitative analysis, including feature similarity metrics (e.g., MMD and cosine similarity) computed before and after alignment on held-out 4D sequences, along with t-SNE visualizations and generalization results on unseen video clips. This will explicitly demonstrate that 3D features are preserved and overfitting risks are mitigated. revision: yes
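The feature-preservation metrics the response names can be sketched in a few lines: a biased RBF-kernel squared MMD between feature sets, and cosine similarity between individual feature vectors. The kernel bandwidth gamma and the toy 2-D vectors are arbitrary stand-ins for real backbone features.

```python
import math

def rbf(a, b, gamma=0.5):
    # RBF kernel on two feature vectors; gamma is an arbitrary bandwidth.
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def mmd2(X, Y, gamma=0.5):
    """Biased squared Maximum Mean Discrepancy between two feature sets."""
    kxx = sum(rbf(x, x2, gamma) for x in X for x2 in X) / (len(X) ** 2)
    kyy = sum(rbf(y, y2, gamma) for y in Y for y2 in Y) / (len(Y) ** 2)
    kxy = sum(rbf(x, y, gamma) for x in X for y in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy

def cosine(a, b):
    # Cosine similarity between two individual feature vectors.
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)
```

Computed on backbone features before and after alignment, a rise in per-sample cosine similarity alongside a falling MMD on held-out 4D data would be the quantitative evidence the referee asks for.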
Circularity Check
No significant circularity identified
Full rationale
The paper presents an empirical two-stage method (OT-based alignment via point align embedder in Stage 1, followed by lightweight adapters and spatial-context encoder on a frozen 3D backbone in Stage 2). No equations, derivations, or performance claims reduce to quantities defined solely by fitted parameters or self-citations internal to the paper. Optimal transport is invoked as standard external theory to quantify modality gap, and architectural choices are motivated by stated overfitting and distributional issues without creating self-definitional loops or renaming known results as novel predictions. The reported gains (e.g., 97.21% accuracy) are framed as outcomes of the proposed engineering designs rather than forced by construction from inputs.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
unclear · Relation between the paper passage and the cited Recognition theorem.
Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
unclear · Relation between the paper passage and the cited Recognition theorem.
PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
-
Diffusion Masked Pretraining for Dynamic Point Cloud
DiMP applies diffusion modeling to masked pretraining of dynamic point clouds to remove positional leakage and capture motion uncertainty, yielding 11.21% and 13.65% gains on offline and online action segmentation.
-
Diffusion Masked Pretraining for Dynamic Point Cloud
DiMP uses diffusion to infer clean masked positions from visible context and to model full distributions of point displacements rather than means, delivering 11.21% and 13.65% absolute gains on offline and online acti...
-
Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models
Mantis is the first Mamba-native PEFT framework for 3D point cloud models that injects task signals into state-space updates via State-Aware Adapters and regularizes serialization with Dual-Serialization Consistency D...
-
Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models
Mantis is the first Mamba-native PEFT framework for 3D point cloud models, using state-aware adapters and dual-serialization distillation to match performance with only 5% trainable parameters.
-
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning
CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.