Recognition: 2 theorem links · Lean Theorem
Diffusion Masked Pretraining for Dynamic Point Cloud
Pith reviewed 2026-05-12 03:44 UTC · model grok-4.3
The pith
DiMP applies diffusion to masked tube centers and inter-frame displacements to remove positional leakage and model full motion distributions in dynamic point cloud pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose Diffusion Masked Pretraining (DiMP), a self-supervised framework that introduces diffusion modeling into both positional inference and motion learning for dynamic point clouds. It applies forward diffusion noise exclusively to masked tube centers and predicts the clean centers from visible spatio-temporal context, thereby removing positional leakage while retaining clean temporal anchors. It further recasts point-wise inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations, driving the encoder to capture the full conditional distribution of motions rather than collapsing to deterministic means.
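To make the two objectives concrete, here is a minimal PyTorch sketch of how they could be wired together. Everything in it is an illustrative assumption rather than the authors' released implementation: the shapes, the linear noise schedule, and the stand-in heads `center_net` and `eps_net` are hypothetical.

```python
# Minimal sketch of DiMP's two diffusion objectives as described above.
# Shapes, schedule settings, and module names are assumptions, not the
# authors' code (see https://github.com/InitalZ/DiMP.git for the original).
import torch
import torch.nn.functional as F

T = 1000                                   # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # standard DDPM linear schedule
alpha_bar = torch.cumprod(1.0 - betas, 0)  # cumulative product \bar{alpha}_t

def diffuse(x0, t):
    """Forward diffusion q(x_t | x_0): x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

def dimp_losses(centers, mask, visible_ctx, center_net, eps_net, disp, cond):
    """centers: (B, N, 3) tube centers; mask: (B, N) bool, True = masked.
    disp: (B, P, 3) clean inter-frame displacements; cond: decoded features.
    center_net and eps_net are hypothetical stand-ins for the decoder heads."""
    B = centers.size(0)
    t = torch.randint(0, T, (B,))

    # 1) Positional objective: noise ONLY the masked tube centers, keep the
    #    visible coordinates clean as temporal anchors, then regress the
    #    clean masked centers (x0-prediction) from visible context.
    noisy, _ = diffuse(centers, t)
    noised_centers = torch.where(mask.unsqueeze(-1), noisy, centers)
    pred_centers = center_net(noised_centers, visible_ctx, t)
    loss_pos = F.mse_loss(pred_centers[mask], centers[mask])

    # 2) Motion objective: DDPM eps-prediction on point-wise displacements,
    #    conditioned on the decoded representations.
    disp_t, eps = diffuse(disp, t)
    loss_motion = F.mse_loss(eps_net(disp_t, t, cond), eps)

    return loss_pos + loss_motion
```

Note the asymmetry this sketch makes explicit: the positional branch predicts the clean signal (x0-prediction) so visible coordinates never receive noise, while the motion branch predicts the injected noise (eps-prediction), the standard DDPM parameterization.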
What carries the argument
Diffusion applied selectively to masked tube centers for positional inference combined with DDPM noise-prediction for conditioned inter-frame displacements.
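For reference, the generic DDPM noise-prediction objective that the motion term instantiates is the standard one below; the conditioning variable c stands for the decoded representations, and the paper's exact parameterization may differ.

$$
\mathcal{L}_{\text{motion}} = \mathbb{E}_{t,\,\Delta x_0,\,\epsilon \sim \mathcal{N}(0,I)}\left[\bigl\|\epsilon - \epsilon_\theta\bigl(\sqrt{\bar\alpha_t}\,\Delta x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t,\; c\bigr)\bigr\|^2\right]
$$

Here \(\Delta x_0\) is the clean point-wise inter-frame displacement and \(\bar\alpha_t\) the cumulative noise schedule. Minimizing this loss is a variational surrogate for maximizing a lower bound on \(\log p_\theta(\Delta x_0 \mid c)\), which is why it targets the full conditional motion distribution rather than its mean.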
If this is right
- Encoder representations transfer to offline action segmentation with an absolute accuracy gain of 11.21 percent over the backbone alone.
- Performance improves by 13.65 percent under causally constrained online inference settings.
- Visible coordinates remain clean temporal anchors because diffusion noise is applied only to masked centers.
- The encoder learns the full conditional distribution of plausible inter-frame motions instead of single deterministic estimates.
- The same unified framework supports both positional and motion objectives without separate proxy targets.
Where Pith is reading between the lines
- The same selective-diffusion pattern could be tested on masked pretraining for video or skeleton sequences where positional leakage is also common.
- Tasks that explicitly require uncertainty estimates, such as future trajectory prediction, might show larger benefits from the distributional motion objective than classification tasks.
- Running DiMP on point-cloud datasets with varying density or longer temporal spans would test whether the leakage and distribution fixes remain effective at scale.
- If the gains prove robust, DiMP-style pretraining could reduce the amount of labeled data needed for fine-tuning dynamic point cloud models.
Load-bearing premise
The downstream gains are caused specifically by the diffusion-based removal of leakage and the modeling of motion distributions rather than by other differences in training setup or dataset effects.
What would settle it
An ablation experiment that keeps all other training details fixed but reintroduces direct ground-truth positional embeddings or switches back to deterministic mean targets for displacements, showing that the reported accuracy gains disappear.
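A matched-ablation grid of this kind could be scripted as below. This is a sketch only, with hypothetical flag names, assuming every other hyperparameter lives in one shared config; the (False, False) cell corresponds to reintroducing ground-truth positional embeddings and deterministic mean displacement targets.

```python
# Hypothetical matched-ablation grid: toggle only the two diffusion
# components while all other training details stay fixed.
from itertools import product

BASE_CONFIG = dict(optimizer="adamw", epochs=200, mask_ratio=0.75)  # assumed values

for use_diff_pos, use_ddpm_motion in product([False, True], repeat=2):
    cfg = dict(
        BASE_CONFIG,
        diffusion_positional_inference=use_diff_pos,  # False = GT positional embeddings
        ddpm_motion_objective=use_ddpm_motion,        # False = deterministic mean targets
    )
    print(cfg)  # stand-in for launching one pretraining + fine-tuning run
```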
Original abstract
Dynamic point cloud pretraining is still dominated by masked reconstruction objectives. However, these objectives inherit two key limitations. Existing methods inject ground-truth tube centers as decoder positional embeddings, causing spatio-temporal positional leakage. Moreover, they supervise inter-frame motion with deterministic proxy targets that systematically discard distributional structure by collapsing multimodal trajectory uncertainty into conditional means. To address these limitations, we propose Diffusion Masked Pretraining (DiMP), a unified self-supervised framework for dynamic point clouds. DiMP introduces diffusion modeling into both positional inference and motion learning. It first applies forward diffusion noise only to masked tube centers, then predicts clean centers from visible spatio-temporal context. This removes positional leakage while preserving visible coordinates as clean temporal anchors. DiMP also reformulates point-wise inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations. This design drives the encoder to target the full conditional distribution of plausible motions under a variational surrogate, rather than collapsing to a single deterministic estimate. Extensive experiments demonstrate that DiMP consistently improves downstream accuracy over the backbone alone, with absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference. Code is available at https://github.com/InitalZ/DiMP.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Diffusion Masked Pretraining (DiMP), a self-supervised framework for dynamic point clouds that integrates diffusion modeling into masked pretraining. It applies forward diffusion only to masked tube centers and predicts clean centers from visible spatio-temporal context to eliminate positional leakage while retaining clean anchors. It further reformulates point-wise inter-frame displacement learning as a conditioned DDPM noise-prediction objective to capture the full conditional motion distribution rather than collapsing to deterministic means. Experiments report absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference relative to the backbone alone, with code released at https://github.com/InitalZ/DiMP.git.
Significance. If the reported gains can be isolated to the diffusion-based positional inference and noise-prediction components, the work offers a principled way to address leakage and distributional collapse in dynamic point cloud pretraining, which could improve representations for downstream video and 3D tasks. The release of reproducible code is a clear strength that supports verification and extension.
Major comments (1)
- [Abstract and Experiments] The headline claim attributes the 11.21% offline and 13.65% online gains specifically to the diffusion positional inference (removing leakage while preserving visible anchors) and the DDPM noise-prediction objective (modeling full conditional motion distributions). No description is given of matched ablations that hold optimizer schedule, total epochs, data augmentations, decoder architecture, and loss weighting fixed while toggling only these two diffusion elements. Without such controls, the improvements cannot be confidently attributed to the proposed components rather than incidental training differences.
Minor comments (1)
- [Abstract] The abstract would benefit from naming the specific backbone architecture and the datasets used for the reported action segmentation results to allow immediate assessment of the scope of the gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to provide tighter experimental controls.
Point-by-point responses
- Referee: [Abstract and Experiments] The headline claim attributes the 11.21% offline and 13.65% online gains specifically to the diffusion positional inference (removing leakage while preserving visible anchors) and the DDPM noise-prediction objective (modeling full conditional motion distributions). No description is given of matched ablations that hold optimizer schedule, total epochs, data augmentations, decoder architecture, and loss weighting fixed while toggling only these two diffusion elements. Without such controls, the improvements cannot be confidently attributed to the proposed components rather than incidental training differences.
Authors: We thank the referee for this important observation on experimental rigor. Our current backbone baseline employs the identical architecture, optimizer schedule, epoch count, data augmentations, decoder design, and loss weighting as DiMP; the sole difference is the pretraining objective (standard masked reconstruction versus our diffusion positional inference plus DDPM noise prediction). This already isolates the effect of the two diffusion components. To provide the requested matched ablations that toggle only these elements, we will add new tables in the revised Experiments section that independently enable and disable the diffusion positional inference and the conditioned DDPM motion objective while freezing all other training details. These results will confirm that the reported gains arise from the proposed diffusion modeling.
Revision: yes
Circularity Check
DiMP proposes independent diffusion objectives with no self-referential derivation chain
Full rationale
The paper introduces a new self-supervised pretraining framework that augments masked reconstruction with forward diffusion on masked tube centers and DDPM noise prediction on displacements. No equations, parameters, or claims reduce the reported downstream gains (11.21% offline, 13.65% online) to quantities defined by construction from the same paper's fitted inputs or prior self-citations. The method is presented as an empirical training paradigm whose benefits are measured on downstream tasks rather than derived tautologically from its own definitions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "DiMP introduces diffusion modeling into both positional inference and motion learning... reformulates point-wise inter-frame displacement supervision as a DDPM noise-prediction objective"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "applies forward diffusion noise only to masked tube centers... predicts clean centers from visible spatio-temporal context"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [2] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In European Conference on Computer Vision (ECCV), 2022.
- [3] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [4] Yiding Sun, Haozhe Cheng, Chaoyi Lu, Zhengqiao Li, Minghong Wu, Huimin Lu, and Jihua Zhu. HyperPoint: Multimodal 3D foundation model in hyperbolic space. Pattern Recognition (PR), 2025.
- [5] Pengbo Li, Yiding Sun, and Haozhe Cheng. PointDico: Contrastive 3D representation learning guided by diffusion models. In 2025 International Joint Conference on Neural Networks (IJCNN). IEEE, 2025.
- [6] Xiao Zheng, Xiaoshui Huang, Guofeng Mei, Yuenan Hou, Zhaoyang Lyu, Bo Dai, Wanli Ouyang, and Yongshun Gong. Point cloud pre-training with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [7] Yanlong Li, Chamara Madarasingha, and Kanchana Thilakarathna. DiffPMAE: Diffusion masked autoencoders for point cloud reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2024.
- [8] Xiaoyang Xiao, Runzhao Yao, Zhiqiang Tian, and Shaoyi Du. Point-MaDi: Masked autoencoding with diffusion for point cloud pre-training. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [9] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [10] Sepehr Sameni, Simon Jenni, and Paolo Favaro. Representation learning by detecting incorrect location embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023.
- [11] Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tong Wang, and Zhao-Xiang Zhang. DropPos: Pre-training vision transformers by reconstructing dropped positions. Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [12] Mathilde Caron, Neil Houlsby, and Cordelia Schmid. Location-aware self-supervised transformers for semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.
- [13] Yiding Sun, Jihua Zhu, Haozhe Cheng, Chaoyi Lu, Zhichuan Yang, Lin Chen, and Yaonan Wang. Align then adapt: Rethinking parameter-efficient transfer learning in 4D perception. arXiv preprint arXiv:2602.23069, 2026.
- [14] Zhiqiang Shen, Xiaoxiao Sheng, Hehe Fan, Longguang Wang, Yulan Guo, Qiong Liu, Hao Wen, and Xi Zhou. Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- [15] Yuehui Han, Can Xu, Rui Xu, Jianjun Qian, and Jin Xie. Masked motion prediction with semantic contrast for point cloud sequence learning. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2024.
- [16] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [17] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020.
- [18] Saining Xie, Jiatao Gu, Demi Guo, Charles R. Qi, Leonidas Guibas, and Or Litany. PointContrast: Unsupervised pre-training for 3D point cloud understanding. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020.
- [19] Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. CrossPoint: Self-supervised cross-modal contrastive learning for 3D point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [20] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-M2AE: Multi-scale masked autoencoders for hierarchical point cloud pre-training. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [21] Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, and Hongsheng Li. Learning 3D representations from 2D pre-trained models via image-to-point masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [22] Dongxu Zhang, Yiding Sun, Pengcheng Li, Yumou Liu, Hongqiang Lin, Haoran Xu, Xiaoxuan Mu, Liang Lin, Wenbiao Yan, Ning Yang, et al. PointCoT: A multi-modal benchmark for explicit 3D geometric reasoning. arXiv preprint arXiv:2602.23945, 2026.
- [23] Yankai Wang, Yiding Sun, Qirui Wang, Pengbo Li, Chaoyi Lu, and Dongxu Zhang. PointRFT: Explicit reinforcement fine-tuning for point cloud few-shot learning. arXiv preprint arXiv:2603.23957, 2026.
- [24] Zekun Qi, Runpei Dong, Guofan Fan, Zheng Ge, Xiangyu Zhang, Kaisheng Ma, and Li Yi. Contrast with reconstruct: Contrastive 3D representation learning guided by generative pretraining. In Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2023.
- [25] Haiyan Wang, Liang Yang, Xuejian Rong, Jinglun Feng, and Yingli Tian. Self-supervised 4D spatio-temporal feature learning via order prediction of sequential point cloud clips. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021.
- [26] Xiaoxiao Sheng, Zhiqiang Shen, and Gang Xiao. Contrastive predictive autoencoders for dynamic point cloud self-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023.
- [27] Zhiqiang Shen, Xiaoxiao Sheng, Longguang Wang, Yulan Guo, Qiong Liu, and Xi Zhou. PointCMP: Contrastive mask prediction for self-supervised learning on point cloud videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [28] Xiaoxiao Sheng, Zhiqiang Shen, Gang Xiao, Longguang Wang, Yulan Guo, and Hehe Fan. Point contrastive prediction with semantic clustering for self-supervised learning on point cloud videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- [29] Zhi Zuo, Chenyi Zhuang, Pan Gao, Jie Qin, Hao Feng, and Nicu Sebe. Uni4D: A unified self-supervised learning framework for point cloud videos. arXiv preprint arXiv:2504.04837, 2025.
- [30] Hehe Fan, Yi Yang, and Mohan Kankanhalli. Point 4D transformer networks for spatio-temporal modeling in point cloud videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [31] Yiming Zeng, Yue Qian, Zhiyu Zhu, Junhui Hou, Hui Yuan, and Ying He. CorrNet3D: Unsupervised end-to-end learning of dense correspondence for 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [32] Hao Wen, Yunze Liu, Jingwei Huang, Bo Duan, and Li Yi. Point primitive transformer for long-term 4D point cloud video understanding. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022.
- [33] Siyuan Huang, Yichen Xie, Song-Chun Zhu, and Yixin Zhu. Spatio-temporal self-supervised representation learning for 3D point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021.
- [34] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [35] Zhuoyang Zhang, Yuhao Dong, Yunze Liu, and Li Yi. Complete-to-partial 4D distillation for self-supervised point cloud sequence representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [36] Hehe Fan, Xin Yu, Yuhang Ding, Yi Yang, and Mohan Kankanhalli. PSTNet: Point spatio-temporal convolution on point cloud sequences. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [37] Hehe Fan, Yezhou Yang, and Mohan Kankanhalli. Point spatio-temporal transformer networks for point cloud video modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
- [38] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [39] Yunze Liu, Zifan Wang, Peiran Wu, and Jiayang Ao. PointNet4D: A lightweight 4D point cloud video backbone for online and offline perception in robotic applications. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026.
- [40] Yuhao Dong et al. NSM4D: Neural scene model based online 4D point cloud sequence understanding. arXiv preprint arXiv:2310.08326, 2023.
- [41] Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. HOI4D: A 4D egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [42] Wanqing Li, Zhengyou Zhang, and Zicheng Liu. Action recognition based on a bag of 3D points. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops). IEEE, 2010.
- [43] Quentin De Smedt, Hazem Wannous, Jean-Philippe Vandeborre, Joris Guerry, Bertrand Le Saux, and David Filliat. SHREC'17 track: 3D hand gesture recognition using a depth and skeletal dataset. In Proceedings of the Eurographics Workshop on 3D Object Retrieval (3DOR), 2017.
- [44] Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and Jan Kautz. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [45] Boris Ivanovic and Marco Pavone. The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.