pith. machine review for the scientific record.

arxiv: 2604.15281 · v1 · submitted 2026-04-16 · 💻 cs.CV · cs.RO

Recognition: unknown

R3D: Revisiting 3D Policy Learning

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 10:54 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords 3D policy learning · imitation learning · transformer encoder · diffusion decoder · robotic manipulation · training stability · 3D perception · generalization

The pith

Coupling a transformer 3D encoder with a diffusion decoder stabilizes large-scale 3D policy learning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that 3D policy learning has been blocked by training instabilities and severe overfitting that prevent using powerful 3D perception models. The authors trace the problems to the omission of 3D data augmentation and the adverse effects of batch normalization. They introduce an architecture that pairs a scalable transformer-based 3D encoder with a diffusion decoder built for stability at large scale and to benefit from pre-training. If the diagnosis and fix hold, this creates a reliable path to better generalization and cross-embodiment transfer in 3D imitation learning for manipulation tasks.

Core claim

3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning.

What carries the argument

Scalable transformer-based 3D encoder coupled with a diffusion decoder engineered for stability at scale and to leverage large-scale pre-training

If this is right

  • significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks
  • establishes a new and robust foundation for scalable 3D imitation learning
  • promises superior generalization and cross-embodiment transfer
  • allows adoption of powerful 3D perception models

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The specific diagnosis of augmentation and normalization issues could guide fixes in other 3D vision-based learning systems.
  • Emphasis on pre-training suggests value in building larger 3D datasets tailored to policy tasks.
  • If the causes are correctly identified, incremental changes to prior methods might narrow much of the performance gap without a full redesign.

Load-bearing premise

The omission of 3D data augmentation and the adverse effects of batch normalization are the primary causes of instability and overfitting, and the proposed transformer-diffusion architecture resolves these issues at scale without new failure modes.

What would settle it

Training existing 3D policy baselines with added 3D data augmentation and without batch normalization, then comparing their performance to the new architecture on the same manipulation benchmarks.
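One way to run that comparison is to retrofit an existing baseline in place: strip its Batch Normalization layers and add 3D augmentation to its data pipeline, leaving everything else fixed. A minimal sketch in PyTorch, using GroupNorm as the stand-in normalization (the paper's exact substitute, if it uses one, is not stated in the abstract):

```python
import torch.nn as nn

def strip_batchnorm(module, num_groups=8):
    """Recursively swap BatchNorm layers for GroupNorm in an existing encoder.

    GroupNorm is one plausible drop-in; num_groups must divide each layer's
    channel count. The paper's actual normalization choice is an assumption here.
    """
    for name, child in module.named_children():
        if isinstance(child, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            strip_batchnorm(child, num_groups)
    return module
```

Training the retrofitted baseline with 3D augmentation enabled, then comparing it against both the unmodified baseline and R3D on the same benchmarks, would show how much of the gap the two fixes close without the new architecture.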

Figures

Figures reproduced from arXiv: 2604.15281 by Boyi Zhao, Guofeng Zhang, Hangxing Zhang, Haozhe Cui, Jiayuan Gu, Jun Wang, Ran Ji, Shenrui Wu, Yiyang He, Zhengdong Hong, Zundong Ke.

Figure 2
Figure 2. Figure 2: Success rates with varied decoder depth (each point is averaged over 5 ckpts). The surrounding text describes training curves with a marked decline in success rates as training progresses; the introduction of data augmentation not only stabilized the training process but also significantly improved success rates. This improvement arises because the policy leverages a broader range of out-of-distribution data, thereby enhancing its generalization… view at source ↗
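A minimal sketch of what such 3D augmentation might look like, assuming NumPy and illustrative transform magnitudes; the paper's exact augmentation types and strengths are not given in the excerpt, and in a policy-learning setting the same rigid transform would normally also be applied to the action targets.

```python
import numpy as np

def augment_point_cloud(points, max_rot_deg=180.0, max_trans_m=0.05, jitter_std=0.002):
    """Randomly rotate (about z), translate, and jitter an (N, 3) point cloud.

    Magnitudes are illustrative, not the paper's reported settings. For policy
    learning, the same rigid transform should also be applied to action labels
    so observations and targets stay consistent.
    """
    theta = np.deg2rad(np.random.uniform(-max_rot_deg, max_rot_deg))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])           # rotation about the gravity (z) axis
    shift = np.random.uniform(-max_trans_m, max_trans_m, size=3)   # small translation
    noise = np.random.normal(scale=jitter_std, size=points.shape)  # sensor-like jitter
    return points @ rot.T + shift + noise
```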
Figure 3
Figure 3. Figure 3: Pipeline architecture: our 3D policy architecture consists of a point cloud encoder encoding input point clouds and a diffusion transformer to decode actions. The encoder can be pre-trained on 3D segmentation tasks. Diffusion Formulation. We model the policy using a conditional diffusion process. Following [7], the model learns to refine a noise sequence a^(K) ∼ N(0, I) into a clean action sequence a^(0)… view at source ↗
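The diffusion formulation sketched in the caption, written out in the standard denoising form it appears to follow (after [7]); the ε-prediction parameterization and noise schedule below are assumptions, since the excerpt does not state them.

```latex
% Forward (noising) process applied to a clean action sequence a^{(0)}:
a^{(k)} = \sqrt{\bar{\alpha}_k}\, a^{(0)} + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Training objective: predict the injected noise, conditioned on the
% point-cloud features c produced by the 3D encoder:
\mathcal{L} = \mathbb{E}_{k,\, a^{(0)},\, \epsilon}
  \big\| \epsilon - \epsilon_\theta\big(a^{(k)}, k, c\big) \big\|^2

% Reverse (denoising) step, iterated from a^{(K)} \sim \mathcal{N}(0, I) to a^{(0)}:
a^{(k-1)} = \tfrac{1}{\sqrt{\alpha_k}}
  \Big( a^{(k)} - \tfrac{1 - \alpha_k}{\sqrt{1 - \bar{\alpha}_k}}\,
  \epsilon_\theta\big(a^{(k)}, k, c\big) \Big) + \sigma_k z,
\qquad z \sim \mathcal{N}(0, I)
```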
Figure 4
Figure 4. Figure 4: Difference between decoders used in DP3 [49] and ours. DP3 uses a global feature extracted from the input point cloud, and conditions the denoising process via FiLM. Ours uses dense features, and conditions the denoising process through cross-attention. Figure (b) details the structure of a single attention block. Ours intentionally omits this global pooling, instead preserving the full spatial resolution of the … view at source ↗
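A minimal sketch of the two conditioning schemes the caption contrasts, in PyTorch; layer names, dimensions, and the residual/normalization placement are assumptions rather than the paper's exact blocks.

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """DP3-style: one global scene feature scales and shifts the action tokens."""
    def __init__(self, act_dim, cond_dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * act_dim)

    def forward(self, action_tokens, global_feat):
        # action_tokens: (B, T, act_dim); global_feat: (B, cond_dim)
        scale, shift = self.to_scale_shift(global_feat).chunk(2, dim=-1)
        return action_tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class CrossAttentionConditioning(nn.Module):
    """Described R3D-style: action tokens attend to dense per-point features."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, action_tokens, point_feats):
        # action_tokens: (B, T, dim); point_feats: (B, N, dim), no global pooling
        attended, _ = self.attn(query=action_tokens, key=point_feats, value=point_feats)
        return self.norm(action_tokens + attended)
```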
Figure 5
Figure 5. Figure 5: Simulation Settings: The point clouds shown below are generated by cropping the table and the ground, followed by downsampling to 1024 points each. Simulation Environment. We evaluate our method on two widely-used simulation benchmarks, RoboTwin 2.0 [6] and ManiSkill2 [15]. RoboTwin 2.0 [6] is a scalable benchmark for evaluating robust bimanual manipulation across 50 tasks. It features an “Easy” setting w… view at source ↗
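A minimal sketch of the preprocessing the caption describes, assuming NumPy, an axis-aligned workspace crop, and uniform random downsampling; the paper's actual crop bounds and sampler (e.g. farthest-point sampling) may differ.

```python
import numpy as np

def preprocess_cloud(points, workspace_min, workspace_max, n_points=1024):
    """Crop away the table and ground, then downsample to a fixed point budget."""
    # Keep only points inside the (hypothetical) workspace box.
    mask = np.all((points >= workspace_min) & (points <= workspace_max), axis=1)
    cropped = points[mask]
    # Uniform random downsampling to n_points (with replacement if too few remain).
    replace = cropped.shape[0] < n_points
    idx = np.random.choice(cropped.shape[0], n_points, replace=replace)
    return cropped[idx]
```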
Figure 6
Figure 6. Figure 6: Ablation study on different encoder design choices. Here are some conclusions drawn from ablation studies (Tab. 7 and … view at source ↗
Figure 8
Figure 8. Figure 8 … view at source ↗
read the original abstract

3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper diagnoses instabilities and severe overfitting in 3D policy learning as stemming primarily from the omission of 3D data augmentation and the use of Batch Normalization. It introduces the R3D architecture, which pairs a scalable transformer-based 3D encoder with a diffusion-based decoder, designed for training stability at scale and compatibility with large-scale pre-training. The central empirical claim is that this approach significantly outperforms prior 3D baselines on challenging manipulation benchmarks, providing a robust foundation for scalable 3D imitation learning.

Significance. If the reported performance gains and ablation evidence hold under scrutiny, the work would address a key barrier to adopting expressive 3D perception models in imitation learning, potentially enabling better generalization and cross-embodiment transfer. The explicit focus on stability fixes (augmentation and normalization) plus an architecture suited to pre-training represents a constructive engineering contribution in a domain where empirical progress has been limited by training difficulties.

minor comments (3)
  1. The abstract and high-level description would be strengthened by including at least one or two key quantitative results (e.g., success rates or relative improvement percentages) alongside the qualitative claim of outperformance.
  2. Clarify the precise experimental protocol for the 3D data augmentation (types, magnitudes, and application schedule) and the exact modifications to Batch Normalization, as these are presented as the primary fixes.
  3. Ensure that all baseline comparisons control for training compute, data scale, and hyperparameter tuning effort to support the attribution of gains to the proposed architecture and fixes.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, significance assessment, and recommendation for minor revision. The referee accurately captures the core diagnosis of instabilities in 3D policy learning and the proposed R3D architecture. As the report contains no specific major comments to address, we have no point-by-point rebuttals at this stage.

Circularity Check

0 steps flagged

No significant circularity; empirical architecture proposal

full rationale

The paper is an empirical contribution that diagnoses training instabilities in 3D policy learning via omission of data augmentation and batch-norm effects, then proposes a transformer-based 3D encoder plus diffusion decoder for stability at scale. All central claims reduce to benchmark outperformance on manipulation tasks rather than any mathematical derivation, prediction, or first-principles result. No equations appear that could reduce a claimed output to a fitted input or self-referential definition by construction. Self-citations, if present, are not load-bearing for the architecture choice or performance claims. The work is therefore self-contained as an engineering and experimental advance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions about transformer and diffusion model trainability on 3D data; no explicit free parameters, new entities, or ad-hoc axioms are stated in the abstract.

axioms (1)
  • domain assumption Transformer and diffusion models can be stably trained on 3D point-cloud inputs when batch normalization is removed and 3D-specific augmentations are added.
    Implicit premise required for the proposed architecture to succeed at scale.

pith-pipeline@v0.9.0 · 5453 in / 1199 out tokens · 34397 ms · 2026-05-10T10:54:37.804003+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 39 canonical work pages · 13 internal anchors

  1. [1]

    arXiv preprint arXiv:2111.08897 (2021)

    Baruch, G., Chen, Z., Dehghan, A., Dimry, T., Feigin, Y., Fu, P., Gebauer, T., Joffe, B., Kurz, D., Schwartz, A., et al.: Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897 (2021) 7, 19

  2. [2]

    3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks,

    Bhat, V., Lan, Y.H., Krishnamurthy, P., Karri, R., Khorrami, F.: 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800 (2025) 3

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024) 1, 2, 12, 13, 21

  4. [4]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Bu, Q., Yang, Y., Cai, J., Gao, S., Ren, G., Yao, M., Luo, P., Li, H.: Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025) 1, 2

  5. [5]

    Ac-dit: Adaptive coordination diffusion transformer for mobile manipulation.arXiv preprint arXiv:2507.01961, 2025

    Chen, S., Liu, J., Qian, S., Jiang, H., Li, L., Zhang, R., Liu, Z., Gu, C., Hou, C., Wang, P., et al.: Ac-dit: Adaptive coordination diffusion transformer for mobile manipulation. arXiv preprint arXiv:2507.01961 (2025) 3

  6. [6]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al.: Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025) 3, 4, 8, 9, 10, 21

  7. [7]

    The International Journal of Robotics Research 44(10-11), 1684–1704 (2025) 1, 2, 5, 6, 10, 12, 13, 17, 21

    Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025) 1, 2, 5, 6, 10, 12, 13, 17, 21

  8. [8]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828–5839 (2017) 7, 19

  9. [9]

    scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry, K., Alex, I., Peter, P., Julian, I., Alexander, H., Eric, J., Deirdre, Q., Ethan, H., Mrinal, K., Vincent, V., et al.: Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint (2018) 4

  10. [10]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) 4, 6

  11. [11]

    arXiv preprint arXiv:2503.03081 (2025) 20

    Fang, H., Wang, C., Wang, Y., Chen, J., Xia, S., Lv, J., He, Z., Yi, X., Guo, Y., Zhan, X., et al.: Airexo-2: Scaling up generalizable robotic imitation learning with low-cost exoskeletons. arXiv preprint arXiv:2503.03081 (2025) 20

  12. [12]

    Act3d: Infinite resolution action detection transformer for robotic manipulation

    Gervet, T., Xian, Z., Gkanatsios, N., Fragkiadaki, K.: Act3d: 3d feature field transformers for multi-task robotic manipulation. arXiv preprint arXiv:2306.17817 (2023) 3

  13. [13]

    arXiv preprint arXiv:2508.11002 (2025) 3

    Gkanatsios, N., Xu, J., Bronars, M., Mousavian, A., Ke, T.W., Fragkiadaki, K.: 3d flowmatch actor: Unified 3d policy for single-and dual-arm manipulation. arXiv preprint arXiv:2508.11002 (2025) 3

  14. [14]

    In: Conference on Robot Learning

    Goyal, A., Xu, J., Guo, Y., Blukis, V., Chao, Y.W., Fox, D.: Rvt: Robotic view transformer for 3d object manipulation. In: Conference on Robot Learning. pp. 694–710. PMLR (2023) 3

  15. [15]

    Maniskill2: A unified benchmark for generalizable manipulation skills.arXiv preprint arXiv:2302.04659, 2023

    Gu, J., Xiang, F., Li, X., Ling, Z., Liu, X., Mu, T., Tang, Y., Tao, S., Wei, X., Yao, Y., et al.: Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659 (2023) 9, 10, 22

  16. [16]

    International Conference on Intelligent Robots and Systems (IROS) (2024) 18

    Hong, Z., Zheng, K., Chen, L.: Fully automatic hand-eye calibration with pretrained image models. International Conference on Intelligent Robots and Systems (IROS) (2024) 18

  17. [17]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Hou, C., Ze, Y., Fu, Y., Gao, Z., Hu, S., Yu, Y., Zhang, S., Xu, H.: 4d visual pre-training for robot learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8451–8461 (October 2025) 3

  18. [18]

    Lift3d foundation policy: Lifting 2d large-scale pretrained models for robust 3d robotic manipulation. arXiv preprint arXiv:2411.18623, 2024

    Jia, Y., Liu, J., Chen, S., Gu, C., Wang, Z., Luo, L., Lee, L., Wang, P., Wang, Z., Zhang, R., et al.: Lift3d foundation policy: Lifting 2d large-scale pretrained models for robust 3d robotic manipulation. arXiv preprint arXiv:2411.18623 (2024) 3

  19. [19]

    3d diffuser actor: Policy diffusion with 3d scene representations, 2024

    Ke, T.W., Gkanatsios, N., Fragkiadaki, K.: 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885 (2024) 1, 3

  20. [20]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025) 2

  21. [21]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024) 1, 2

  22. [22]

    arXiv preprint arXiv:2503.07511 (2025)

    Li, C., Wen, J., Peng, Y., Peng, Y., Feng, F., Zhu, Y.: Pointvla: Injecting the 3d world into vision-language-action models. arXiv preprint arXiv:2503.07511 (2025) 3

  23. [23]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

    Li, F., Song, W., Zhao, H., Wang, J., Ding, P., Wang, D., Zeng, L., Li, H.: Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276 (2025) 3, 10, 21

  24. [24]

    Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models

    Li, P., Chen, Y., Wu, H., Ma, X., Wu, X., Huang, Y., Wang, L., Kong, T., Tan, T.: Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models. arXiv preprint arXiv:2506.07961 (2025) 3

  25. [25]

    Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023

    Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., Liu, H., et al.: Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378 (2023) 2

  26. [26]

    arXiv preprint arXiv:2509.02530 (2025) 12, 18

    Liu, M., Zhu, Z., Han, X., Hu, P., Lin, H., Li, X., Chen, J., Xu, J., Yang, Y., Lin, Y., et al.: Manipulation as in simulation: Enabling accurate geometry perception in robots. arXiv preprint arXiv:2509.02530 (2025) 12, 18

  27. [27]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., Zhu, J.: Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864 (2024) 2, 10, 21

  28. [28]

    Vo-dp: Semantic-geometric adaptive diffusion policy for vision-only robotic manipulation. arXiv preprint arXiv:2510.15530, 2025

    Ni, Z., He, Y., Qian, L., Mao, J., Fu, F., Sui, W., Su, H., Peng, J., Wang, Z., He, B.: Vo-dp: Semantic-geometric adaptive diffusion policy for vision-only robotic manipulation. arXiv preprint arXiv:2510.15530 (2025) 3

  29. [29]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

    O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al.: Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 6892–6903. IEEE (2024) 1, 2

  30. [30]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 2, 7, 9

  31. [31]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 652–660 (2017) 4, 6, 9

  32. [32]

    Advances in Neural Information Processing Systems 30 (2017) 4

    Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems 30 (2017) 4

  33. [33]

    arXiv preprint arXiv:2509.15733 (2025) 3

    Qian, Q., Zhao, G., Zhang, G., Wang, J., Xu, R., Gao, J., Zhao, D.: Gp3: A 3d geometry-aware policy with multi-view images for robotic manipulation. arXiv preprint arXiv:2509.15733 (2025) 3

  34. [34]

    3d-mvp: 3d multi-view pretraining for robotic manipulation. arXiv preprint arXiv:2406.18158, 2024

    Qian, S., Mo, K., Blukis, V., Fouhey, D.F., Fox, D., Goyal, A.: 3d-mvp: 3d multiview pretraining for robotic manipulation. arXiv preprint arXiv:2406.18158 (2024) 3

  35. [35]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Qu, D., Song, H., Chen, Q., Yao, Y., Ye, X., Ding, Y., Wang, Z., Gu, J., Zhao, B., Wang, D., et al.: Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830 (2025) 3

  36. [36]

    In: Conference on Robot Learning

    Shridhar, M., Manuelli, L., Fox, D.: Perceiver-actor: A multi-task transformer for robotic manipulation. In: Conference on Robot Learning. pp. 785–799. PMLR (2023) 2

  37. [37]

    Geovla: Empowering 3d representations in vision-language-action models

    Sun, L., Xie, B., Liu, Y., Shi, H., Wang, T., Cao, J.: Geovla: Empowering 3d representations in vision-language-action models. arXiv preprint arXiv:2508.09071 (2025) 3

  38. [38]

    Octo: An Open-Source Generalist Robot Policy

    Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024) 2

  39. [39]

    Improving robotic manipulation with efficient geometry-aware vision encoder.arXiv preprint arXiv:2509.15880, 2025

    Vuong, A.D., Vu, M.N., Reid, I.: Improving robotic manipulation with efficient geometry-aware vision encoder. arXiv preprint arXiv:2509.15880 (2025) 10, 21

  40. [40]

    Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding

    Wang, P., He, Y., Lv, X., Zhou, Y., Xu, L., Yu, J., Gu, J.: Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding. arXiv preprint arXiv:2510.20155 (2025) 7, 19

  41. [41]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Wu, H., Jing, Y., Cheang, C., Chen, G., Xu, J., Li, X., Liu, M., Li, H., Kong, T.: Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139 (2023) 2

  42. [42]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wu, X., Jiang, L., Wang, P.S., Liu, Z., Liu, X., Qiao, Y., Ouyang, W., He, T., Zhao, H.: Point transformer v3: Simpler faster stronger. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4840–4851 (2024) 1

  43. [43]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

    Xie, A., Lee, L., Xiao, T., Finn, C.: Decomposing the generalization gap in imitation learning for visual robotic manipulation. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 3153–3160. IEEE (2024) 1

  44. [44]

    arXiv preprint arXiv:2509.01819 (2025) 3, 9, 10, 12, 13, 21

    Yan, G., Zhu, J., Deng, Y., Yang, S., Qiu, R.Z., Cheng, X., Memmel, M., Krishna, R., Goyal, A., Wang, X., et al.: Maniflow: A general robot manipulation policy via consistency flow training. arXiv preprint arXiv:2509.01819 (2025) 3, 9, 10, 12, 13, 21

  45. [45]

    Fp3: A 3d foundation policy for robotic manipulation.arXiv preprint arXiv:2503.08950, 2025

    Yang, R., Chen, G., Wen, C., Gao, Y.: Fp3: A 3d foundation policy for robotic manipulation. arXiv preprint arXiv:2503.08950 (2025) 3

  46. [46]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yu, H., Jin, Y., He, Y., Sui, W.: Efficient task-specific conditional diffusion policies: Shortcut model acceleration and SO(3) optimization. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4174–4183 (2025) 2

  47. [47]

    DepthVLA: Enhancing vision-language-action models with depth-aware spatial reasoning. arXiv preprint arXiv:2510.13375, 2025

    Yuan, T., Liu, Y., Lu, C., Chen, Z., Jiang, T., Zhao, H.: Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning. arXiv preprint arXiv:2510.13375 (2025) 3

  48. [48]

    arXiv e-prints pp

    Ze, Y., Chen, Z., Wang, W., Chen, T., He, X., Yuan, Y., Peng, X.B., Wu, J.: Generalizable humanoid manipulation with improved 3d diffusion policies. arXiv e-prints pp. arXiv–2410 (2024) 1

  49. [49]

    3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

    Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., Xu, H.: 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954 (2024) 1, 3, 4, 6, 7, 10, 12, 13, 17, 21, 22

  50. [50]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023) 1, 2

  51. [51]

    arXiv preprint arXiv:2506.19269 (2025) 3

    Zhao, Z., Fan, K., Xu, H.Y., Qiao, N., Peng, B., Gao, W., Li, D., Shen, H.: Anchordp3: 3d affordance guided sparse diffusion policy for robotic manipulation. arXiv preprint arXiv:2506.19269 (2025) 3

  52. [52]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631 (2024) 3

  53. [53]

    Uni3d: Exploring unified 3d representation at scale

    Zhou, J., Wang, J., Ma, B., Liu, Y.S., Huang, T., Wang, X.: Uni3d: Exploring unified 3d representation at scale. arXiv preprint arXiv:2310.06773 (2023) 1, 4, 6, 9, 20

  54. [54]

    arXiv preprint arXiv:2406.17741 (2024)

    Zhou, Y., Gu, J., Chiang, T.Y., Xiang, F., Su, H.: Point-sam: Promptable 3d segmentation model for point clouds. arXiv preprint arXiv:2406.17741 (2024) 1, 2, 5, 6, 7, 19

  55. [55]

    Zhu, H., Wang, Y., Huang, D., Ye, W., Ouyang, W., He, T.: Point cloud matters: Rethinking the impact of different observation spaces on robot learning. Advances in Neural Information Processing Systems 37, 77799–77830 (2024) 10, 22

  56. [56]

    In: Conference on Robot Learning

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023) 1, 2