3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Pith reviewed 2026-05-15 01:45 UTC · model grok-4.3
The pith
A compact 3D point-cloud representation lets diffusion policies learn precise robot manipulation from only ten demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By conditioning a diffusion policy on a compact 3D representation extracted from sparse point clouds via an efficient point encoder, DP3 produces actions that achieve high success rates with minimal demonstrations and exhibit strong generalization in simulation and on physical robots.
What carries the argument
The compact 3D visual representation obtained from sparse point clouds through an efficient point encoder, which conditions the diffusion model for action generation.
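To make the idea of a compact representation concrete, here is a minimal sketch of one way a sparse point cloud can be reduced to a single conditioning vector: a shared per-point MLP, an order-invariant max-pool, and a final projection. This is not the authors' architecture. The layer sizes, the random weights, and the helper names (`make_layer`, `linear`, `encode_point_cloud`) are hypothetical, and plain Python is used only to keep the sketch dependency-free; a real implementation would use a tensor library.

```python
import random

def make_layer(rng, n_in, n_out):
    """Random weights and zero biases for one linear layer (illustrative only)."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out

def linear(x, layer):
    """Apply y = xW + b to a single vector."""
    weights, bias = layer
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
            for j in range(len(bias))]

def encode_point_cloud(points, mlp_layers, projection):
    """Map a list of (x, y, z) points to one compact feature vector."""
    per_point = []
    for p in points:
        h = list(p)
        for layer in mlp_layers:
            h = [max(v, 0.0) for v in linear(h, layer)]  # shared MLP + ReLU
        per_point.append(h)
    pooled = [max(col) for col in zip(*per_point)]       # order-invariant max-pool
    return linear(pooled, projection)                    # compact conditioning vector

# Hypothetical sizes: 128 points, hidden widths 32 and 64, 16-dim output.
rng = random.Random(0)
mlp_layers = [make_layer(rng, 3, 32), make_layer(rng, 32, 64)]
projection = make_layer(rng, 64, 16)
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
         for _ in range(128)]
feature = encode_point_cloud(cloud, mlp_layers, projection)

# Shuffling the points must not change the feature, since max-pool ignores order.
shuffled = cloud[:]
rng.shuffle(shuffled)
feature_shuffled = encode_point_cloud(shuffled, mlp_layers, projection)
```

The max-pool is what makes the output independent of point ordering, which matters because a point cloud is an unordered set. The resulting vector is the kind of input a diffusion policy could be conditioned on.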
If this is right
- Most simulated tasks become solvable with only 10 demonstrations instead of hundreds.
- Real-robot success reaches 85 percent with 40 demonstrations while preserving safety constraints.
- Generalization holds across spatial shifts, camera viewpoints, object appearances, and specific instances.
- Baseline methods that rely on 2D images violate safety rules far more often and require human intervention.
Where Pith is reading between the lines
- If the 3D encoding works for these tasks, similar geometric priors could reduce data needs in other control domains such as locomotion or navigation.
- Adding temporal consistency to the point-cloud encoder might further improve performance on long-horizon tasks.
- The safety benefit suggests that explicit 3D structure helps models reason about collisions more reliably than raw images.
- Deploying the same pipeline on different robot platforms could test whether the representation is embodiment-agnostic.
Load-bearing premise
The assumption that a compact 3D point-cloud encoding captures every piece of task-relevant geometry without missing critical details that only richer visual or tactile data would provide.
What would settle it
Rerun the real-robot experiments with objects whose shapes differ markedly from those seen in the 40 training demonstrations; a sharp drop in success rate would indicate that the 3D representation does not generalize across instances as claimed.
Original abstract
Imitation learning provides an efficient way to teach robots dexterous skills; however, learning complex skills robustly and generalizably usually consumes large amounts of human demonstrations. To tackle this challenging problem, we present 3D Diffusion Policy (DP3), a novel visual imitation learning approach that incorporates the power of 3D visual representations into diffusion policies, a class of conditional action generative models. The core design of DP3 is the utilization of a compact 3D visual representation, extracted from sparse point clouds with an efficient point encoder. In our experiments involving 72 simulation tasks, DP3 successfully handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement. In 4 real robot tasks, DP3 demonstrates precise control with a high success rate of 85%, given only 40 demonstrations of each task, and shows excellent generalization abilities in diverse aspects, including space, viewpoint, appearance, and instance. Interestingly, in real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do, necessitating human intervention. Our extensive evaluation highlights the critical importance of 3D representations in real-world robot learning. Videos, code, and data are available on https://3d-diffusion-policy.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces 3D Diffusion Policy (DP3), which augments diffusion policies with a compact 3D visual representation extracted from sparse point clouds using an efficient point encoder. It reports that DP3 solves most of 72 simulation tasks with only 10 demonstrations and achieves a 24.2% relative improvement over baselines. In four real-robot tasks it reaches 85% success with 40 demonstrations per task, generalizes across space, viewpoint, appearance, and instance, and, unlike the baselines, rarely violates safety constraints.
Significance. If the experimental claims hold under fuller controls, the work demonstrates that lightweight 3D representations can markedly improve sample efficiency and robustness in visuomotor imitation learning, offering a practical route to reduce demonstration requirements for real-world manipulation. The public release of code, data, and videos strengthens reproducibility.
major comments (3)
- [§4.2, Table 1] §4.2 and Table 1: the 24.2% relative improvement is stated without detailing baseline implementations (e.g., whether they receive identical point-cloud input or 2D RGB, network sizes, or training protocols), preventing attribution of gains specifically to the 3D representation versus the diffusion backbone.
- [§5.1–5.3] §5.1–5.3: success rates (85% on real robots) and generalization claims are reported without variance, number of trials, or statistical tests; with only 40 demonstrations per task, this leaves open whether results are robust or sensitive to demonstration selection.
- [§3.1] §3.1: the point encoder is described as operating on sparse clouds, yet no ablation varies point density, adds surface normals, or tests partial occlusions; this directly bears on the central assumption that the representation suffices for contact-rich geometry in the reported real-robot tasks.
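The point-density ablation requested in the last major comment needs a way to produce the same scene at several densities. A minimal sketch follows, assuming farthest point sampling as the subsampling rule; whether this matches the paper's own preprocessing is an assumption, and the function name, cloud size, and density levels are hypothetical.

```python
import random

def farthest_point_sample(points, k, start=0):
    """Pick k indices so each new point is farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # Squared distance from every point to its nearest chosen point so far.
    nearest = [sum((a - b) ** 2 for a, b in zip(p, points[start])) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, points[idx]))
            if d < nearest[i]:
                nearest[i] = d
    return chosen

# Hypothetical ablation grid: the same 512-point cloud at three densities.
rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(512)]
subsets = {k: [cloud[i] for i in farthest_point_sample(cloud, k)]
           for k in (32, 128, 256)}
```

Each subset would then be passed through the same encoder and policy, so that any change in success rate can be attributed to point density alone.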
minor comments (2)
- [Figure 3, §4.3] Figure 3 and §4.3: axis labels and legend entries are too small for print; success-rate plots would benefit from error bars or per-seed scatter.
- [§2] §2: the notation for the conditional diffusion process mixes p_θ and the 3D encoder output without a clear diagram or equation linking them; a single-line definition would improve readability.
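As an illustration of the single-line definition suggested for §2, the block below links the encoder output to the denoising distribution using the standard DDPM reverse step. The symbols are assumptions made for this sketch, not the paper's notation: $P$ is the input point cloud, $E_\phi$ the point encoder, $z$ its compact output, $a^{k}$ the noisy action at diffusion step $k$, $\epsilon_\theta$ the noise-prediction network, and $\alpha_k$, $\bar{\alpha}_k$, $\sigma_k$ the usual noise-schedule terms. Any additional conditioning, such as robot state, is omitted.

```latex
\begin{aligned}
z &= E_\phi(P), \\
p_\theta\!\left(a^{k-1} \mid a^{k}, z\right)
  &= \mathcal{N}\!\left(a^{k-1};\ \mu_\theta(a^{k}, k, z),\ \sigma_k^{2} I\right), \\
\mu_\theta(a^{k}, k, z)
  &= \frac{1}{\sqrt{\alpha_k}}\left(a^{k}
     - \frac{1-\alpha_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(a^{k}, k, z)\right).
\end{aligned}
```

Written this way, the only path from perception to action is through $z$, which is the dependency the referee asks to have stated explicitly.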
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive suggestions. We address each major comment below and have updated the manuscript accordingly to improve clarity and robustness of the presented results.
Point-by-point responses
-
Referee: [§4.2, Table 1] §4.2 and Table 1: the 24.2% relative improvement is stated without detailing baseline implementations (e.g., whether they receive identical point-cloud input or 2D RGB, network sizes, or training protocols), preventing attribution of gains specifically to the 3D representation versus the diffusion backbone.
Authors: We appreciate this observation. To ensure a fair comparison, all baseline methods were adapted to process the same sparse point cloud inputs as DP3, with network sizes and training protocols matched as closely as possible to the original implementations. We have revised Section 4.2 to explicitly detail the input modalities, architectures, and hyperparameters for each baseline, including a new table in the appendix summarizing these aspects. This allows for clearer attribution of the performance improvements to the 3D representation. revision: yes
-
Referee: [§5.1–5.3] §5.1–5.3: success rates (85% on real robots) and generalization claims are reported without variance, number of trials, or statistical tests; with only 40 demonstrations per task, this leaves open whether results are robust or sensitive to demonstration selection.
Authors: We agree that including variance and trial details would enhance the presentation. In the revised version, we have added the number of evaluation trials conducted (20 independent trials per task for real-robot experiments), reported standard deviations for success rates, and included a brief statistical analysis. Regarding sensitivity to demonstration selection, the 40 demonstrations were collected consistently by the same operator, and the strong performance across diverse tasks indicates robustness; we have added a discussion on this point in Section 5. revision: yes
-
Referee: [§3.1] §3.1: the point encoder is described as operating on sparse clouds, yet no ablation varies point density, adds surface normals, or tests partial occlusions; this directly bears on the central assumption that the representation suffices for contact-rich geometry in the reported real-robot tasks.
Authors: The referee raises a valid point about the need for more ablations on the point cloud representation. While we did not perform exhaustive ablations on point density variations, addition of normals, or simulated occlusions in the original submission, our real-robot experiments directly use sparse point clouds from the sensor in contact-rich scenarios, demonstrating practical sufficiency. To address this, we have added a discussion in Section 3.1 explaining the design choice for sparse representations and included a limited ablation on point density in the supplementary material. Full additional experiments on normals and occlusions would require substantial new data collection and are left for future work. revision: partial
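To show why the trial count matters for the success-rate exchange above, here is a minimal sketch of a Wilson score interval for a binomial success rate. The example count of 17 successes in 20 trials is hypothetical: it is merely consistent with an 85% rate and the 20 trials per task mentioned in the rebuttal, and the actual per-task counts are not given here.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials
                         + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

# Hypothetical count: 17 of 20 trials succeed on one task.
low, high = wilson_interval(17, 20)
```

With 20 trials the interval spans roughly 0.64 to 0.95, so a single 85% figure on its own says little about whether one method reliably beats another on a given task.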
Circularity Check
No circularity: empirical method validated by direct experiments
Full rationale
The paper introduces DP3 as an algorithmic combination of a point-cloud encoder producing compact 3D features and a conditional diffusion policy for action generation. All reported results (success rates, generalization metrics, relative improvements) are obtained from training on limited demonstration sets and evaluating on held-out simulation and real-robot tasks. No derivation chain, equations, or first-principles predictions are present that reduce by construction to fitted parameters, self-definitions, or self-citations. The central claims rest on external experimental benchmarks rather than any internal reduction, satisfying the criteria for a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 21 Pith papers
-
Test-time Sparsity for Extreme Fast Action Diffusion
Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking
A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
-
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
-
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
-
TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies
TAIL-Safe learns a Lipschitz Q-function from digital-twin failure data to identify an empirical control-invariant safe set for imitation learning policies and applies gradient-based recovery to keep actions inside it.
-
FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.
-
ShapeGen: Robotic Data Generation for Category-Level Manipulation
ShapeGen generates shape-diverse 3D robotic manipulation demonstrations without simulators by curating a functional shape library and applying a minimal-annotation pipeline for novel, physically plausible data.
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
-
X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
-
TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies
TAIL-Safe learns a Lipschitz Q-function from visibility, recognizability, and graspability criteria in a Gaussian Splatting twin to define an empirical safe set for IL policies and recovers unsafe actions via Nagumo-i...
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators
FastGrasp uses two-stage RL with CVAE for diverse grasp candidates from point clouds and tactile sensing for impact adjustments to achieve robust fast whole-body grasping in sim and real-world settings.
-
Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.
-
Robot Learning from Human Videos: A Survey
The survey organizes human-video-based robot learning into task-, observation-, and action-oriented transfer pathways, reviews associated datasets, and outlines challenges for scalable embodied AI.
-
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
Reference graph
Works this paper leans on
-
[1]
Ananye Agarwal, Shagun Uppal, Kenneth Shaw, and Deepak Pathak. Dexterous functional grasping. In CoRL, 2023
work page 2023
-
[2]
Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657 , 2022
-
[3]
Dexterous imitation made easy: A learning-based framework for efficient dexterous manip- ulation
Sridhar Pandian Arunachalam, Sneha Silwal, Ben Evans, and Lerrel Pinto. Dexterous imitation made easy: A learning-based framework for efficient dexterous manip- ulation. In ICRA, 2023
work page 2023
-
[4]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv, 2016
work page 2016
-
[5]
Dexart: Benchmarking generalizable dexterous manipu- lation with articulated objects
Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipu- lation with articulated objects. In CVPR, 2023
work page 2023
-
[6]
A system for general in-hand object re-orientation
Tao Chen, Jie Xu, and Pulkit Agrawal. A system for general in-hand object re-orientation. In CoRL, 2022
work page 2022
-
[7]
Visual dexter- ity: In-hand reorientation of novel and complex object shapes
Tao Chen, Megha Tippur, Siyang Wu, Vikash Kumar, Edward Adelson, and Pulkit Agrawal. Visual dexter- ity: In-hand reorientation of novel and complex object shapes. Science Robotics , 8(84):eadc9244, 2023. doi: 10.1126/scirobotics.adc9244
-
[8]
Towards human-level bimanual dexterous manipulation with rein- forcement learning
Yuanpei Chen, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen McAleer, Hao Dong, Song-Chun Zhu, and Yaodong Yang. Towards human-level bimanual dexterous manipulation with rein- forcement learning. NeurIPS, 2022
work page 2022
-
[9]
Sequential dexterity: Chaining dexterous policies for long-horizon manipulation
Yuanpei Chen, Chen Wang, Li Fei-Fei, and C Karen Liu. Sequential dexterity: Chaining dexterous policies for long-horizon manipulation. CoRL, 2023
work page 2023
-
[10]
Dif- fusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. RSS, 2023
work page 2023
-
[11]
Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In CoRL, 2022
work page 2022
-
[12]
Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low- cost whole-body teleoperation. In arXiv, 2024
work page 2024
-
[13]
Act3d: Infinite resolution action detection transformer for robotic manipulation
Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: Infinite resolution action detection transformer for robotic manipulation. arXiv preprint arXiv:2306.17817, 2023
-
[14]
Rvt: Robotic view transformer for 3d object manipulation
Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. arXiv, 2023
work page 2023
-
[15]
Scaling up and distilling down: Language-guided robot skill acquisition
Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning . PMLR, 2023
work page 2023
-
[16]
Teach a robot to fish: Versatile imitation from one minute of demonstrations
Siddhant Haldar, Jyothish Pari, Anant Rai, and Lerrel Pinto. Teach a robot to fish: Versatile imitation from one minute of demonstrations. RSS, 2023
work page 2023
-
[17]
Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system
Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020
work page 2020
-
[18]
Dextreme: Transfer of agile in-hand manipulation from simulation to reality
Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, et al. Dextreme: Transfer of agile in-hand manipulation from simulation to reality. In ICRA, 2023
work page 2023
-
[19]
Stabilizing deep q-learning with convnets and vision transformers under data augmentation
Nicklas Hansen, Hao Su, and Xiaolong Wang. Stabilizing deep q-learning with convnets and vision transformers under data augmentation. Advances in neural information processing systems, 2021
work page 2021
-
[20]
On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline
Nicklas Hansen, Zhecheng Yuan, Yanjie Ze, Tongzhou Mu, Aravind Rajeswaran, Hao Su, Huazhe Xu, and Xiaolong Wang. On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline. In Interna- tional Conference on Machine Learning (ICML) , 2022
work page 2022
-
[21]
Modem: Accel- erating visual model-based reinforcement learning with demonstrations
Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, and Aravind Rajeswaran. Modem: Accel- erating visual model-based reinforcement learning with demonstrations. In ICLR, 2023
work page 2023
-
[22]
Td-mpc2: Scalable, robust world models for continuous control
Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. arXiv, 2023
work page 2023
-
[23]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020
work page 2020
-
[24]
Dynamic handover: Throw and catch with bi- manual hands
Binghao Huang, Yuanpei Chen, Tianyu Wang, Yuzhe Qin, Yaodong Yang, Nikolay Atanasov, and Xiaolong Wang. Dynamic handover: Throw and catch with bi- manual hands. CoRL, 2023
work page 2023
-
[25]
Diffusion reward: Learning rewards via conditional video diffusion
Tao Huang, Guangqi Jiang, Yanjie Ze, and Huazhe Xu. Diffusion reward: Learning rewards via conditional video diffusion. arXiv, 2023
work page 2023
-
[26]
Plas- ticinelab: A soft-body manipulation benchmark with dif- ferentiable physics
Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Plas- ticinelab: A soft-body manipulation benchmark with dif- ferentiable physics. arXiv, 2021
work page 2021
-
[27]
Planning with diffusion for flexible behavior synthesis
Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv, 2022
work page 2022
-
[28]
3d diffuser actor: Policy diffusion with 3d scene representations
Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. Arxiv, 2024
work page 2024
-
[29]
3d gaussian splatting for real-time radiance field rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 2023
work page 2023
-
[30]
Kun Lei, Zhengmao He, Chenhao Lu, Kaizhe Hu, Yang Gao, and Huazhe Xu. Uni-o4: Unifying online and offline deep reinforcement learning with multi-step on-policy optimization. arXiv, 2023
work page 2023
-
[31]
Sizhe Li, Zhiao Huang, Tao Chen, Tao Du, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Dexdeform: Dexterous deformable object manipulation with human demonstrations and differentiable physics. arXiv, 2023
work page 2023
-
[32]
Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongx- uan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv, 2022
work page 2022
-
[33]
Eureka: Human- level reward design via coding large language models
Yecheng Jason Ma, William Liang, Guanzhi Wang, De- An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human- level reward design via coding large language models. arXiv, 2023
work page 2023
-
[34]
Isaac gym: High performance gpu-based physics simulation for robot learning
Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv, 2021
work page 2021
-
[35]
What matters in learning from offline human demonstra- tions for robot manipulation
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demonstra- tions for robot manipulation. arXiv, 2021
work page 2021
-
[36]
Nerf: Representing scenes as neural radiance fields for view synthesis
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM , 2021
work page 2021
-
[37]
Extracting reward functions from diffusion models
Felipe Nuti, Tim Franzmeyer, and Jo ˜ao F Henriques. Extracting reward functions from diffusion models. arXiv preprint arXiv:2306.01804, 2023
-
[38]
The surprising ef- fectiveness of representation learning for visual imitation
Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pan- dian Arunachalam, and Lerrel Pinto. The surprising ef- fectiveness of representation learning for visual imitation. arXiv preprint arXiv:2112.01511 , 2021
-
[39]
Imitating human behaviour with dif- fusion models
Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcar- cel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with dif- fusion models. ICLR, 2023
work page 2023
-
[40]
Learning agile robotic locomotion skills by imitating animals
Xue Bin Peng, Erwin Coumans, Tingnan Zhang, Tsang- Wei Lee, Jie Tan, and Sergey Levine. Learning agile robotic locomotion skills by imitating animals. arXiv, 2020
work page 2020
-
[41]
Consistency policy: Accelerated visuo- motor policies via consistency distillation
Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency policy: Accelerated visuo- motor policies via consistency distillation. In Robotics: Science and Systems , 2024
work page 2024
-
[42]
Pointnet: Deep learning on point sets for 3d classification and segmentation
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017
work page 2017
-
[43]
Pointnet++: Deep hierarchical feature learning on point sets in a metric space
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS, 2017
work page 2017
-
[44]
In-hand object rotation via rapid motor adaptation
Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-hand object rotation via rapid motor adaptation. In CoRL, 2023
work page 2023
-
[45]
General in-hand object rotation with vision and touch
Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, and Jitendra Malik. General in-hand object rotation with vision and touch. In CoRL, 2023
work page 2023
-
[46]
Pointnext: Revisiting pointnet++ with improved training and scaling strategies
Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. NeurIPS, 2022
work page 2022
-
[47]
Dexmv: Im- itation learning for dexterous manipulation from human videos
Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Im- itation learning for dexterous manipulation from human videos. In ECCV, 2022
work page 2022
-
[48]
Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system
Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dietor Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. arXiv preprint arXiv:2307.04577, 2023
-
[49]
Learning complex dexterous manipula- tion with deep reinforcement learning and demonstra- tions
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipula- tion with deep reinforcement learning and demonstra- tions. arXiv, 2017
work page 2017
-
[50]
Goal-conditioned imitation learning us- ing score-based diffusion policies
Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning us- ing score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023
-
[51]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022
work page 2022
-
[52]
Edmp: Ensemble-of-costs-guided diffusion for motion planning
Kallol Saha, Vishal Mandadi, Jayaram Reddy, Ajit Srikanth, Aditya Agarwal, Bipasha Sen, Arun Singh, and Madhava Krishna. Edmp: Ensemble-of-costs-guided diffusion for motion planning. arXiv, 2023
work page 2023
-
[53]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[54]
Deep imitation learning for humanoid loco-manipulation through human teleoperation
Mingyo Seo, Steve Han, Kyutae Sim, Seung Hyeon Bang, Carlos Gonzalez, Luis Sentis, and Yuke Zhu. Deep imitation learning for humanoid loco-manipulation through human teleoperation. Humanoids, 2023
work page 2023
-
[55]
Masked world models for visual control
Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In CoRL, 2023
work page 2023
-
[56]
Behavior transformers: Cloning k modes with one stone
Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems , 2022
work page 2022
-
[57]
Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv, 2023
work page 2023
-
[58]
Distilled feature fields enable few-shot language-guided manipulation
William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931 , 2023
-
[59]
Robocook: Long-horizon elasto-plastic object manipulation with diverse tools
Haochen Shi, Huazhe Xu, Samuel Clarke, Yunzhu Li, and Jiajun Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. Proceedings of the 7th Conference on Robot Learning (CoRL) , 2023
work page 2023
-
[60]
Perceiver-actor: A multi-task transformer for robotic ma- nipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic ma- nipulation. In CoRL, 2023
work page 2023
-
[61]
Shelving, stacking, hanging: Relational pose diffusion for multi-modal rearrangement
Anthony Simeonov, Ankit Goyal, Lucas Manuelli, Lin Yen-Chen, Alina Sarmiento, Alberto Rodriguez, Pulkit Agrawal, and Dieter Fox. Shelving, stacking, hanging: Relational pose diffusion for multi-modal rearrangement. arXiv preprint arXiv:2307.04751 , 2023
-
[62]
Denoising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. ICLR, 2021
-
[63]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ICLR, 2021
-
[64]
Memory-consistent neural networks for imitation learning
Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, James Weimer, and Insup Lee. Memory-consistent neural networks for imitation learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=R3Tf7LDdX4
-
[65]
Mujoco: A physics engine for model-based control
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012
-
[66]
Julen Urain, Niklas Funk, Jan Peters, and Georgia Chalvatzaki. SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023
-
[67]
Vrl3: A data-driven framework for visual deep reinforcement learning
Che Wang, Xufang Luo, Keith Ross, and Dongsheng Li. Vrl3: A data-driven framework for visual deep reinforcement learning. Advances in Neural Information Processing Systems, 2022
-
[68]
Mimicplay: Long-horizon imitation learning by watching human play
Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. CoRL, 2023
-
[69]
Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, and C Karen Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. arXiv preprint arXiv:2403.07788, 2024
-
[70]
Diffusion policies as an expressive policy class for offline reinforcement learning
Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. ICLR, 2023
-
[71]
Learning score-based grasping primitive for human-assisting dexterous grasping
Tianhao Wu, Mingdong Wu, Jiyao Zhang, Yunchong Gan, and Hao Dong. Learning score-based grasping primitive for human-assisting dexterous grasping. In NeurIPS, 2023
-
[72]
Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation
Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In CoRL, 2023
-
[73]
Sapien: A simulated part-based interactive environment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In CVPR, 2020
-
[74]
NeRFuser: Diffusion guided multi-task 3d policy learning, 2024
Ge Yan, Yueh-Hua Wu, and Xiaolong Wang. NeRFuser: Diffusion guided multi-task 3d policy learning, 2024. URL https://openreview.net/forum?id=8GmPLkO0oR
-
[75]
Movie: Visual model-based policy adaptation for view generalization
Sizhe Yang, Yanjie Ze, and Huazhe Xu. Movie: Visual model-based policy adaptation for view generalization. Annual Conference on Neural Information Processing Systems (NeurIPS), 2023
-
[76]
Rotating without seeing: Towards in-hand dexterity through touch
Zhao-Heng Yin, Binghao Huang, Yuzhe Qin, Qifeng Chen, and Xiaolong Wang. Rotating without seeing: Towards in-hand dexterity through touch. RSS, 2023
-
[77]
Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, 2020
-
[78]
Robot synesthesia: In-hand manipulation with visuotactile sensing
Ying Yuan, Haichuan Che, Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Kang-Won Lee, Yi Wu, Soo-Chul Lim, and Xiaolong Wang. Robot synesthesia: In-hand manipulation with visuotactile sensing. arXiv, 2023
-
[79]
Pre-trained image encoder for generalizable visual reinforcement learning
Zhecheng Yuan, Zhengrong Xue, Bo Yuan, Xueqian Wang, Yi Wu, Yang Gao, and Huazhe Xu. Pre-trained image encoder for generalizable visual reinforcement learning. Advances in Neural Information Processing Systems, 2022
-
[80]
Visual reinforcement learning with self-supervised 3d representations
Yanjie Ze, Nicklas Hansen, Yinbo Chen, Mohit Jain, and Xiaolong Wang. Visual reinforcement learning with self-supervised 3d representations. IEEE Robotics and Automation Letters, 2023