arxiv: 2403.03954 · v7 · submitted 2024-03-06 · 💻 cs.RO · cs.CV· cs.LG

Recognition: 3 theorem links

· Lean Theorem

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Yanjie Ze , Gu Zhang , Kangning Zhang , Chenyuan Hu , Muhan Wang , Huazhe Xu

Authors on Pith no claims yet

Pith reviewed 2026-05-15 01:45 UTC · model grok-4.3

classification 💻 cs.RO cs.CVcs.LG

keywords 3D diffusion policyvisuomotor policyimitation learningpoint cloud representationrobot manipulationdiffusion modelsgeneralization

0 comments

The pith

A compact 3D point-cloud representation lets diffusion policies learn precise robot manipulation from only ten demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 3D Diffusion Policy, a method that combines diffusion-based action generation with a compact 3D visual feature extracted from sparse point clouds. This design allows the policy to master complex visuomotor tasks using far fewer human demonstrations than prior approaches. Experiments across 72 simulation tasks show that the method succeeds on most tasks with 10 demonstrations and improves performance by 24.2 percent over baselines. On four real-robot tasks the policy reaches 85 percent success with 40 demonstrations per task, generalizes to changes in position, viewpoint, appearance and object instance, and avoids safety violations that plague image-based baselines.

Core claim

By conditioning a diffusion policy on a compact 3D representation extracted from sparse point clouds via an efficient point encoder, DP3 produces actions that achieve high success rates with minimal demonstrations and exhibit strong generalization in simulation and on physical robots.

What carries the argument

The compact 3D visual representation obtained from sparse point clouds through an efficient point encoder, which conditions the diffusion model for action generation.

If this is right

Most simulated tasks become solvable with only 10 demonstrations instead of hundreds.
Real-robot success reaches 85 percent with 40 demonstrations while preserving safety constraints.
Generalization holds across spatial shifts, camera viewpoints, object appearances, and specific instances.
Baseline methods that rely on 2D images violate safety rules far more often and require human intervention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the 3D encoding works for these tasks, similar geometric priors could reduce data needs in other control domains such as locomotion or navigation.
Adding temporal consistency to the point-cloud encoder might further improve performance on long-horizon tasks.
The safety benefit implies that explicit 3D structure helps models reason about collisions more reliably than raw images.
Deploying the same pipeline on different robot platforms could test whether the representation is embodiment-agnostic.

Load-bearing premise

The assumption that a compact 3D point-cloud encoding captures every piece of task-relevant geometry without missing critical details that only richer visual or tactile data would provide.

What would settle it

Running the real-robot experiments again with objects whose shapes differ markedly from those seen in the 40 training demonstrations; a sharp drop in success rate would indicate that the 3D representation does not generalize as claimed.

read the original abstract

Imitation learning provides an efficient way to teach robots dexterous skills; however, learning complex skills robustly and generalizablely usually consumes large amounts of human demonstrations. To tackle this challenging problem, we present 3D Diffusion Policy (DP3), a novel visual imitation learning approach that incorporates the power of 3D visual representations into diffusion policies, a class of conditional action generative models. The core design of DP3 is the utilization of a compact 3D visual representation, extracted from sparse point clouds with an efficient point encoder. In our experiments involving 72 simulation tasks, DP3 successfully handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement. In 4 real robot tasks, DP3 demonstrates precise control with a high success rate of 85%, given only 40 demonstrations of each task, and shows excellent generalization abilities in diverse aspects, including space, viewpoint, appearance, and instance. Interestingly, in real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do, necessitating human intervention. Our extensive evaluation highlights the critical importance of 3D representations in real-world robot learning. Videos, code, and data are available on https://3d-diffusion-policy.github.io .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DP3 shows that a compact point-cloud encoder inside a diffusion policy can cut demonstration needs and improve real-robot generalization on the tested tasks.

read the letter

The main thing to know is that this paper takes a standard diffusion policy and swaps the usual image encoder for a simple point-cloud one, then reports clear gains in sample efficiency and cross-condition robustness on both sim and real hardware. With 10 demos it beats baselines by 24% relative across 72 simulation tasks, and on four real-robot tasks it hits 85% success with 40 demos while generalizing to new positions, views, appearances, and object instances. The safety observation—that the policy rarely needs human intervention—is also worth noting because it is not common in these comparisons. They release code and data, which makes the numbers checkable. That combination of modest architectural change plus concrete real-robot numbers is the useful part. The experiments appear to rest on direct head-to-head runs rather than any circular fitting, so the central claim is at least falsifiable. The soft spot is the vision side. The abstract and stress-test note both flag that sparse point clouds can drop fine surface details needed for tight contact. If the four real tasks are mostly loose pick-and-place, the 85% number holds; if any involve insertion or precise alignment, the encoder may be overfitting to the exact point distributions in the 40 demos rather than learning the underlying geometry. The paper would benefit from an ablation that varies point density or adds normals to show this is not the case. Overall this is aimed at people doing imitation learning who want something that works with limited data and runs on real arms. A reader already working on diffusion policies or 3D representations for robotics will find the empirical comparisons and released artifacts directly usable. It is solid enough on the results side to deserve a serious referee even if the vision ablation needs tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces 3D Diffusion Policy (DP3), which augments diffusion policies with a compact 3D visual representation extracted from sparse point clouds using an efficient point encoder. It reports that DP3 solves most of 72 simulation tasks with only 10 demonstrations and achieves a 24.2% relative improvement over baselines; in four real-robot tasks it reaches 85% success with 40 demonstrations per task while exhibiting generalization across space, viewpoint, appearance, and instance, and rarely violating safety constraints unlike baselines.

Significance. If the experimental claims hold under fuller controls, the work demonstrates that lightweight 3D representations can markedly improve sample efficiency and robustness in visuomotor imitation learning, offering a practical route to reduce demonstration requirements for real-world manipulation. The public release of code, data, and videos strengthens reproducibility.

major comments (3)

[§4.2, Table 1] §4.2 and Table 1: the 24.2% relative improvement is stated without detailing baseline implementations (e.g., whether they receive identical point-cloud input or 2D RGB, network sizes, or training protocols), preventing attribution of gains specifically to the 3D representation versus the diffusion backbone.
[§5.1–5.3] §5.1–5.3: success rates (85% on real robots) and generalization claims are reported without variance, number of trials, or statistical tests; with only 40 demonstrations per task, this leaves open whether results are robust or sensitive to demonstration selection.
[§3.1] §3.1: the point encoder is described as operating on sparse clouds, yet no ablation varies point density, adds surface normals, or tests partial occlusions; this directly bears on the central assumption that the representation suffices for contact-rich geometry in the reported real-robot tasks.

minor comments (2)

[Figure 3, §4.3] Figure 3 and §4.3: axis labels and legend entries are too small for print; success-rate plots would benefit from error bars or per-seed scatter.
[§2] §2: the notation for the conditional diffusion process mixes p_θ and the 3D encoder output without a clear diagram or equation linking them; a single-line definition would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and constructive suggestions. We address each major comment below and have updated the manuscript accordingly to improve clarity and robustness of the presented results.

read point-by-point responses

Referee: [§4.2, Table 1] §4.2 and Table 1: the 24.2% relative improvement is stated without detailing baseline implementations (e.g., whether they receive identical point-cloud input or 2D RGB, network sizes, or training protocols), preventing attribution of gains specifically to the 3D representation versus the diffusion backbone.

Authors: We appreciate this observation. To ensure a fair comparison, all baseline methods were adapted to process the same sparse point cloud inputs as DP3, with network sizes and training protocols matched as closely as possible to the original implementations. We have revised Section 4.2 to explicitly detail the input modalities, architectures, and hyperparameters for each baseline, including a new table in the appendix summarizing these aspects. This allows for clearer attribution of the performance improvements to the 3D representation. revision: yes
Referee: [§5.1–5.3] §5.1–5.3: success rates (85% on real robots) and generalization claims are reported without variance, number of trials, or statistical tests; with only 40 demonstrations per task, this leaves open whether results are robust or sensitive to demonstration selection.

Authors: We agree that including variance and trial details would enhance the presentation. In the revised version, we have added the number of evaluation trials conducted (20 independent trials per task for real-robot experiments), reported standard deviations for success rates, and included a brief statistical analysis. Regarding sensitivity to demonstration selection, the 40 demonstrations were collected consistently by the same operator, and the strong performance across diverse tasks indicates robustness; we have added a discussion on this point in Section 5. revision: yes
Referee: [§3.1] §3.1: the point encoder is described as operating on sparse clouds, yet no ablation varies point density, adds surface normals, or tests partial occlusions; this directly bears on the central assumption that the representation suffices for contact-rich geometry in the reported real-robot tasks.

Authors: The referee raises a valid point about the need for more ablations on the point cloud representation. While we did not perform exhaustive ablations on point density variations, addition of normals, or simulated occlusions in the original submission, our real-robot experiments directly use sparse point clouds from the sensor in contact-rich scenarios, demonstrating practical sufficiency. To address this, we have added a discussion in Section 3.1 explaining the design choice for sparse representations and included a limited ablation on point density in the supplementary material. Full additional experiments on normals and occlusions would require substantial new data collection and are left for future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method validated by direct experiments

full rationale

The paper introduces DP3 as an algorithmic combination of a point-cloud encoder producing compact 3D features and a conditional diffusion policy for action generation. All reported results (success rates, generalization metrics, relative improvements) are obtained from training on limited demonstration sets and evaluating on held-out simulation and real-robot tasks. No derivation chain, equations, or first-principles predictions are present that reduce by construction to fitted parameters, self-definitions, or self-citations. The central claims rest on external experimental benchmarks rather than any internal reduction, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical paper in robotics. No new mathematical axioms, free parameters beyond standard diffusion-model training, or invented physical entities are introduced in the abstract. The approach inherits standard assumptions from imitation learning and diffusion models.

pith-pipeline@v0.9.0 · 5550 in / 1149 out tokens · 41788 ms · 2026-05-15T01:45:13.165980+00:00 · methodology

discussion (0)

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-time Sparsity for Extreme Fast Action Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
cs.RO 2026-04 unverdicted novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
cs.RO 2026-04 conditional novelty 7.0

BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking
cs.RO 2026-04 unverdicted novelty 7.0

A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
cs.RO 2026-05 unverdicted novelty 6.0

StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies
cs.RO 2026-05 unverdicted novelty 6.0

TAIL-Safe learns a Lipschitz Q-function from digital-twin failure data to identify an empirical control-invariant safe set for imitation learning policies and applies gradient-based recovery to keep actions inside it.
FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
cs.RO 2026-04 conditional novelty 6.0

FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.
ShapeGen: Robotic Data Generation for Category-Level Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

ShapeGen generates shape-diverse 3D robotic manipulation demonstrations without simulators by curating a functional shape library and applying a minimal-annotation pipeline for novel, physically plausible data.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO 2026-04 unverdicted novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
cs.RO 2026-04 unverdicted novelty 6.0

A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO 2026-04 conditional novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
cs.RO 2026-05 unverdicted novelty 5.0

AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
cs.RO 2026-05 unverdicted novelty 5.0

X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies
cs.RO 2026-05 unverdicted novelty 5.0

TAIL-Safe learns a Lipschitz Q-function from visibility, recognizability, and graspability criteria in a Gaussian Splatting twin to define an empirical safe set for IL policies and recovers unsafe actions via Nagumo-i...
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
cs.RO 2026-04 unverdicted novelty 5.0

StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
R3D: Revisiting 3D Policy Learning
cs.CV 2026-04 unverdicted novelty 5.0

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators
cs.RO 2026-04 unverdicted novelty 5.0

FastGrasp uses two-stage RL with CVAE for diverse grasp candidates from point clouds and tactile sensing for impact adjustments to achieve robust fast whole-body grasping in sim and real-world settings.
Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
cs.CV 2026-04 unverdicted novelty 5.0

UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.
Robot Learning from Human Videos: A Survey
cs.RO 2026-04 unverdicted novelty 2.0

The survey organizes human-video-based robot learning into task-, observation-, and action-oriented transfer pathways, reviews associated datasets, and outlines challenges for scalable embodied AI.
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
eess.SY 2026-04 unverdicted novelty 2.0

A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 20 Pith papers · 2 internal anchors

[1]

Dexterous functional grasping

Ananye Agarwal, Shagun Uppal, Kenneth Shaw, and Deepak Pathak. Dexterous functional grasping. In CoRL, 2023

work page 2023
[2]

Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657 , 2022

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657 , 2022

work page arXiv 2022
[3]

Dexterous imitation made easy: A learning-based framework for efficient dexterous manip- ulation

Sridhar Pandian Arunachalam, Sneha Silwal, Ben Evans, and Lerrel Pinto. Dexterous imitation made easy: A learning-based framework for efficient dexterous manip- ulation. In ICRA, 2023

work page 2023
[4]

Layer normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv, 2016

work page 2016
[5]

Dexart: Benchmarking generalizable dexterous manipu- lation with articulated objects

Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipu- lation with articulated objects. In CVPR, 2023

work page 2023
[6]

A system for general in-hand object re-orientation

Tao Chen, Jie Xu, and Pulkit Agrawal. A system for general in-hand object re-orientation. In CoRL, 2022

work page 2022
[7]

Visual dexter- ity: In-hand reorientation of novel and complex object shapes

Tao Chen, Megha Tippur, Siyang Wu, Vikash Kumar, Edward Adelson, and Pulkit Agrawal. Visual dexter- ity: In-hand reorientation of novel and complex object shapes. Science Robotics , 8(84):eadc9244, 2023. doi: 10.1126/scirobotics.adc9244

work page doi:10.1126/scirobotics.adc9244 2023
[8]

Towards human-level bimanual dexterous manipulation with rein- forcement learning

Yuanpei Chen, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuan Jiang, Zongqing Lu, Stephen McAleer, Hao Dong, Song-Chun Zhu, and Yaodong Yang. Towards human-level bimanual dexterous manipulation with rein- forcement learning. NeurIPS, 2022

work page 2022
[9]

Sequential dexterity: Chaining dexterous policies for long-horizon manipulation

Yuanpei Chen, Chen Wang, Li Fei-Fei, and C Karen Liu. Sequential dexterity: Chaining dexterous policies for long-horizon manipulation. CoRL, 2023

work page 2023
[10]

Dif- fusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. RSS, 2023

work page 2023
[11]

Implicit behavioral cloning

Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In CoRL, 2022

work page 2022
[12]

Zhao, and Chelsea Finn

Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low- cost whole-body teleoperation. In arXiv, 2024

work page 2024
[13]

Act3d: Infinite resolution action detection transformer for robotic manipulation

Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: Infinite resolution action detection transformer for robotic manipulation. arXiv preprint arXiv:2306.17817, 2023

work page arXiv 2023
[14]

Rvt: Robotic view transformer for 3d object manipulation

Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. arXiv, 2023

work page 2023
[15]

Scaling up and distilling down: Language-guided robot skill acquisition

Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning . PMLR, 2023

work page 2023
[16]

Teach a robot to fish: Versatile imitation from one minute of demonstrations

Siddhant Haldar, Jyothish Pari, Anant Rai, and Lerrel Pinto. Teach a robot to fish: Versatile imitation from one minute of demonstrations. RSS, 2023

work page 2023
[17]

Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system

Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020

work page 2020
[18]

Dextreme: Transfer of agile in-hand manipulation from simulation to reality

Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, et al. Dextreme: Transfer of agile in-hand manipulation from simulation to reality. In ICRA, 2023

work page 2023
[19]

Stabilizing deep q-learning with convnets and vision transformers under data augmentation

Nicklas Hansen, Hao Su, and Xiaolong Wang. Stabilizing deep q-learning with convnets and vision transformers under data augmentation. Advances in neural information processing systems, 2021

work page 2021
[20]

On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline

Nicklas Hansen, Zhecheng Yuan, Yanjie Ze, Tongzhou Mu, Aravind Rajeswaran, Hao Su, Huazhe Xu, and Xiaolong Wang. On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline. In Interna- tional Conference on Machine Learning (ICML) , 2022

work page 2022
[21]

Modem: Accel- erating visual model-based reinforcement learning with demonstrations

Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, and Aravind Rajeswaran. Modem: Accel- erating visual model-based reinforcement learning with demonstrations. In ICLR, 2023

work page 2023
[22]

Td-mpc2: Scalable, robust world models for continuous control

Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. arXiv, 2023

work page 2023
[23]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020

work page 2020
[24]

Dynamic handover: Throw and catch with bi- manual hands

Binghao Huang, Yuanpei Chen, Tianyu Wang, Yuzhe Qin, Yaodong Yang, Nikolay Atanasov, and Xiaolong Wang. Dynamic handover: Throw and catch with bi- manual hands. CoRL, 2023

work page 2023
[25]

Diffusion reward: Learning rewards via conditional video diffusion

Tao Huang, Guangqi Jiang, Yanjie Ze, and Huazhe Xu. Diffusion reward: Learning rewards via conditional video diffusion. arXiv, 2023

work page 2023
[26]

Plas- ticinelab: A soft-body manipulation benchmark with dif- ferentiable physics

Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Plas- ticinelab: A soft-body manipulation benchmark with dif- ferentiable physics. arXiv, 2021

work page 2021
[27]

Planning with diffusion for flexible behavior synthesis

Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv, 2022

work page 2022
[28]

3d diffuser actor: Policy diffusion with 3d scene representations

Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. Arxiv, 2024

work page 2024
[29]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 2023

work page 2023
[30]

Uni-o4: Unifying online and offline deep reinforcement learning with multi-step on-policy optimization

Kun Lei, Zhengmao He, Chenhao Lu, Kaizhe Hu, Yang Gao, and Huazhe Xu. Uni-o4: Unifying online and offline deep reinforcement learning with multi-step on-policy optimization. arXiv, 2023

work page 2023
[31]

Dexdeform: Dexterous deformable object manipulation with human demonstrations and differentiable physics

Sizhe Li, Zhiao Huang, Tao Chen, Tao Du, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Dexdeform: Dexterous deformable object manipulation with human demonstrations and differentiable physics. arXiv, 2023

work page 2023
[32]

Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongx- uan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv, 2022

work page 2022
[33]

Eureka: Human- level reward design via coding large language models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De- An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human- level reward design via coding large language models. arXiv, 2023

work page 2023
[34]

Isaac gym: High performance gpu-based physics simulation for robot learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv, 2021

work page 2021
[35]

What matters in learning from offline human demonstra- tions for robot manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demonstra- tions for robot manipulation. arXiv, 2021

work page 2021
[36]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM , 2021

work page 2021
[37]

Extracting reward functions from diffusion models

Felipe Nuti, Tim Franzmeyer, and Jo ˜ao F Henriques. Extracting reward functions from diffusion models. arXiv preprint arXiv:2306.01804, 2023

work page arXiv 2023
[38]

The surprising ef- fectiveness of representation learning for visual imitation

Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pan- dian Arunachalam, and Lerrel Pinto. The surprising ef- fectiveness of representation learning for visual imitation. arXiv preprint arXiv:2112.01511 , 2021

work page arXiv 2021
[39]

Imitating human behaviour with dif- fusion models

Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcar- cel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with dif- fusion models. ICLR, 2023

work page 2023
[40]

Learning agile robotic locomotion skills by imitating animals

Xue Bin Peng, Erwin Coumans, Tingnan Zhang, Tsang- Wei Lee, Jie Tan, and Sergey Levine. Learning agile robotic locomotion skills by imitating animals. arXiv, 2020

work page 2020
[41]

Consistency policy: Accelerated visuo- motor policies via consistency distillation

Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency policy: Accelerated visuo- motor policies via consistency distillation. In Robotics: Science and Systems , 2024

work page 2024
[42]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017

work page 2017
[43]

Pointnet++: Deep hierarchical feature learning on point sets in a metric space

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS, 2017

work page 2017
[44]

In-hand object rotation via rapid motor adaptation

Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-hand object rotation via rapid motor adaptation. In CoRL, 2023

work page 2023
[45]

General in-hand object rotation with vision and touch

Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, and Jitendra Malik. General in-hand object rotation with vision and touch. In CoRL, 2023

work page 2023
[46]

Pointnext: Revisiting pointnet++ with improved training and scaling strategies

Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. NeurIPS, 2022

work page 2022
[47]

Dexmv: Im- itation learning for dexterous manipulation from human videos

Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Im- itation learning for dexterous manipulation from human videos. In ECCV, 2022

work page 2022
[48]

Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system

Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dietor Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system. arXiv preprint arXiv:2307.04577, 2023

work page arXiv 2023
[49]

Learning complex dexterous manipula- tion with deep reinforcement learning and demonstra- tions

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipula- tion with deep reinforcement learning and demonstra- tions. arXiv, 2017

work page 2017
[50]

Goal-conditioned imitation learning us- ing score-based diffusion policies

Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning us- ing score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023

work page arXiv 2023
[51]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022

work page 2022
[52]

Edmp: Ensemble-of-costs-guided diffusion for motion planning

Kallol Saha, Vishal Mandadi, Jayaram Reddy, Ajit Srikanth, Aditya Agarwal, Bipasha Sen, Arun Singh, and Madhava Krishna. Edmp: Ensemble-of-costs-guided diffusion for motion planning. arXiv, 2023

work page 2023
[53]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[54]

Deep imitation learning for humanoid loco-manipulation through human teleoperation

Mingyo Seo, Steve Han, Kyutae Sim, Seung Hyeon Bang, Carlos Gonzalez, Luis Sentis, and Yuke Zhu. Deep imitation learning for humanoid loco-manipulation through human teleoperation. Humanoids, 2023

work page 2023
[55]

Masked world models for visual control

Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In CoRL, 2023

work page 2023
[56]

Behavior transformers: Cloning k modes with one stone

Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems , 2022

work page 2022
[57]

On bringing robots home

Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv, 2023

work page 2023
[58]

Distilled feature fields enable few-shot language-guided manipulation

William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931 , 2023

work page arXiv 2023
[59]

Robocook: Long-horizon elasto-plastic object manipulation with diverse tools

Haochen Shi, Huazhe Xu, Samuel Clarke, Yunzhu Li, and Jiajun Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. Proceedings of the 7th Conference on Robot Learning (CoRL) , 2023

work page 2023
[60]

Perceiver-actor: A multi-task transformer for robotic ma- nipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic ma- nipulation. In CoRL, 2023

work page 2023
[61]

Shelving, stacking, hanging: Relational pose diffusion for multi-modal rearrangement

Anthony Simeonov, Ankit Goyal, Lucas Manuelli, Lin Yen-Chen, Alina Sarmiento, Alberto Rodriguez, Pulkit Agrawal, and Dieter Fox. Shelving, stacking, hanging: Relational pose diffusion for multi-modal rearrangement. arXiv preprint arXiv:2307.04751 , 2023

work page arXiv 2023
[62]

De- noising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. De- noising diffusion implicit models. ICLR, 2021

work page 2021
[63]

Score- based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score- based generative modeling through stochastic differential equations. ICLR, 2021

work page 2021
[64]

Memory-consistent neural networks for imitation learning

Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, James Weimer, and Insup Lee. Memory-consistent neural networks for imitation learning. In The Twelfth Inter- national Conference on Learning Representations , 2024. URL https://openreview.net/forum?id=R3Tf7LDdX4

work page 2024
[65]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012

work page 2012
[66]

Se (3)-diffusionfields: Learning smooth cost functions for joint grasp and motion optimization through diffusion

Julen Urain, Niklas Funk, Jan Peters, and Georgia Chal- vatzaki. Se (3)-diffusionfields: Learning smooth cost functions for joint grasp and motion optimization through diffusion. In 2023 IEEE International Conference on Robotics and Automation (ICRA) . IEEE, 2023

work page 2023
[67]

Vrl3: A data-driven framework for visual deep reinforcement learning

Che Wang, Xufang Luo, Keith Ross, and Dongsheng Li. Vrl3: A data-driven framework for visual deep reinforcement learning. Advances in Neural Information Processing Systems, 2022

work page 2022
[68]

Mimicplay: Long-horizon imitation learning by watching human play

Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anand- kumar. Mimicplay: Long-horizon imitation learning by watching human play. CoRL, 2023

work page 2023
[69]

Dexcap: Scalable and portable mocap data collection system for dexterous manipulation.arXiv preprint arXiv:2403.07788, 2024

Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, and C Karen Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. arXiv preprint arXiv:2403.07788 , 2024

work page arXiv 2024
[70]

Diffusion policies as an expressive policy class for offline reinforcement learning

Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. ICLR, 2023

work page 2023
[71]

Learning score-based grasping primitive for human-assisting dexterous grasping

Tianhao Wu, Mingdong Wu, Jiyao Zhang, Yunchong Gan, and Hao Dong. Learning score-based grasping primitive for human-assisting dexterous grasping. In NeurIPS, 2023

work page 2023
[72]

Chaineddif- fuser: Unifying trajectory diffusion and keypose predic- tion for robotic manipulation

Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. Chaineddif- fuser: Unifying trajectory diffusion and keypose predic- tion for robotic manipulation. In CoRL, 2023

work page 2023
[73]

Sapien: A simulated part-based interactive environment

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environment. In CVPR, 2020

work page 2020
[74]

NeRFuser: Diffusion guided multi-task 3d policy learning, 2024

Ge Yan, Yueh-Hua Wu, and Xiaolong Wang. NeRFuser: Diffusion guided multi-task 3d policy learning, 2024. URL https://openreview.net/forum?id=8GmPLkO0oR

work page 2024
[75]

Movie: Visual model-based policy adaptation for view generalization

Sizhe Yang, Yanjie Ze, and Huazhe Xu. Movie: Visual model-based policy adaptation for view generalization. Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[76]

Rotating without seeing: Towards in-hand dexterity through touch

Zhao-Heng Yin, Binghao Huang, Yuzhe Qin, Qifeng Chen, and Xiaolong Wang. Rotating without seeing: Towards in-hand dexterity through touch. RSS, 2023

work page 2023
[77]

Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, 2020

work page 2020
[78]

Robot synesthesia: In-hand manip- ulation with visuotactile sensing

Ying Yuan, Haichuan Che, Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Kang-Won Lee, Yi Wu, Soo-Chul Lim, and Xiaolong Wang. Robot synesthesia: In-hand manip- ulation with visuotactile sensing. arXiv, 2023

work page 2023
[79]

Pre-trained image encoder for generalizable visual reinforcement learning

Zhecheng Yuan, Zhengrong Xue, Bo Yuan, Xueqian Wang, Yi Wu, Yang Gao, and Huazhe Xu. Pre-trained image encoder for generalizable visual reinforcement learning. Advances in Neural Information Processing Systems, 2022

work page 2022
[80]

Visual reinforcement learning with self-supervised 3d representations

Yanjie Ze, Nicklas Hansen, Yinbo Chen, Mohit Jain, and Xiaolong Wang. Visual reinforcement learning with self-supervised 3d representations. IEEE Robotics and Automation Letters, 2023

work page 2023

Showing first 80 references.