EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment

Kangchen Lv; Mingrui Yu; Shihefeng Wang; Xiang Li

arxiv: 2606.12965 · v1 · pith:646WH2IWnew · submitted 2026-06-11 · 💻 cs.RO

Pith reviewed 2026-06-27 06:34 UTC · model grok-4.3

classification 💻 cs.RO

keywords: embodiment-agnostic policies, visuomotor policies, diffusion sampling, collision avoidance, cross-embodiment transfer, imitation learning, zero-shot deployment, joint-space guidance

The pith

EmbodiSteer lifts the diffusion sampling of embodiment-agnostic Cartesian visuomotor policies into the target robot's joint space and applies Jacobian-based, collision-aware guidance after each denoising step, enabling zero-shot deployment on new robot bodies with fewer collisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that Cartesian end-effector policies trained on diverse data can be deployed on specific robots by adding joint-space collision guidance during inference without retraining. This matters because end-effector abstraction ignores robot body constraints like collisions, limiting real-world use despite scalable learning. By lifting diffusion sampling to joint space via forward kinematics and applying guidance after each step, the method preserves learned behavior while avoiding collisions. Experiments show substantial reductions in collisions and gains in success rates on simulated and physical robots.
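The "denoise, then correct" pattern described above can be illustrated with a deliberately tiny toy. This is a sketch under stated assumptions, not the paper's sampler: the state is a single joint value, `toy_denoise_step` is a hypothetical stand-in for a reverse-diffusion step, the constraint `q <= q_max` stands in for collision avoidance, and the linear strength ramp is only one possible reading of "weak early, stronger late".

```python
import random

def toy_denoise_step(q, t, target, rng):
    """Hypothetical stand-in for one reverse-diffusion step: pull q toward
    the policy's target and add noise that vanishes as t reaches 0."""
    noise = 0.05 * t * rng.gauss(0.0, 1.0)
    return q + 0.5 * (target - q) + noise

def guide(q, q_max, strength):
    """Post-step correction: remove a fraction `strength` of any violation
    of the assumed constraint q <= q_max."""
    violation = max(0.0, q - q_max)
    return q - strength * violation

def sample(num_steps=10, target=1.0, q_max=0.8, seed=0):
    rng = random.Random(seed)
    q = rng.gauss(0.0, 1.0)  # start from noise
    for t in range(num_steps, 0, -1):
        q = toy_denoise_step(q, t - 1, target, rng)
        # Assumed schedule: weak guidance early, full strength at the last step.
        strength = 1.0 - (t - 1) / num_steps
        q = guide(q, q_max, strength)
    return q
```

Because the final step applies the correction at full strength with no added noise, the returned sample satisfies the toy constraint even though the unguided target (1.0) violates it.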

Core claim

With whole-body collision-aware guidance over joint trajectories after each denoising step, the arm can be steered away from collisions while preserving learned end-effector behavior from the embodiment-agnostic Cartesian policy.

What carries the argument

Jacobian-based updates that lift inference-time diffusion sampling into the target robot's joint space via forward kinematics and apply collision-aware corrections after each denoising step.
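To make the forward-kinematics and Jacobian machinery concrete, here is a minimal sketch on a planar 2-link arm. The link lengths, the function names, and the use of damped least squares are assumptions for illustration; the page does not specify which Jacobian-based update the paper uses.

```python
import math

L1, L2 = 0.4, 0.3  # hypothetical link lengths in metres

def fk(q):
    """End-effector position of a planar 2-link arm."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return (x, y)

def jacobian(q):
    """2x2 Jacobian d(x, y) / d(q0, q1)."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return ((-L1 * s1 - L2 * s12, -L2 * s12),
            (L1 * c1 + L2 * c12, L2 * c12))

def lift_cartesian_delta(q, dx, damping=1e-6):
    """Map a Cartesian increment dx to a joint increment with damped least
    squares: dq = J^T (J J^T + damping * I)^-1 dx."""
    J = jacobian(q)
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping
    det = a * d - b * b
    y0 = (d * dx[0] - b * dx[1]) / det
    y1 = (-b * dx[0] + a * dx[1]) / det
    return (J[0][0] * y0 + J[1][0] * y1, J[0][1] * y0 + J[1][1] * y1)

q = (0.3, 0.8)
dx = (0.01, -0.005)  # desired end-effector increment
dq = lift_cartesian_delta(q, dx)
p0 = fk(q)
p1 = fk((q[0] + dq[0], q[1] + dq[1]))
achieved = (p1[0] - p0[0], p1[1] - p0[1])
```

Here `achieved` matches `dx` up to the linearization error, which is the sense in which a joint-space update can reproduce a Cartesian policy's end-effector motion for small steps.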

If this is right

Collision rate drops by 46.1% with 28.5% higher task success across 9 simulated robots compared to Cartesian-only execution.
Physical robots see 90.0% collision rate reduction and 36.7% success rate increase in constrained scenarios.
Policy learning stays in Cartesian space while deployment becomes embodiment-aware at inference time.
Zero-shot cross-embodiment deployment becomes possible without retraining or embodiment-specific data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar guidance could extend to other constraints like joint velocity limits or torque bounds.
The approach might allow combining policies trained on body-free data with any new hardware without fine-tuning.
Testing on more complex tasks with dynamic obstacles could reveal limits of the post-denoising correction.

Load-bearing premise

Jacobian-based updates applied after each diffusion denoising step can steer joint trajectories away from collisions without materially distorting the distribution or end-effector behavior produced by the original Cartesian policy.

What would settle it

A test where applying the joint-space guidance causes the end-effector to miss the target by more than the original policy's error margin, or where collision rates remain unchanged despite the guidance.
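One way to operationalize this check is sketched below. It uses the deviation between guided and unguided end-effector trajectories as a proxy for "missing the target"; the function names, the margin, and the per-episode collision flags are all hypothetical and not taken from the paper.

```python
import math

def max_ee_deviation(ee_ref, ee_guided):
    """Largest pointwise distance between two end-effector trajectories."""
    return max(math.dist(a, b) for a, b in zip(ee_ref, ee_guided))

def collision_rate(flags):
    """Fraction of episodes flagged as colliding (flags are 0/1 or bool)."""
    return sum(flags) / len(flags)

def premise_fails(ee_ref, ee_guided, margin, coll_base, coll_guided):
    """True if guidance distorts the end-effector path beyond `margin`
    or brings no reduction in collision rate."""
    distorted = max_ee_deviation(ee_ref, ee_guided) > margin
    no_gain = collision_rate(coll_guided) >= collision_rate(coll_base)
    return distorted or no_gain
```

For a fair comparison, both trajectories would need to come from the same observations and the same sampling seed, so that any deviation is attributable to the guidance rather than to sampling noise.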

Figures

Figures reproduced from arXiv: 2606.12965 by Kangchen Lv, Mingrui Yu, Shihefeng Wang, Xiang Li.

**Figure 1.** We present EmbodiSteer, an inference-time steering framework for embodiment-aware deployment of embodiment-agnostic visuomotor policies. Given a trained Cartesian policy (left), EmbodiSteer lifts the sampling process into the target robot's joint space and incorporates robot embodiment and obstacle guidance during denoising (middle), enabling zero-shot whole-body collision-aware execution across diverse r…

**Figure 2.** Overview of EmbodiSteer. (a) A trained Cartesian policy is lifted into the target-robot joint space for inference-time sampling. (b) CBF-inspired guidance steers sampled joint trajectories away from whole-body collisions while preserving end-effector behavior. (c) Visualization of real-world deployment across diverse robot embodiments on constrained manipulation tasks.

… relative translation and 6D rotation …

**Figure 3.** We evaluate EmbodiSteer on three manipulation tasks requiring both task completion and whole-body obstacle avoidance. Cartesian policies are trained from obstacle-free floating-gripper demonstrations and deployed zero-shot on 9 robot embodiments with test-time obstacles.

… into (5) yields the single-constraint QP: $\min_{\Delta q_i} \tfrac{1}{2}\,\Delta q_{t-1,i}^{\top} H$ E…

**Figure 4.** Per-embodiment comparison on the PLACETOAST task, where the task success rate (left) and collision rate (right) across five representative robot embodiments are reported.

… showing that sampling in joint space maintains the learned end-effector behavior. With obstacles, however, Cartesian execution drops to 35.7% success and 0.614 reward, with 57.6% collision rate, revealing the limitation of end-effector-on…

**Figure 5.** Ablation for the constraint strength on TURNONFAUCET.

Joint w/ CG exhibits large task-dependent variation. It performs competitively on PLACETOAST, where coarse end-effector motion is often sufficient, but struggles on TURNONFAUCET and MAKECOFFEE, which require precise interactions. Although collision-cost gradients can push the arm away from obstacles, they can also significantly distort the end-effecto…

**Figure 6.** Real-world deployment of an arm-agnostic Cartesian policy on UR5 and Panda. Without …

**Figure 7.** Whole-body SDF representation. The left image shows the real scene. The right image …

**Figure 8.** Scheduled constraint strength $\gamma_t$ over reverse denoising steps. The guidance is weak in early noisy steps and approaches the base scale $\gamma$ in later steps.

… $\mu\, a^{\top} H^{-1} a = \bar{b}$ and $\mu = \bar{b} / (a^{\top} H^{-1} a)$. The resulting solution is

$$
\Delta q^{\star} =
\begin{cases}
0, & b \le 0, \\
\dfrac{b}{a^{\top} H^{-1} a + \varepsilon}\, H^{-1} a, & b > 0,
\end{cases}
\tag{18}
$$

where $\varepsilon$ is a small numerical stabilizer. The case $b \le 0$ corresponds to a locally satisfied safety constraint, for which the unconstrained m…
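The closed-form solution (18) quoted with Figure 8 can be sketched in a few lines. The following is an illustration under assumptions that the excerpt does not confirm: the constraint is taken to be $a^{\top} \Delta q \ge b$, the weighting matrix $H$ is taken to be diagonal and positive, and the distinction between $b$ and $\bar{b}$ is ignored. The function name `cbf_qp_step` is hypothetical.

```python
def cbf_qp_step(a, b, h_diag, eps=1e-8):
    """Minimize 0.5 * dq^T H dq subject to a^T dq >= b, with H = diag(h_diag).

    Returns 0 when the constraint is already satisfied (b <= 0), otherwise
    dq = b / (a^T H^-1 a + eps) * H^-1 a, following the form of Eq. (18).
    """
    if b <= 0.0:
        return [0.0] * len(a)
    hinv_a = [ai / hi for ai, hi in zip(a, h_diag)]
    denom = sum(ai * v for ai, v in zip(a, hinv_a)) + eps
    scale = b / denom
    return [scale * v for v in hinv_a]

# Example: a three-joint correction with identity weighting.
a = [1.0, 2.0, 0.0]
dq = cbf_qp_step(a, 0.5, [1.0, 1.0, 1.0])
```

With an active constraint the correction lands on the constraint boundary up to the stabilizer, so `sum(ai * di for ai, di in zip(a, dq))` is approximately `b`. Joints with a zero entry in `a` are left untouched, which is why a single-constraint QP of this kind is cheap enough to apply after every denoising step.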

**Figure 9.** Runtime breakdown for one guided joint-space inference call on an RTX 4070 Ti SUPER.

**Figure 10.** CBF-QP guidance-strength sensitivity analysis. The default …

**Figure 11.** Cost-gradient guidance-strength sensitivity analysis. Solid curves show the CG baseline …

**Figure 12.** Additional simulation qualitative results. The base Cartesian policy can collide at the …

**Figure 13.** Real-world task protocols. Each row shows representative stages of one UMI-trained …

**Figure 14.** Real-world obstacle layouts. The obstacles are highlighted with yellow masks. The …

**Figure 15.** Representative failure modes. Guidance may move the robot into out-of-distribution …

**Figure 16.** Real-world qualitative results. From top to bottom: M…

Original abstract

Scalable robot imitation learning relies on large-scale heterogeneous data from diverse robots or body-free data, making Cartesian end-effector actions a key interface for embodiment-agnostic policy learning. However, end-effector-only abstraction leaves Cartesian policies unaware of the deployed robot body, making them brittle under robot-specific constraints such as whole-body collision avoidance. To overcome this limitation, we present EmbodiSteer, a training-free framework that steers embodiment-agnostic visuomotor policies toward zero-shot, embodiment-aware deployment. EmbodiSteer keeps policy learning in Cartesian space while efficiently lifting inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates. With whole-body collision-aware guidance over joint trajectories after each denoising step, the arm can be steered away from collisions while preserving learned end-effector behavior. Compared with Cartesian-only execution, EmbodiSteer reduces collision rate by 46.1% and improves task success rate by 28.5% across 9 simulated robots, and further achieves 90.0% collision rate reduction and 36.7% success rate increase on two physical robots in highly constrained scenarios. Our project page is at https://frankwang67.github.io/EmbodiSteer-Page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EmbodiSteer adds post-denoising Jacobian guidance in joint space to Cartesian diffusion policies for collision avoidance, but the claim that end-effector behavior stays materially unchanged lacks direct support.

The main takeaway is that this paper gives a training-free way to steer embodiment-agnostic Cartesian policies into safe joint trajectories at inference time by lifting diffusion samples via forward kinematics and applying whole-body collision guidance after each denoising step.

What is new is the specific combination of that lifting with Jacobian-based corrections applied directly to the joint trajectories for zero-shot transfer. The approach keeps all policy training in Cartesian space, which supports scaling across heterogeneous data, then handles robot-specific constraints only at deployment. The reported numbers show clear practical effect: 46.1% collision reduction and 28.5% success improvement across nine simulated robots, with even larger gains on two physical platforms in tight workspaces.

The soft spots sit in the evaluation and the preservation assumption. The abstract states the gains but supplies no protocol, no baseline descriptions, no variance numbers, and no implementation specifics for the guidance strength or update frequency. More critically, the central claim that the corrections steer joints away from collisions while leaving learned end-effector behavior intact rests on an unexamined assumption that the Jacobian steps do not accumulate deviation in Cartesian space or shift the sample distribution. Success rates alone do not confirm non-distortion; the stress-test concern about the updates failing to commute with the denoising process is worth checking against the full experiments.

The work is aimed at robotics researchers who train on mixed-embodiment data and then need to deploy on specific hardware without retraining. A reader focused on imitation learning deployment would get value from the idea and the reported numbers even if the preservation checks need strengthening.

I would send this to peer review because the deployment problem is real and the proposed mechanism is concrete, provided the full paper supplies the missing experimental details and direct tests of trajectory fidelity.

Referee Report

2 major / 0 minor

Summary. The paper presents EmbodiSteer, a training-free framework that steers embodiment-agnostic Cartesian visuomotor policies (learned via imitation on heterogeneous data) into joint space at inference time. It lifts diffusion denoising steps via forward kinematics and applies Jacobian-based whole-body collision-aware guidance after each step to avoid collisions while claiming to preserve the original end-effector behavior, reporting 46.1% collision reduction and 28.5% success improvement across 9 simulated robots plus 90.0% collision reduction and 36.7% success increase on two physical robots in constrained scenarios.

Significance. If the central claim holds with verifiable preservation of the learned Cartesian distribution, the approach would be significant for enabling zero-shot cross-embodiment deployment of policies without retraining or embodiment-specific data, addressing a practical bottleneck in scalable robot learning. The training-free design and reported gains on both simulation and hardware are strengths that could influence deployment practices if the non-distortion property is rigorously established.

major comments (2)

[Abstract] Abstract (paragraph describing the framework): The central claim that Jacobian-based updates after each denoising step steer joint trajectories away from collisions 'while preserving learned end-effector behavior' rests on an unexamined assumption that these post-denoising corrections commute with the diffusion process and keep samples on the original Cartesian manifold; no preservation guarantees, error bounds, or analysis of accumulated Cartesian deviations are provided, which is load-bearing for the claim that end-effector behavior remains materially unchanged.
[Abstract] Abstract: The reported performance gains (46.1% collision reduction, 28.5% success increase on 9 sim robots; 90.0% and 36.7% on physical) cannot be evaluated because the abstract supplies no experimental protocol, baseline comparisons, number of trials, variance measures, or implementation specifics for the guidance term (e.g., how the collision cost is formulated or its weighting relative to the denoising process), undermining the data-to-claim link.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below.

Referee: [Abstract] Abstract (paragraph describing the framework): The central claim that Jacobian-based updates after each denoising step steer joint trajectories away from collisions 'while preserving learned end-effector behavior' rests on an unexamined assumption that these post-denoising corrections commute with the diffusion process and keep samples on the original Cartesian manifold; no preservation guarantees, error bounds, or analysis of accumulated Cartesian deviations are provided, which is load-bearing for the claim that end-effector behavior remains materially unchanged.

Authors: We agree that the abstract does not supply formal preservation guarantees, error bounds, or accumulated deviation analysis. The design applies small Jacobian-based corrections after each denoising step so that the diffusion process can continue from a feasible joint configuration; because the pseudo-inverse Jacobian maps the correction primarily into redundant degrees of freedom, end-effector deviation remains local. We will revise the abstract to qualify the claim as approximate preservation and add a short empirical deviation analysis plus discussion of the approximation in the methods section of the revised manuscript. revision: yes
Referee: [Abstract] Abstract: The reported performance gains (46.1% collision reduction, 28.5% success increase on 9 sim robots; 90.0% and 36.7% on physical) cannot be evaluated because the abstract supplies no experimental protocol, baseline comparisons, number of trials, variance measures, or implementation specifics for the guidance term (e.g., how the collision cost is formulated or its weighting relative to the denoising process), undermining the data-to-claim link.

Authors: Abstract length constraints prevent inclusion of full protocol details. The complete experimental protocol (trial counts, variance reporting, baselines, collision-cost formulation, and guidance weighting) appears in Sections 4–6 and the supplement. We will revise the abstract to include a concise statement of the evaluation scale and key implementation parameters. revision: yes
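The first response above appeals to corrections falling mainly in redundant degrees of freedom. Strictly, that property comes from the null-space projector $I - J^{+}J$ built from the pseudo-inverse, not from the pseudo-inverse alone. The sketch below shows the projector on a planar 3-link arm, which is redundant for a 2D position task. The link lengths and function names are assumptions, and nothing on this page confirms that the paper uses this projector.

```python
import math

LINKS = (0.3, 0.25, 0.2)  # hypothetical link lengths in metres

def jacobian(q):
    """2x3 position Jacobian of a planar 3-link arm."""
    angles = [sum(q[:k + 1]) for k in range(3)]
    J = [[0.0] * 3, [0.0] * 3]
    for k in range(3):
        J[0][k] = -sum(LINKS[i] * math.sin(angles[i]) for i in range(k, 3))
        J[1][k] = sum(LINKS[i] * math.cos(angles[i]) for i in range(k, 3))
    return J

def null_space_project(J, dq):
    """Return (I - J^+ J) dq with J^+ = J^T (J J^T)^-1 (J assumed full rank)."""
    v = [sum(J[r][k] * dq[k] for k in range(3)) for r in range(2)]  # J dq
    a = sum(x * x for x in J[0])
    b = sum(x * y for x, y in zip(J[0], J[1]))
    d = sum(y * y for y in J[1])
    det = a * d - b * b
    w = [(d * v[0] - b * v[1]) / det, (-b * v[0] + a * v[1]) / det]
    return [dq[k] - (J[0][k] * w[0] + J[1][k] * w[1]) for k in range(3)]

q = (0.2, 0.5, -0.4)
J = jacobian(q)
dq_avoid = [0.1, 0.0, 0.0]  # hypothetical collision-avoidance push
dq_null = null_space_project(J, dq_avoid)
ee_motion = [sum(J[r][k] * dq_null[k] for k in range(3)) for r in range(2)]
```

To first order, `ee_motion` is zero: the projected correction reshapes the arm without moving the end-effector. For finite steps some Cartesian drift remains, which is the accumulated deviation the referee asks the authors to quantify.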

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The provided abstract and framework description introduce EmbodiSteer as a training-free method that applies Jacobian-based updates after diffusion denoising steps for collision avoidance. No equations, predictions, or claims in the text reduce reported gains (e.g., collision rate reduction) to quantities defined by fitted parameters or self-referential definitions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via citation are present. The central claim rests on the proposed guidance mechanism itself rather than reducing to its inputs by construction. This is the expected honest non-finding for a methods paper whose performance metrics are externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method invokes standard forward kinematics and Jacobian pseudoinverse operations that are assumed from prior robotics literature.

[1] [1]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[2] [2]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations.arXiv preprint arXiv:2403.03954, 2024

Pith/arXiv arXiv 2024

[3] [3]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[4] [4]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[5] [5]

Zheng, J

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, et al. X- vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

Pith/arXiv arXiv 2025

[6] S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations, volume 2025, pages 29982–30009, 2025.

[7] Y. Wang, S. Zheng, H. Luo, W. Zhang, H. Yuan, C. Xu, H. Xu, Y. Feng, M. Yu, Z. Kang, et al. Rethinking visual-language-action model scaling: Alignment, mixture, and regularization. arXiv preprint arXiv:2602.09722, 2026.

[8] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024.

[9] E. Bauer, E. Nava, and R. K. Katzschmann. Latent action diffusion for cross-embodiment manipulation. arXiv preprint arXiv:2506.14608, 2025.

[10] Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.

[11] L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. Advances in neural information processing systems, 37:124420–124450, 2024.

[12] M. Xu, Z. Xu, C. Chi, M. Veloso, and S. Song. Xskill: Cross embodiment skill discovery. In Conference on robot learning, pages 3536–3555. PMLR, 2023.

[13] L. Zha, A. J. Hancock, M. Zhang, T. Yin, Y. Huang, D. Shah, A. Z. Ren, and A. Majumdar. Lap: Language-action pre-training enables zero-shot cross-embodiment transfer. arXiv preprint arXiv:2602.10556, 2026.

[14] J. Zheng, J. Li, D. Liu, Y. Zheng, Z. Wang, Z. Ou, Y. Liu, J. Liu, Y.-Q. Zhang, and X. Zhan. Universal actions for enhanced embodied foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025.

[15] A. Patel and S. Song. Get-zero: Graph embodiment transformer for zero-shot embodiment generalization. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 14262–14269. IEEE, 2025.

[16] A. Gupta, L. Fan, S. Ganguli, and L. Fei-Fei. Metamorph: Learning universal controllers with transformers. arXiv preprint arXiv:2203.11931, 2022.

[17] M. Xu, H. Zhang, Y. Hou, Z. Xu, L. Fan, M. Veloso, and S. Song. Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation. arXiv preprint arXiv:2505.21864, 2025.

[18] H. Ha, Y. Gao, Z. Fu, J. Tan, and S. Song. Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. arXiv preprint arXiv:2407.10353, 2024.

[19] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.

[20] C. Sferrazza, D.-M. Huang, F. Liu, J. Lee, and P. Abbeel. Body transformer: Leveraging robot embodiment for policy learning. arXiv preprint arXiv:2408.06316, 2024.

[21] N. Bohlinger, G. Czechmanowski, M. Krupka, P. Kicki, K. Walas, J. Peters, and D. Tateo. One policy to run them all: an end-to-end learning approach to multi-embodiment locomotion. arXiv preprint arXiv:2409.06366, 2024.

[22] H. Luo, W. Zhang, Y. Feng, S. Zheng, H. Xu, C. Xu, Z. Xi, Y. Fu, and Z. Lu. Being-h0.7: A latent world-action model from egocentric videos. arXiv preprint arXiv:2605.00078, 2026.

[23] R. Yang, Q. Yu, Y. Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y. Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.

[24] R. Zheng, D. Niu, Y. Xie, J. Wang, M. Xu, Y. Jiang, F. Castañeda, F. Hu, Y. L. Tan, L. Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026.

[25] A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. arXiv preprint arXiv:2506.15799, 2025.

[26] Y. Liu, J. Hamid, A. Xie, Y. Lee, M. Du, and C. Finn. Bidirectional decoding: Improving action chunking via guided test-time sampling. In International Conference on Learning Representations, volume 2025, pages 4594–4627, 2025.

[27] R. Römer, J. Balletshofer, J. Thumm, M. Pavone, A. P. Schoellig, and M. Althoff. From demonstrations to safe deployment: Path-consistent safety filtering for diffusion policies. arXiv preprint arXiv:2511.06385, 2025.

[28] J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[29] M. Reuss, M. Li, X. Jia, and R. Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023.

[30] P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.

[31] K. M. Lee, S. Ye, Q. Xiao, Z. Wu, Z. Zaidi, D. B. D’Ambrosio, P. R. Sanketi, and M. C. Gombolay. Learning diverse robot striking motions with diffusion models and kinematically constrained gradient guidance. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 12017–12024, 2025. doi:10.1109/ICRA55743.2025.11127310.

[32] W. Xiao, T.-H. Wang, C. Gan, R. Hasani, M. Lechner, and D. Rus. Safediffuser: Safe planning with diffusion probabilistic models. In International Conference on Learning Representations, 2025.

[33] J. Zhang, L. Zhao, A. Papachristodoulou, and J. Umenberger. Constrained diffusers for safe planning and control. Advances in Neural Information Processing Systems, 38:34965–34998, 2026.

[34] Y. Zhong, Q. Jiang, J. Yu, and Y. Ma. Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22584–22594, 2025.

[35] Z. Weng, H. Lu, D. Kragic, and J. Lundell. Dexdiffuser: Generating dexterous grasps with diffusion models. IEEE Robotics and Automation Letters, 9(12):11834–11840, 2024.

[36] Y. Jia, Y. Jiang, K. Lv, Y. Ren, and X. Li. Arm-aware guided dexterous grasp generation with arm-agnostic grasp models. IEEE Robotics and Automation Letters, 11(5):5875–5882, 2026. doi:10.1109/LRA.2026.3674025.

[37] M. Du and S. Song. Dynaguide: Steering diffusion policies with active dynamic guidance. In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025.

[38] Y. Wang, L. Wang, Y. Du, B. Sundaralingam, X. Yang, Y.-W. Chao, C. Pérez-D’Arpino, D. Fox, and J. Shah. Inference-time policy steering through human interactions. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15626–15633. IEEE, 2025.

[39] H. Gupta, X. Guo, H. Ha, C. Pan, M. Cao, D. Lee, S. Scherer, S. Song, and G. Shi. Umi-on-air: Embodiment-aware guidance for embodiment-agnostic visuomotor policies. In 2026 IEEE International Conference on Robotics and Automation (ICRA), 2026. URL https://arxiv.org/abs/2510.02614.

[40] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada. Control barrier functions: Theory and applications. In 2019 18th European control conference (ECC), pages 3420–3431. IEEE, 2019.

[41] S. Hu, Z. Liu, S. Liu, J. Cen, Z. Meng, and X. He. Vlsa: Vision-language-action models with plug-and-play safety constraint layer. arXiv preprint arXiv:2512.11891, 2025.

[42] L. Brunke, Y. Zhang, R. Römer, J. Naimer, N. Staykov, S. Zhou, and A. P. Schoellig. Semantically safe robot manipulation: From semantic scene understanding to motion safeguards. IEEE Robotics and Automation Letters, 2025.

[43] K. P. Wabersich and M. N. Zeilinger. A predictive safety filter for learning-based control of constrained nonlinear dynamical systems. Automatica, 129:109597, 2021.

[44] S. Gros, M. Zanon, and A. Bemporad. Safe reinforcement learning via projection on a safe set: How to achieve optimality? IFAC-PapersOnLine, 53(2):8076–8081, 2020.

[45] X. Zhai, B. Ou, Y. Wang, H. Y. Leong, Q. Yu, C. Hao, and Y. Liu. Cofreevla: Collision-free dual-arm manipulation via vision-language-action model and risk estimation. arXiv preprint arXiv:2601.21712, 2026.

[46] H. Li, Q. Feng, Z. Zheng, J. Feng, Z. Chen, and A. Knoll. Language-guided object-centric diffusion policy for generalizable and collision-aware manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 12834–12841. IEEE, 2025.

[47] H. Deng, W. Guo, Q. Wang, Z. Wu, and Z. Wang. Safebimanual: Diffusion-based trajectory optimization for safe bimanual manipulation. arXiv preprint arXiv:2508.18268, 2025.

[48] A. Dastider, H. Fang, and M. Lin. Apex: Ambidextrous dual-arm robotic manipulation using collision-free generative diffusion models. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9526–9533. IEEE, 2024.

[49] K. Lv, M. Yu, Y. Jia, C. Zhang, and X. Li. Kinematics-aware diffusion policy with consistent 3d observation and action space for whole-arm robotic manipulation. IEEE Robotics and Automation Letters, 2026. doi:10.1109/LRA.2026.3685437.

[50] Q. Lv, H. Li, X. Deng, R. Shao, Y. Li, J. Hao, L. Gao, M. Y. Wang, and L. Nie. Spatial-temporal graph diffusion policy with kinematic modeling for bimanual robotic manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17394–17404, 2025.

[51] K. Chen, Z. Bi, G. Zhao, C. Zheng, Y. Li, H. Zhao, and J. Ma. Samp: Spatial anchor-based motion policy for collision-aware robotic manipulators. arXiv preprint arXiv:2509.11185, 2025.

[52] A. Fishman, A. Walsman, M. Bhardwaj, W. Yuan, B. Sundaralingam, B. Boots, and D. Fox. Avoid everything: Model-free collision avoidance with expert-guided fine-tuning. In CoRL Workshop on Safe and Robust Robot Learning for Operation in the Real World, 2024.

[53] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.

[54] H. Ma, S. Bodmer, A. Carron, M. Zeilinger, and M. Muehlebach. Constraint-aware diffusion guidance for robotics: Real-time obstacle avoidance for autonomous racing. arXiv preprint arXiv:2505.13131, 2025.

[55] S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T. kai Chan, Y. Gao, X. Li, T. Mu, N. Xiao, A. Gurha, V. N. Rajesh, Y. W. Choi, Y.-R. Chen, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. Robotics: Science and Systems,... 2025