Pith · machine review for the scientific record

arxiv: 2604.08508 · v2 · submitted 2026-04-09 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:32 UTC · model grok-4.3

classification 💻 cs.RO
keywords loco-manipulation · whole-body control · legged robots · test-time planning · sample-based planner · sim-to-real · dynamic manipulation · quadruped

The pith

Steering a pre-trained whole-body control policy with a sample-based planner at test time lets legged robots manipulate large unseen objects dynamically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that legged robots can handle dynamic loco-manipulation of heavy and oversized objects by using a sample-based planner to steer a pre-trained whole-body control policy during operation. This combination generalizes across varied objects and tasks with no retraining or retuning required, as verified on a real quadruped performing tasks like uprighting a tire heavier than the robot and dragging a large barrier. The same steering approach extends to humanoid robots for actions such as door opening and table pushing in simulation. The planner supports further adaptation through on-the-fly changes to its cost function. A sympathetic reader would care because the method separates policy training from task-specific execution, potentially making versatile whole-body behaviors practical without exhaustive per-task learning.

Core claim

By performing test-time steering of a pre-trained whole-body control policy with a sample-based planner, legged robots can solve a variety of dynamic loco-manipulation tasks. The approach generalizes to a diverse set of objects and tasks with no additional tuning or training and can be further enhanced by flexibly adjusting the cost function at test time. Real-world demonstrations on a quadruped include uprighting a tire heavier than the robot's nominal lifting capacity and dragging a crowd-control barrier larger and taller than the robot itself, while the method also applies to humanoid loco-manipulation tasks such as opening a door and pushing a table in simulation.

What carries the argument

Test-time steering of a pre-trained whole-body control policy by a sample-based planner, which generates action sequences to guide the policy toward successful contact-rich loco-manipulation outcomes.
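The steering loop can be sketched in miniature. This is a hedged illustration, not the paper's implementation: `pretrained_policy`, the point-mass state, and all constants are hypothetical stand-ins, with a predictive-sampling-style planner choosing among randomly perturbed command sequences and executing only the first command of the best one.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_policy(state, command):
    # Stand-in for the frozen whole-body policy: moves the state
    # toward the commanded target at a bounded per-step rate.
    return np.clip(command - state, -0.1, 0.1)

def rollout_cost(state, commands, goal):
    # Roll the policy forward under a candidate command sequence
    # and accumulate a simple squared distance-to-goal cost.
    cost = 0.0
    for cmd in commands:
        state = state + pretrained_policy(state, cmd)
        cost += np.sum((state - goal) ** 2)
    return cost

def steer(state, goal, horizon=5, n_samples=64, noise=0.3):
    # Predictive sampling: perturb a nominal command sequence,
    # keep the lowest-cost candidate, return its first command.
    nominal = np.tile(goal, (horizon, 1))
    candidates = nominal + noise * rng.standard_normal((n_samples, horizon, goal.size))
    costs = [rollout_cost(state, c, goal) for c in candidates]
    return candidates[int(np.argmin(costs))][0]

state, goal = np.zeros(2), np.array([1.0, -0.5])
for _ in range(30):  # receding-horizon loop: replan, execute, repeat
    state = state + pretrained_policy(state, steer(state, goal))
print(np.round(state, 2))
```

The policy itself is never retrained; only the planner's sampling around the goal changes what the policy is asked to do.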

If this is right

  • Robots can manipulate objects that exceed their nominal lifting or pushing capacity.
  • The same pre-trained policy works on new objects and tasks without retraining.
  • Cost-function adjustments at runtime provide task-specific flexibility.
  • The sim-to-real pipeline succeeds for contact-rich dynamic behaviors.
  • The steering technique applies across both quadrupedal and humanoid platforms.
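The runtime cost-adjustment point can be made concrete. A minimal sketch, assuming the planner cost is a weighted sum of task terms; the term names, weights, and 3-D state are hypothetical, not the paper's actual cost functions:

```python
import numpy as np

def make_cost(weights, target):
    # Hypothetical weighted planner cost: switching tasks only
    # changes these weights and targets, never the policy.
    def cost(state):
        terms = {
            "goal": np.sum((state - target) ** 2),  # reach a target pose
            "upright": (1.0 - state[-1]) ** 2,      # stand-in orientation term
        }
        return sum(weights[k] * terms[k] for k in terms)
    return cost

# "Drag the barrier": emphasize position, ignore orientation.
drag = make_cost({"goal": 1.0, "upright": 0.0}, np.array([2.0, 0.0, 0.0]))
# "Upright the tire": re-weight at test time, same machinery.
upright = make_cost({"goal": 0.1, "upright": 5.0}, np.array([0.0, 0.0, 1.0]))

s = np.array([0.0, 0.0, 0.0])
print(drag(s), upright(s))
```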

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separating learned policy skills from runtime planning could lower the data volume needed to achieve broad robot competence.
  • The same pre-trained policy might support a wider range of robot body types if the planner accounts for kinematic differences.
  • Integrating real-time perception with the planner could enable fully autonomous operation on novel objects in unstructured settings.
  • Similar test-time steering might compose basic skills into longer manipulation sequences without additional policy training.

Load-bearing premise

The pre-trained whole-body policy already encodes sufficient dynamics and contact behaviors so that test-time planning can reliably steer it to success on unseen objects without model mismatch or instability in real-world execution.

What would settle it

Apply the steered policy to a new object with substantially different mass distribution, geometry, or surface friction in a real-world trial and observe whether the robot completes the loco-manipulation task or instead exhibits instability or failure to make progress.
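That settling experiment amounts to a success-rate measurement under controlled object variation. A toy sketch of the protocol follows, where `trial` is a deliberately crude stand-in (success iff a hypothetical force budget exceeds the friction force) rather than any real robot or contact model:

```python
import numpy as np

rng = np.random.default_rng(1)
G, FORCE_CAP = 9.81, 150.0  # hypothetical push-force budget (N)

def trial(mass, friction):
    # Crude proxy for one hardware trial: the manipulation succeeds
    # only if the force budget exceeds the friction force on the object.
    return FORCE_CAP > friction * mass * G

def ood_sweep(n=1000):
    # Sample objects with substantially varied mass and friction,
    # as the settling experiment proposes, and report the success rate.
    masses = rng.uniform(5.0, 25.0, n)
    frictions = rng.uniform(0.2, 1.2, n)
    return float(np.mean([trial(m, f) for m, f in zip(masses, frictions)]))

rate = ood_sweep()
print(f"success rate: {rate:.2f}")
```

A rate well below 1.0 over such a sweep would localize the failure boundary the review asks about, rather than a single pass/fail anecdote.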

Figures

Figures reproduced from arXiv: 2604.08508 by Brandon Hung, Chao Cao, Dmitry Yershov, Duy Ta, Farzad Niroui, Jan Brüdigam, Jiuguang Wang, John Z. Zhang, Leonor Fermoselle, Maks Sorokin, Preston Culbertson, Simon Le Cléac'h, Stephen Phillips, Tao Pang, Tong Zhao, Xinghao Zhu, Zachary Manchester.

Figure 1: First row from left to right: Spot robot uprights a tire, crowd-control barrier, traffic cone, and chair. Second row …
Figure 2: System overview: Our method takes a hierarchical …
Figure 3: Illustrations comparing (a) standard dynamics rollouts …
Figure 4: Comparing Sumo (ours, yellow) to end-to-end RL …
Figure 5: Top: comparison of Sumo (yellow, ours), E2E RL …
Figure 6: Maximum success rate vs. compute time comparison …
Figure 7: Freeze-frame sequences showing Spot robot task progressions: (a) uprighting a tire, (b) uprighting a crowd control …
Figure 8: Freeze-frame sequences showing G1 humanoid task …
Original abstract

This paper presents a sim-to-real approach that enables legged robots to dynamically manipulate large and heavy objects with whole-body dexterity. Our key insight is that by performing test-time steering of a pre-trained whole-body control policy with a sample-based planner, we can enable these robots to solve a variety of dynamic loco-manipulation tasks. Interestingly, we find our method generalizes to a diverse set of objects and tasks with no additional tuning or training, and can be further enhanced by flexibly adjusting the cost function at test time. We demonstrate the capabilities of our approach through a variety of challenging loco-manipulation tasks on a Spot quadruped robot in the real world, including uprighting a tire heavier than the robot's nominal lifting capacity and dragging a crowd-control barrier larger and taller than the robot itself. Additionally, we show that the same approach can be generalized to humanoid loco-manipulation tasks, such as opening a door and pushing a table, in simulation. Project code and videos are available at https://sumo.rai-inst.com/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents SUMO, a sim-to-real approach for dynamic whole-body loco-manipulation on legged robots. The core idea is to steer a pre-trained whole-body control policy at test time using a sample-based planner, enabling tasks such as uprighting heavy tires and dragging large barriers on a Spot quadruped in the real world. The authors claim that this method generalizes to diverse objects and tasks without additional training or tuning, and can be enhanced by adjusting the cost function at test time. They also demonstrate extension to humanoid robots for tasks like door opening and table pushing in simulation.

Significance. If the generalization without tuning holds, this approach would be significant for robotics by allowing pre-trained policies to handle a range of loco-manipulation tasks through online planning rather than retraining. The real-world demonstrations on challenging physical tasks provide direct evidence of practical applicability, and the flexibility in cost function adjustment is a notable feature. The provision of project code and videos is a strength for reproducibility.

major comments (2)
  1. [Abstract] The generalization claim ('generalizes to a diverse set of objects and tasks with no additional tuning or training') is not accompanied by quantitative metrics, success rates over multiple trials, ablation studies on planner parameters, or analysis of failure cases. This makes it difficult to assess the reliability of the test-time steering for out-of-distribution objects beyond the two qualitative real-world examples.
  2. [Method/Experiments] The central claim requires that the pre-trained policy has internalized sufficiently accurate contact/friction models so that sample-based rollouts can steer it to success on unseen objects (e.g., a tire exceeding nominal lift capacity) without instability. No sensitivity analysis, model-mismatch experiments, or OOD testing under variations in mass distribution or surface properties is reported, leaving the no-tuning generalization dependent on an unverified transfer property.
minor comments (2)
  1. [Abstract] The abstract refers to 'a variety of challenging loco-manipulation tasks' but provides details on only two real-world examples; listing the full set of evaluated tasks with brief outcomes would improve clarity.
  2. Ensure that all statements about adjustable cost functions at test time include explicit references to the corresponding planner implementation details and any associated hyperparameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's significance and reproducibility. We address each major comment point-by-point below, providing clarifications and indicating where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The generalization claim ('generalizes to a diverse set of objects and tasks with no additional tuning or training') is not accompanied by quantitative metrics, success rates over multiple trials, ablation studies on planner parameters, or analysis of failure cases. This makes it difficult to assess the reliability of the test-time steering for out-of-distribution objects beyond the two qualitative real-world examples.

    Authors: We agree that quantitative metrics would provide stronger support for the generalization claim in the abstract. The manuscript presents qualitative real-world demonstrations on two challenging tasks (uprighting a heavy tire and dragging a large barrier) plus simulated humanoid extensions to illustrate flexibility without retraining. In the revised manuscript, we have added success rates over multiple trials for the primary real-world tasks, an analysis of observed failure cases, and ablations on planner parameters (number of samples and planning horizon) in the supplementary material. The abstract has been updated to reference these additions. revision: yes

  2. Referee: [Method/Experiments] The central claim requires that the pre-trained policy has internalized sufficiently accurate contact/friction models so that sample-based rollouts can steer it to success on unseen objects (e.g., a tire exceeding nominal lift capacity) without instability. No sensitivity analysis, model-mismatch experiments, or OOD testing under variations in mass distribution or surface properties is reported, leaving the no-tuning generalization dependent on an unverified transfer property.

    Authors: The referee correctly identifies that the approach depends on the pre-trained policy having learned sufficiently accurate contact and friction behaviors from simulation. The policy is trained in a physics-based simulator that models these dynamics, and the sample-based planner performs rollouts using the identical simulator and policy. The successful sim-to-real transfer on objects exceeding nominal capacities provides supporting evidence, but we acknowledge the absence of explicit sensitivity analyses or model-mismatch experiments in the original submission. The revised manuscript includes an expanded discussion section that explains the simulator's contact model assumptions and the boundaries of the observed generalization. Comprehensive OOD testing with controlled variations in mass distribution and surface properties was not performed and would require additional hardware setups. revision: partial
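The ablations promised in the responses (number of samples and planning horizon) have a standard harness shape. A sketch with a toy quality metric standing in for task success; the metric only models the expected trend (more samples help, very long horizons add rollout noise) and reports no result from the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def plan_quality(n_samples, horizon, n_trials=200):
    # Toy stand-in for the planner: the best of n_samples random
    # candidate scores improves with sample count, while a longer
    # horizon adds a rollout-noise penalty. Hypothetical shape only.
    scores = []
    for _ in range(n_trials):
        candidates = rng.normal(0.0, 1.0, n_samples) - 0.05 * horizon * rng.random()
        scores.append(candidates.max())
    return float(np.mean(scores))

# Sweep the two planner parameters the rebuttal names.
grid = {(n, h): plan_quality(n, h)
        for n, h in itertools.product([8, 32, 128], [5, 20])}
for (n, h), q in sorted(grid.items()):
    print(f"samples={n:4d} horizon={h:2d} quality={q:.2f}")
```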

Circularity Check

0 steps flagged

No significant circularity; empirical method with external real-world validation

Full rationale

The paper presents an empirical sim-to-real method: a pre-trained whole-body policy is steered at test time by a sample-based planner, with optional cost-function adjustment. Generalization to unseen objects and tasks is asserted via real-world demonstrations on a Spot robot (e.g., tire uprighting, barrier dragging) and simulated humanoid tasks, without any reported equations, fitted parameters, or derivations that reduce the claimed outcomes to quantities defined by the same evaluation data. No self-citation chains, self-definitional loops, or renamed known results appear in the provided text; the pre-training step is treated as an external input whose dynamics are assumed sufficient but are not derived within the paper itself. This is the common case of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that a single pre-trained policy plus planner can handle contact-rich dynamics for unseen objects; no new physical entities are introduced, but the method implicitly relies on simulation fidelity and real-time planning assumptions.

free parameters (1)
  • planner cost function weights
    Explicitly adjustable at test time for different tasks, not fitted to evaluation data.
axioms (2)
  • domain assumption: The pre-trained policy captures transferable whole-body dynamics sufficient for steering on novel objects
    Invoked to justify zero-shot generalization without retraining.
  • domain assumption: Simulation-to-real transfer gap is small enough that test-time planning succeeds in hardware
    Required for the real-world Spot demonstrations to validate the method.

pith-pipeline@v0.9.0 · 5541 in / 1411 out tokens · 75676 ms · 2026-05-10T17:32:42.905911+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
