Pith · machine review for the scientific record

arxiv: 2604.08508 · v2 · submitted 2026-04-09 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:32 UTC · model grok-4.3

classification 💻 cs.RO
keywords loco-manipulation · whole-body control · legged robots · test-time planning · sample-based planner · sim-to-real · dynamic manipulation · quadruped

The pith

Steering a pre-trained whole-body control policy with a sample-based planner at test time lets legged robots manipulate large unseen objects dynamically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that legged robots can handle dynamic loco-manipulation of heavy and oversized objects by using a sample-based planner to steer a pre-trained whole-body control policy during operation. This combination generalizes across varied objects and tasks with no retraining or retuning required, as verified on a real quadruped performing tasks like uprighting a tire heavier than the robot and dragging a large barrier. The same steering approach extends to humanoid robots for actions such as door opening and table pushing in simulation. The planner supports further adaptation through on-the-fly changes to its cost function. A sympathetic reader would care because the method separates policy training from task-specific execution, potentially making versatile whole-body behaviors practical without exhaustive per-task learning.

Core claim

By performing test-time steering of a pre-trained whole-body control policy with a sample-based planner, legged robots can solve a variety of dynamic loco-manipulation tasks. The approach generalizes to a diverse set of objects and tasks with no additional tuning or training and can be further enhanced by flexibly adjusting the cost function at test time. Real-world demonstrations on a quadruped include uprighting a tire heavier than the robot's nominal lifting capacity and dragging a crowd-control barrier larger and taller than the robot itself, while the method also applies to humanoid loco-manipulation tasks such as opening a door and pushing a table in simulation.

What carries the argument

Test-time steering of a pre-trained whole-body control policy by a sample-based planner, which generates action sequences to guide the policy toward successful contact-rich loco-manipulation outcomes.
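The steering loop can be sketched in miniature. This is a hedged illustration, not the paper's implementation: `pretrained_policy`, the point-mass state, and all constants are hypothetical stand-ins, with a predictive-sampling-style planner choosing among randomly perturbed command sequences and executing only the first command of the best one.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_policy(state, command):
    # Stand-in for the frozen whole-body policy: moves the state
    # toward the commanded target at a bounded per-step rate.
    return np.clip(command - state, -0.1, 0.1)

def rollout_cost(state, commands, goal):
    # Roll the policy forward under a candidate command sequence
    # and accumulate a simple squared distance-to-goal cost.
    cost = 0.0
    for cmd in commands:
        state = state + pretrained_policy(state, cmd)
        cost += np.sum((state - goal) ** 2)
    return cost

def steer(state, goal, horizon=5, n_samples=64, noise=0.3):
    # Predictive sampling: perturb a nominal command sequence,
    # keep the lowest-cost candidate, return its first command.
    nominal = np.tile(goal, (horizon, 1))
    candidates = nominal + noise * rng.standard_normal((n_samples, horizon, goal.size))
    costs = [rollout_cost(state, c, goal) for c in candidates]
    return candidates[int(np.argmin(costs))][0]

state, goal = np.zeros(2), np.array([1.0, -0.5])
for _ in range(30):  # receding-horizon loop: replan, execute, repeat
    state = state + pretrained_policy(state, steer(state, goal))
print(np.round(state, 2))
```

The policy itself is never retrained; only the planner's sampling around the goal changes what the policy is asked to do.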

If this is right

  • Robots can manipulate objects that exceed their nominal lifting or pushing capacity.
  • The same pre-trained policy works on new objects and tasks without retraining.
  • Cost-function adjustments at runtime provide task-specific flexibility.
  • The sim-to-real pipeline succeeds for contact-rich dynamic behaviors.
  • The steering technique applies across both quadrupedal and humanoid platforms.
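The runtime cost-adjustment point can be made concrete. A minimal sketch, assuming the planner cost is a weighted sum of task terms; the term names, weights, and 3-D state are hypothetical, not the paper's actual cost functions:

```python
import numpy as np

def make_cost(weights, target):
    # Hypothetical weighted planner cost: switching tasks only
    # changes these weights and targets, never the policy.
    def cost(state):
        terms = {
            "goal": np.sum((state - target) ** 2),  # reach a target pose
            "upright": (1.0 - state[-1]) ** 2,      # stand-in orientation term
        }
        return sum(weights[k] * terms[k] for k in terms)
    return cost

# "Drag the barrier": emphasize position, ignore orientation.
drag = make_cost({"goal": 1.0, "upright": 0.0}, np.array([2.0, 0.0, 0.0]))
# "Upright the tire": re-weight at test time, same machinery.
upright = make_cost({"goal": 0.1, "upright": 5.0}, np.array([0.0, 0.0, 1.0]))

s = np.array([0.0, 0.0, 0.0])
print(drag(s), upright(s))
```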

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separating learned policy skills from runtime planning could lower the data volume needed to achieve broad robot competence.
  • The same pre-trained policy might support a wider range of robot body types if the planner accounts for kinematic differences.
  • Integrating real-time perception with the planner could enable fully autonomous operation on novel objects in unstructured settings.
  • Similar test-time steering might compose basic skills into longer manipulation sequences without additional policy training.

Load-bearing premise

The pre-trained whole-body policy already encodes sufficient dynamics and contact behaviors so that test-time planning can reliably steer it to success on unseen objects without model mismatch or instability in real-world execution.

What would settle it

Apply the steered policy to a new object with substantially different mass distribution, geometry, or surface friction in a real-world trial and observe whether the robot completes the loco-manipulation task or instead exhibits instability or failure to make progress.
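That settling experiment amounts to a success-rate measurement under controlled object variation. A toy sketch of the protocol follows, where `trial` is a deliberately crude stand-in (success iff a hypothetical force budget exceeds the friction force) rather than any real robot or contact model:

```python
import numpy as np

rng = np.random.default_rng(1)
G, FORCE_CAP = 9.81, 150.0  # hypothetical push-force budget (N)

def trial(mass, friction):
    # Crude proxy for one hardware trial: the manipulation succeeds
    # only if the force budget exceeds the friction force on the object.
    return FORCE_CAP > friction * mass * G

def ood_sweep(n=1000):
    # Sample objects with substantially varied mass and friction,
    # as the settling experiment proposes, and report the success rate.
    masses = rng.uniform(5.0, 25.0, n)
    frictions = rng.uniform(0.2, 1.2, n)
    return float(np.mean([trial(m, f) for m, f in zip(masses, frictions)]))

rate = ood_sweep()
print(f"success rate: {rate:.2f}")
```

A rate well below 1.0 over such a sweep would localize the failure boundary the review asks about, rather than a single pass/fail anecdote.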

Figures

Figures reproduced from arXiv: 2604.08508 by Brandon Hung, Chao Cao, Dmitry Yershov, Duy Ta, Farzad Niroui, Jan Brüdigam, Jiuguang Wang, John Z. Zhang, Leonor Fermoselle, Maks Sorokin, Preston Culbertson, Simon Le Cléac'h, Stephen Phillips, Tao Pang, Tong Zhao, Xinghao Zhu, Zachary Manchester.

Figure 1: First row from left to right: Spot robot uprights a tire, crowd-control barrier, traffic cone, and chair. Second row …
Figure 2: System overview: Our method takes a hierarchical …
Figure 3: Illustrations comparing (a) standard dynamics rollouts …
Figure 4: Comparing Sumo (ours, yellow) to end-to-end RL …
Figure 5: Top: comparison of Sumo (yellow, ours), E2E RL …
Figure 6: Maximum success rate vs. compute time comparison …
Figure 7: Freeze-frame sequences showing Spot robot task progressions: (a) uprighting a tire, (b) uprighting a crowd control …
Figure 8: Freeze-frame sequences showing G1 humanoid task …
Original abstract

This paper presents a sim-to-real approach that enables legged robots to dynamically manipulate large and heavy objects with whole-body dexterity. Our key insight is that by performing test-time steering of a pre-trained whole-body control policy with a sample-based planner, we can enable these robots to solve a variety of dynamic loco-manipulation tasks. Interestingly, we find our method generalizes to a diverse set of objects and tasks with no additional tuning or training, and can be further enhanced by flexibly adjusting the cost function at test time. We demonstrate the capabilities of our approach through a variety of challenging loco-manipulation tasks on a Spot quadruped robot in the real world, including uprighting a tire heavier than the robot's nominal lifting capacity and dragging a crowd-control barrier larger and taller than the robot itself. Additionally, we show that the same approach can be generalized to humanoid loco-manipulation tasks, such as opening a door and pushing a table, in simulation. Project code and videos are available at https://sumo.rai-inst.com/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents SUMO, a sim-to-real approach for dynamic whole-body loco-manipulation on legged robots. The core idea is to steer a pre-trained whole-body control policy at test time using a sample-based planner, enabling tasks such as uprighting heavy tires and dragging large barriers on a Spot quadruped in the real world. The authors claim that this method generalizes to diverse objects and tasks without additional training or tuning, and can be enhanced by adjusting the cost function at test time. They also demonstrate extension to humanoid robots for tasks like door opening and table pushing in simulation.

Significance. If the generalization without tuning holds, this approach would be significant for robotics by allowing pre-trained policies to handle a range of loco-manipulation tasks through online planning rather than retraining. The real-world demonstrations on challenging physical tasks provide direct evidence of practical applicability, and the flexibility in cost function adjustment is a notable feature. The provision of project code and videos is a strength for reproducibility.

major comments (2)
  1. [Abstract] The generalization claim ('generalizes to a diverse set of objects and tasks with no additional tuning or training') is not accompanied by quantitative metrics, success rates over multiple trials, ablation studies on planner parameters, or analysis of failure cases. This makes it difficult to assess the reliability of the test-time steering for out-of-distribution objects beyond the two qualitative real-world examples.
  2. [Method/Experiments] The central claim requires that the pre-trained policy has internalized sufficiently accurate contact/friction models so that sample-based rollouts can steer it to success on unseen objects (e.g., a tire exceeding nominal lift capacity) without instability. No sensitivity analysis, model-mismatch experiments, or OOD testing under variations in mass distribution or surface properties is reported, leaving the no-tuning generalization dependent on an unverified transfer property.
minor comments (2)
  1. [Abstract] The abstract refers to 'a variety of challenging loco-manipulation tasks' but provides details on only two real-world examples; listing the full set of evaluated tasks with brief outcomes would improve clarity.
  2. Ensure that all statements about adjustable cost functions at test time include explicit references to the corresponding planner implementation details and any associated hyperparameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's significance and reproducibility. We address each major comment point-by-point below, providing clarifications and indicating where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The generalization claim ('generalizes to a diverse set of objects and tasks with no additional tuning or training') is not accompanied by quantitative metrics, success rates over multiple trials, ablation studies on planner parameters, or analysis of failure cases. This makes it difficult to assess the reliability of the test-time steering for out-of-distribution objects beyond the two qualitative real-world examples.

    Authors: We agree that quantitative metrics would provide stronger support for the generalization claim in the abstract. The manuscript presents qualitative real-world demonstrations on two challenging tasks (uprighting a heavy tire and dragging a large barrier) plus simulated humanoid extensions to illustrate flexibility without retraining. In the revised manuscript, we have added success rates over multiple trials for the primary real-world tasks, an analysis of observed failure cases, and ablations on planner parameters (number of samples and planning horizon) in the supplementary material. The abstract has been updated to reference these additions. revision: yes

  2. Referee: [Method/Experiments] The central claim requires that the pre-trained policy has internalized sufficiently accurate contact/friction models so that sample-based rollouts can steer it to success on unseen objects (e.g., a tire exceeding nominal lift capacity) without instability. No sensitivity analysis, model-mismatch experiments, or OOD testing under variations in mass distribution or surface properties is reported, leaving the no-tuning generalization dependent on an unverified transfer property.

    Authors: The referee correctly identifies that the approach depends on the pre-trained policy having learned sufficiently accurate contact and friction behaviors from simulation. The policy is trained in a physics-based simulator that models these dynamics, and the sample-based planner performs rollouts using the identical simulator and policy. The successful sim-to-real transfer on objects exceeding nominal capacities provides supporting evidence, but we acknowledge the absence of explicit sensitivity analyses or model-mismatch experiments in the original submission. The revised manuscript includes an expanded discussion section that explains the simulator's contact model assumptions and the boundaries of the observed generalization. Comprehensive OOD testing with controlled variations in mass distribution and surface properties was not performed and would require additional hardware setups. revision: partial
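The ablations promised in the responses (number of samples and planning horizon) have a standard harness shape. A sketch with a toy quality metric standing in for task success; the metric only models the expected trend (more samples help, very long horizons add rollout noise) and reports no result from the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def plan_quality(n_samples, horizon, n_trials=200):
    # Toy stand-in for the planner: the best of n_samples random
    # candidate scores improves with sample count, while a longer
    # horizon adds a rollout-noise penalty. Hypothetical shape only.
    scores = []
    for _ in range(n_trials):
        candidates = rng.normal(0.0, 1.0, n_samples) - 0.05 * horizon * rng.random()
        scores.append(candidates.max())
    return float(np.mean(scores))

# Sweep the two planner parameters the rebuttal names.
grid = {(n, h): plan_quality(n, h)
        for n, h in itertools.product([8, 32, 128], [5, 20])}
for (n, h), q in sorted(grid.items()):
    print(f"samples={n:4d} horizon={h:2d} quality={q:.2f}")
```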

Circularity Check

0 steps flagged

No significant circularity; empirical method with external real-world validation

Full rationale

The paper presents an empirical sim-to-real method: a pre-trained whole-body policy is steered at test time by a sample-based planner, with optional cost-function adjustment. Generalization to unseen objects and tasks is asserted via real-world demonstrations on a Spot robot (e.g., tire uprighting, barrier dragging) and simulated humanoid tasks, without any reported equations, fitted parameters, or derivations that reduce the claimed outcomes to quantities defined by the same evaluation data. No self-citation chains, self-definitional loops, or renamed known results appear in the provided text; the pre-training step is treated as an external input whose dynamics are assumed sufficient but are not derived within the paper itself. This is the common case of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that a single pre-trained policy plus planner can handle contact-rich dynamics for unseen objects; no new physical entities are introduced, but the method implicitly relies on simulation fidelity and real-time planning assumptions.

free parameters (1)
  • planner cost function weights
    Explicitly adjustable at test time for different tasks, not fitted to evaluation data.
axioms (2)
  • domain assumption: The pre-trained policy captures transferable whole-body dynamics sufficient for steering on novel objects
    Invoked to justify zero-shot generalization without retraining.
  • domain assumption: Simulation-to-real transfer gap is small enough that test-time planning succeeds in hardware
    Required for the real-world Spot demonstrations to validate the method.

pith-pipeline@v0.9.0 · 5541 in / 1411 out tokens · 75676 ms · 2026-05-10T17:32:42.905911+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
