pith. sign in

arxiv: 2606.05687 · v1 · pith:COIVWEKUnew · submitted 2026-06-04 · 💻 cs.RO · cs.SY· eess.SY

Accelerating and Scaling MPC-Guided Reinforcement Learning for Humanoid Locomotion and Manipulation

Pith reviewed 2026-06-28 01:34 UTC · model grok-4.3

classification 💻 cs.RO cs.SYeess.SY
keywords MPC-RLcentroidal dynamicsreinforcement learninghumanoid locomotionmanipulationGPU MPC solvermodel predictive control
0
0 comments X

The pith

MPC-RL uses centroidal-dynamics trajectories to guide reinforcement learning and achieve better humanoid locomotion and manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MPC-RL, a method that incorporates guidance from model predictive control into reinforcement learning training for humanoid robots. It proposes a centroidal-dynamics based MPC reward to leverage trajectory information during training. To make this scalable, it develops a new parallel-in-horizon batched GPU solver called π^n MPC that works without construction or high memory use on time-varying dynamics. Comparative studies and hardware tests show this leads to superior performance in skills like walking and manipulation. A sympathetic reader would care because it makes the combination of MPC's physical grounding and RL's robustness practical for complex humanoid control.

Core claim

Through MPC-RL, the authors establish that using centroidal-dynamics MPC trajectories as guidance in the RL reward function, facilitated by an efficient parallel batched GPU MPC solver, results in policies with superior performance for humanoid locomotion and manipulation tasks compared to standard approaches.

What carries the argument

The π^n MPC solver, a parallel-in-horizon and construction-free batched GPU MPC solver operating directly on time-varying dynamics to provide guidance without high memory usage or pre-compilation.

If this is right

  • MPC-RL achieves superior performance in locomotion and manipulation skills.
  • The approach is practical for massively parallel RL training due to the efficient solver.
  • Hardware validations confirm the effectiveness of the learned policies.
  • The method avoids time-consuming problem construction and excessive training overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could extend to other complex robotic systems beyond humanoids if similar dynamics models are available.
  • Real-time deployment of the learned policies might benefit from the MPC guidance during training leading to more robust behaviors.
  • Future work could explore integrating this with other RL algorithms or dynamics models for further gains.

Load-bearing premise

The centroidal-dynamics MPC trajectories provide useful guidance that improves final policy performance without introducing prohibitive training overhead or requiring per-problem construction.

What would settle it

A direct comparison where MPC-RL policies fail to outperform baseline RL policies in both simulation benchmarks and physical hardware experiments for locomotion and manipulation would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 2606.05687 by Aaron D. Ames, J\'an Drgo\v{n}a, Junheng Li, Liang Wu, Lizhi Yang, Sergio A. Esteban.

Figure 1
Figure 1. Figure 1: MPC-RL Experiment Snapshots. a). Walking on treadmill; b). Wearing an unknown 10 kg vest while walking; c). Push-recovery; d). Carrying an unknown 10 kg payload while walking; e). Pushing a 290kg cart. Full video: https://youtu.be/PrcbXkA1kYg Dai et al. [13] leverages analytical gait synthesis with CLF￾based reward structure for future foot placement guidance in highly unstructured terrain. Krishna et al. … view at source ↗
Figure 2
Figure 2. Figure 2: MPC-Guided Reinforcement Learning (MPC-RL) Architecture. During training, MPC generates per-environment predictive references through [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Scalability of Batched GPU MPC Solvers. ( [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of locomotion velocity-tracking score metric vs. training steps and wall-clock time under different MPC-RL settings. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Velocity tracking comparison in box-pushing simulation. The robot [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories in training time. To make this practical in massively parallel RL, we develop $\pi^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we have found that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at https://github.com/junhengl/mpc-rl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MPC-RL, a framework that incorporates centroidal-dynamics MPC trajectories as a training-time reward signal to guide RL policies for humanoid locomotion and manipulation. It introduces π^n MPC, a parallel-in-horizon and construction-free batched GPU solver that operates on time-varying dynamics to enable efficient massively parallel training. The central claim is that this combination yields superior performance over baselines, supported by comparative studies and hardware validations, with open-source code provided.

Significance. If the empirical results hold with proper controls, the work could offer a scalable way to inject physically grounded guidance into RL without prohibitive overhead, addressing a practical gap in humanoid control. The availability of the code base supports reproducibility and is a strength.

major comments (2)
  1. [Abstract] Abstract: the claim of 'superior performance' from 'comparative studies and hardware validations' is presented without any quantitative metrics, baseline descriptions, or statistical details. This absence prevents evaluation of the central empirical claim and must be addressed with specific results (e.g., success rates, tracking errors, or reward curves) in the main text.
  2. [Results / Experiments] The manuscript does not appear to contain an ablation that isolates the centroidal-dynamics MPC reward term on manipulation tasks (as opposed to the parallel solver itself). Without such an experiment (e.g., in the results section comparing MPC-RL against an RL-only variant with equivalent sample count), it remains unclear whether performance gains on contact-rich manipulation stem from trajectory quality or from increased parallelism.
minor comments (1)
  1. [Method] The π^n MPC solver description would benefit from a brief complexity or memory analysis to substantiate the 'avoid high memory usage' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our empirical results. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'superior performance' from 'comparative studies and hardware validations' is presented without any quantitative metrics, baseline descriptions, or statistical details. This absence prevents evaluation of the central empirical claim and must be addressed with specific results (e.g., success rates, tracking errors, or reward curves) in the main text.

    Authors: The main text presents the quantitative results from comparative studies and hardware validations, including success rates, tracking errors, and reward curves in the Results section. To strengthen the abstract's claim, we will revise it to include key quantitative metrics drawn from those studies. revision: yes

  2. Referee: [Results / Experiments] The manuscript does not appear to contain an ablation that isolates the centroidal-dynamics MPC reward term on manipulation tasks (as opposed to the parallel solver itself). Without such an experiment (e.g., in the results section comparing MPC-RL against an RL-only variant with equivalent sample count), it remains unclear whether performance gains on contact-rich manipulation stem from trajectory quality or from increased parallelism.

    Authors: We agree that an explicit ablation isolating the MPC reward contribution on manipulation tasks would improve clarity. While our comparative studies include RL baselines, we will add a dedicated ablation comparing MPC-RL to an RL-only variant with matched sample counts on the manipulation tasks in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity detected; performance claims rest on external empirical comparisons

full rationale

The paper presents MPC-RL as an engineering combination of centroidal-dynamics MPC reward guidance and a new parallel batched GPU solver (π^nMPC). No equations, derivations, or self-citations are shown that reduce any claimed prediction or uniqueness result to a fitted input or prior author work by construction. The superiority claim is justified solely by comparative studies and hardware validations, which are independent of the method definition itself. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The centroidal-dynamics approximation is treated as a standard modeling choice rather than a new postulate.

pith-pipeline@v0.9.1-grok · 5724 in / 1010 out tokens · 19122 ms · 2026-06-28T01:34:23.045914+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning,

    Z. Gu, J. Li, W. Shen, W. Yu, Z. Xie, S. McCrory, X. Cheng, A. Shamsah, R. Griffin, C. K. Liu,et al., “Humanoid locomotion and manipulation: Current progress and challenges in control, planning, and learning,”IEEE/ASME Transactions on Mechatronics, vol. 31, no. 2, pp. 2300–2330, 2026

  2. [2]

    Tailoring Solution Accuracy for Fast Whole-body Model Predictive Control of Legged Robots,

    C. Khazoom, S. Hong, M. Chignoli, E. Stanger-Jones, and S. Kim, “Tailoring Solution Accuracy for Fast Whole-body Model Predictive Control of Legged Robots,”arXiv preprint arXiv:2407.10789, 2024

  3. [3]

    Online non-linear centroidal mpc for humanoid robot locomotion with step adjustment,

    G. Romualdi, S. Dafarra, G. L’Erario, I. Sorrentino, S. Traversaro, and D. Pucci, “Online non-linear centroidal mpc for humanoid robot locomotion with step adjustment,” in2022 International Conference on Robotics and Automation (ICRA), pp. 10412–10419, IEEE, 2022

  4. [4]

    Optimization-based control for dynamic legged robots,

    P. M. Wensing, M. Posa, Y . Hu, A. Escande, N. Mansard, and A. Del Prete, “Optimization-based control for dynamic legged robots,” IEEE Transactions on Robotics, vol. 40, pp. 43–63, 2023

  5. [5]

    Reinforcement learning for robust parameterized locomotion control of bipedal robots,

    Z. Li, X. Cheng, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, “Reinforcement learning for robust parameterized locomotion control of bipedal robots,” in2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 2811–2817, IEEE, 2021

  6. [6]

    Sim-to-real learning for humanoid box loco-manipulation,

    J. Dao, H. Duan, and A. Fern, “Sim-to-real learning for humanoid box loco-manipulation,” in2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 16930–16936, IEEE, 2024

  7. [7]

    Opt2skill: Imitating dynamically-feasible whole-body trajectories for versatile humanoid loco-manipulation,

    F. Liu, Z. Gu, Y . Cai, Z. Zhou, H. Jung, J. Jang, S. Zhao, S. Ha, Y . Chen, D. Xu,et al., “Opt2skill: Imitating dynamically-feasible whole-body trajectories for versatile humanoid loco-manipulation,” IEEE Robotics and Automation Letters, 2025

  8. [8]

    RL-augmented Adaptive Model Predictive Control for Bipedal Loco- motion over Challenging Terrain,

    J. Kamohara, F. Wu, C. Wamorkar, S. Hutchinson, and Y . Zhao, “RL-augmented Adaptive Model Predictive Control for Bipedal Loco- motion over Challenging Terrain,”arXiv preprint arXiv:2509.18466, 2025

  9. [9]

    Rl-augmented mpc framework for agile and robust bipedal footstep locomotion planning and con- trol,

    S. H. Bang, C. A. Jov ´e, and L. Sentis, “Rl-augmented mpc framework for agile and robust bipedal footstep locomotion planning and con- trol,” in2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids), pp. 607–614, IEEE, 2024

  10. [10]

    Residual MPC: Blend- ing Reinforcement Learning with GPU-Parallelized Model Predictive Control,

    S. H. Jeon, H. J. Lee, S. Hong, and S. Kim, “Residual MPC: Blend- ing Reinforcement Learning with GPU-Parallelized Model Predictive Control,”arXiv preprint arXiv:2510.12717, 2025

  11. [11]

    Learning agile locomotion and adaptive behaviors via rl-augmented mpc,

    Y . Chen and Q. Nguyen, “Learning agile locomotion and adaptive behaviors via rl-augmented mpc,” in2024 IEEE International Confer- ence on Robotics and Automation (ICRA), pp. 11436–11442, IEEE, 2024

  12. [12]

    Amo: Adaptive motion optimization for hyper-dexterous humanoid whole- body control,

    J. Li, X. Cheng, T. Huang, S. Yang, R.-Z. Qiu, and X. Wang, “Amo: Adaptive motion optimization for hyper-dexterous humanoid whole- body control,”arXiv preprint arXiv:2505.03738, 2025

  13. [13]

    Walk the PLANC: Physics-Guided RL for Agile Humanoid Locomotion on Constrained Footholds,

    M. Dai, W. D. Compton, J. Li, L. Yang, and A. D. Ames, “Walk the PLANC: Physics-Guided RL for Agile Humanoid Locomotion on Constrained Footholds,”arXiv preprint arXiv:2601.06286, 2026

  14. [14]

    Ogmp: Oracle guided multi-mode policies for agile and versatile robot control,

    L. Krishna, N. Sobanbabu, and Q. Nguyen, “Ogmp: Oracle guided multi-mode policies for agile and versatile robot control,”arXiv preprint arXiv:2403.04205, 2024

  15. [15]

    Actor-critic model predic- tive control,

    A. Romero, Y . Song, and D. Scaramuzza, “Actor-critic model predic- tive control,” in2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 14777–14784, IEEE, 2024

  16. [16]

    Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control,

    C. Greatwood and A. G. Richards, “Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control,”Autonomous Robots, vol. 43, no. 7, pp. 1681–1693, 2019

  17. [17]

    Learning deep con- trol policies for autonomous aerial vehicles with mpc-guided policy search,

    T. Zhang, G. Kahn, S. Levine, and P. Abbeel, “Learning deep con- trol policies for autonomous aerial vehicles with mpc-guided policy search,” in2016 IEEE international conference on robotics and automation (ICRA), pp. 528–535, IEEE, 2016

  18. [18]

    Dtc: Deep tracking control,

    F. Jenelten, J. He, F. Farshidian, and M. Hutter, “Dtc: Deep tracking control,”Science Robotics, vol. 9, no. 86, p. eadh5401, 2024

  19. [19]

    Synthesis of model predictive control and reinforcement learning: Survey and classifica- tion,

    R. Reiter, J. Hoffmann, D. Reinhardt, F. Messerer, K. Baumg ¨artner, S. Sawant, J. Boedecker, M. Diehl, and S. Gros, “Synthesis of model predictive control and reinforcement learning: Survey and classifica- tion,”Annual Reviews in Control, vol. 61, p. 101045, 2026

  20. [20]

    Cusadi: A gpu parallelization framework for symbolic expressions and optimal control,

    S. H. Jeon, S. Hong, H. J. Lee, C. Khazoom, and S. Kim, “Cusadi: A gpu parallelization framework for symbolic expressions and optimal control,”IEEE Robotics and Automation Letters, vol. 10, no. 2, pp. 899–906, 2024

  21. [21]

    Relu-qp: A gpu-accelerated quadratic programming solver for model-predictive control,

    A. L. Bishop, J. Z. Zhang, S. Gurumurthy, K. Tracy, and Z. Manch- ester, “Relu-qp: A gpu-accelerated quadratic programming solver for model-predictive control,” in2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 13285–13292, IEEE, 2024

  22. [22]

    Cost-Matching Model Predictive Control for Efficient Reinforcement Learning in Humanoid Locomotion,

    W. Cai, K. G. Vamvoudakis, S. Gros, and A. Tzes, “Cost-Matching Model Predictive Control for Efficient Reinforcement Learning in Humanoid Locomotion,”arXiv preprint arXiv:2603.28243, 2026

  23. [23]

    $\pi$MPC: A Parallel-in-horizon and Construction-free NMPC Solver

    L. Wu, B. Yang, J. Li, X. Yang, Y . Mo, Y . Shi, A. D. Ames, and J. Drgo ˇna, “πMPC: A Parallel-in-horizon and Construction-free NMPC Solver,”arXiv preprint arXiv:2601.14414v2, 2026

  24. [24]

    Centroidal dynamics of a humanoid robot,

    D. E. Orin, A. Goswami, and S.-H. Lee, “Centroidal dynamics of a humanoid robot,”Autonomous robots, vol. 35, no. 2, pp. 161–176, 2013

  25. [25]

    Stability of surface contacts for humanoid robots: Closed-form formulae of the contact wrench cone for rectangular support areas,

    S. Caron, Q.-C. Pham, and Y . Nakamura, “Stability of surface contacts for humanoid robots: Closed-form formulae of the contact wrench cone for rectangular support areas,” in2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 5107–5112, IEEE, 2015

  26. [26]

    Gait-Net-augmented Implicit Kino-dynamic MPC for Dynamic Variable-frequency Humanoid Lo- comotion over Discrete Terrains,

    J. Li, Z. Duan, J. Ma, and Q. Nguyen, “Gait-Net-augmented Implicit Kino-dynamic MPC for Dynamic Variable-frequency Humanoid Lo- comotion over Discrete Terrains,” inProceedings of Robotics: Science and Systems, (LosAngeles, CA, USA), June 2025

  27. [27]

    Fast alternating direction optimization methods,

    T. Goldstein, B. O’Donoghue, S. Setzer, and R. Baraniuk, “Fast alternating direction optimization methods,”SIAM Journal on Imaging Sciences, vol. 7, no. 3, pp. 1588–1623, 2014

  28. [28]

    mjlab: A lightweight framework for GPU-accelerated robot learning,

    K. Zakka, Q. Liao, B. Yi, L. L. Lay, K. Sreenath, and P. Abbeel, “mjlab: A Lightweight Framework for GPU-Accelerated Robot Learn- ing,”arXiv preprint arXiv:2601.22074, 2026

  29. [29]

    BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

    Q. Liao, T. E. Truong, X. Huang, Y . Gao, G. Tevet, K. Sreenath, and C. K. Liu, “Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion,”arXiv preprint arXiv:2508.08241, 2025

  30. [30]

    A Differentiable Interior-Point Method in Single Precision

    J. Arrizabalaga, K. Tracy, and Z. Manchester, “A Differen- tiable Interior-Point Method in Single Precision,”arXiv preprint arXiv:2605.17913, 2026

  31. [31]

    OptNet: Differentiable Optimization as a Layer in Neural Networks,

    B. Amos and J. Z. Kolter, “OptNet: Differentiable Optimization as a Layer in Neural Networks,”arXiv preprint arXiv:1703.00443, 2017

  32. [32]

    Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,

    S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,”Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011

  33. [33]

    OSQP: An operator splitting solver for quadratic programs,

    B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd, “OSQP: An operator splitting solver for quadratic programs,” in2018 UKACC 12th international conference on control (CONTROL), pp. 339–339, IEEE, 2018

  34. [34]

    Clf-rl: Control lyapunov function guided reinforcement learning,

    K. Li, Z. Olkin, Y . Yue, and A. D. Ames, “Clf-rl: Control lyapunov function guided reinforcement learning,”IEEE Robotics and Automa- tion Letters, 2026. APPENDIX TABLE IV: Comparison of inertia-based gain selection for THEMIS v2. Herek g is the gear ratio,I rotor =A i/k2 g is the rotor inertia, andω n = 2πfn. Joint Method kg Irotor (kg m2) Ai (kg m2) Jef...