MPC-Injection: Biasing Off-Policy Locomotion RL Toward Controller-Induced Behavior Basins

Brian Plancher (Dartmouth College); Roy Xing (Dartmouth College); Seyoung Ree (Harvard University)

arxiv: 2606.26392 · v1 · pith:KE3EBX6Onew · submitted 2026-06-24 · 💻 cs.RO

Pith reviewed 2026-06-26 01:32 UTC · model grok-4.3

classification 💻 cs.RO

keywords: locomotion RL, MPC injection, replay buffer biasing, behavior basin, off-policy reinforcement learning, quadruped gait, model predictive control, sim-to-real transfer

The pith

MPC-Injection steers off-policy RL locomotion policies into controller-preferred behavior basins by injecting MPC transitions into the replay buffer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for robot locomotion often converges to locally optimal but unusable behaviors such as vibrating limbs or torso scooting that maximize return without producing deployable gaits. MPC-Injection counters this by inserting transitions generated by a model predictive controller solving the same Markov decision process directly into the RL replay buffer. This biases the policy toward the controller's preferred gait using only a one- or two-term task reward, without redesigning the reward or adding discriminators, retargeting, or auxiliary objectives. The resulting gaits are qualitatively comparable to those from reward shaping with twenty-one tuned terms or from adversarial motion priors, as demonstrated on a 2D walker in simulation and via sim-to-real transfer on a Go2 quadruped. A sympathetic reader would care because the approach transfers controller behavior purely through the state distribution in the buffer, simplifying reward design while reaching behaviors that pure RL may miss.

Core claim

By inserting transitions from an MPC controller into the off-policy RL replay buffer, the method biases actor-critic updates toward states visited by the controller, allowing the policy to reach desirable gaits even under one- or two-term task rewards that would otherwise lead to degenerate behaviors.

What carries the argument

MPC-Injection, the direct insertion of MPC-generated transitions into the RL replay buffer to shift the replay state distribution toward the controller's behavior basin.
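The mechanism can be pictured as a replay buffer that accepts transitions from two sources. The following is a minimal sketch, not the authors' implementation: the class name `MixedReplayBuffer`, the `from_mpc` flag, and the reading of the injection ratio as the fraction of each sampled batch drawn from MPC transitions are all assumptions made for illustration (the paper may instead mix at insertion time).

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Sequence


class Transition(NamedTuple):
    state: Sequence[float]
    action: Sequence[float]
    reward: float
    next_state: Sequence[float]
    done: bool


class MixedReplayBuffer:
    """Hypothetical buffer keeping RL and MPC transitions in separate FIFO stores."""

    def __init__(self, capacity: int, mpc_ratio: float, seed: int = 0) -> None:
        if not 0.0 <= mpc_ratio <= 1.0:
            raise ValueError("mpc_ratio must lie in [0, 1]")
        self.mpc_ratio = mpc_ratio  # assumed meaning: share of each batch from MPC
        self._rl: Deque[Transition] = deque(maxlen=capacity)
        self._mpc: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition, from_mpc: bool = False) -> None:
        # Route the transition to the store matching its source.
        (self._mpc if from_mpc else self._rl).append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        if not self._rl and not self._mpc:
            raise ValueError("cannot sample from an empty buffer")
        n_mpc = round(batch_size * self.mpc_ratio)
        # Fall back to the non-empty store if the other one has no data yet.
        if not self._mpc:
            n_mpc = 0
        elif not self._rl:
            n_mpc = batch_size
        n_rl = batch_size - n_mpc
        # Sample with replacement from each store, then mix the batch.
        batch = self._rng.choices(list(self._mpc), k=n_mpc) if n_mpc else []
        batch += self._rng.choices(list(self._rl), k=n_rl) if n_rl else []
        self._rng.shuffle(batch)
        return batch
```

Keeping the two stores separate is a design choice of this sketch: it makes the effective ratio explicit at sampling time, so the actor-critic update sees a controlled share of controller-visited states regardless of how fast each source produces data.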

If this is right

Policies reach usable gaits with rewards containing only one or two terms rather than twenty-one tuned terms.
No discriminator, kinematic retargeting, or auxiliary objective is required, unlike adversarial motion priors.
The bias works on a 2D walker in simulation and transfers to a real Go2 quadruped.
Actor-critic updates become directed toward controller-visited states that pure RL may not reach under simple rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The injection technique could reduce reliance on reward engineering in other off-policy RL control tasks where an approximate solver exists.
Varying the fraction or timing of injected transitions might allow tuning the strength of the bias versus exploration.
The approach may combine with existing basin-escaping methods to handle cases where the MPC itself is imperfect.
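One way to picture the "fraction or timing" knob in the list above is a schedule that maps the training step to an injection ratio. The sketch below is purely illustrative: the function name `injection_ratio`, the linear decay, and the default values are hypothetical and do not come from the paper, whose figures report ratios between 0% and 100%.

```python
def injection_ratio(step: int, total_steps: int,
                    start: float = 0.5, end: float = 0.0) -> float:
    """Linearly interpolate a hypothetical MPC-injection ratio from `start` to `end`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1] so steps past the horizon hold at `end`.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress
```

A decaying schedule of this kind would lean on the controller early, when pure RL is most likely to settle into a degenerate basin, and hand control back to on-policy exploration later. Whether that helps is an open empirical question.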

Load-bearing premise

Transitions from the MPC controller can be inserted into the RL replay buffer without introducing harmful distribution shift or instability in the actor-critic updates.

What would settle it

A trial in which the policy trained with MPC-Injection under the one- or two-term reward produces gaits no better than pure RL or diverges due to the injected transitions.

Figures

Figures reproduced from arXiv: 2606.26392 by Brian Plancher (Dartmouth College), Roy Xing (Dartmouth College), Seyoung Ree (Harvard University).

**Figure 2.** MPC-Injection: An MPC controller is deployed in the same simulation environment used ...

**Figure 3.** Episodic return on the walker velocity-tracking reward across MPC-Injection ratios (0%, ...

**Figure 4.** State-space structure and qualitative behavior under 0% and 25% MPC-Injection with ...

**Figure 5.** Quadruped behavior comparison between pure RL and 25% MPC-Injection in simulation.

**Figure 6.** Qualitative comparisons between reward shaping (a), AMP (b), and MPC-Injection (c).

**Figure 7.** Snapshots of RL policies with the same simple torso velocity-tracking reward function on ...

**Figure 8.** Log-scale box plot showing the evolution of torso-height distributions at different training ...

**Figure 9.** Walker behavior under varying MPC-Injection ratios (25% top-left, 50% top-right, 75% ...

**Figure 10.** Quadruped training curves across 0%, 25%, 50%, 75%, and 100% MPC-Injection for ...

Original abstract

Reinforcement learning (RL) for locomotion frequently converges to locally optimal but undeployable behaviors, such as vibrating limbs or scooting on the torso, that maximize return without producing a usable gait. We present MPC-Injection, a low-overhead method that steers RL toward a designer-preferred gait by inserting transitions into the replay buffer from a model predictive controller solving the same Markov decision process. Unlike reward shaping, MPC-Injection does not require redesigning the task reward, and unlike adversarial imitation learning, it adds no discriminator, no kinematic retargeting, and no auxiliary objective. Instead, the controller's preferred behavior is transferred to the policy purely through the replay state distribution. On a 2D walker in simulation and with sim-to-real evaluation on a Go2 quadruped, we show that MPC-Injection drives the policy into the controller's behavior basin using a one to two-term task reward, producing gaits qualitatively comparable to those of reward shaping with twenty-one tuned terms and of adversarial motion priors without their discriminator and retargeting overhead. We further analyze how the injected transitions bias actor-critic updates toward controller-visited states, allowing the policy to learn behaviors that pure RL may fail to reach under simple reward functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MPC-Injection adds MPC transitions to the RL replay buffer to bias toward usable gaits with minimal reward terms, but the support stays qualitative and the model mismatch risk is unaddressed in the given description.

MPC-Injection inserts transitions from an MPC controller directly into the off-policy RL replay buffer to steer the policy toward the controller's preferred states. The method uses only a one- or two-term task reward, avoiding both multi-term reward shaping and the extra networks required by adversarial motion priors.

The paper does a clean job explaining the mechanism: the injected transitions change the state distribution seen by the actor-critic updates, pulling the policy into the desired basin without auxiliary objectives. The 2D walker simulation and Go2 sim-to-real examples show gaits that look comparable to the heavier baselines, which is the practical payoff for legged robot work.

The soft spot is the evidence. The abstract and description give only qualitative comparisons with no metrics, ablations, error bars, or protocol details, so it is difficult to gauge how consistent the improvement is or how much the injection actually reduces failure modes. The stress-test concern about distribution shift also lands: MPC typically runs on a simplified model and finite horizon, while the RL environment uses full simulator dynamics. If those transitions produce mismatched value targets, the bias could introduce instability rather than help, and nothing in the provided text shows checks for that.

This is for people working on off-policy RL for legged locomotion who want a lighter alternative to reward tuning or imitation. A reader already running similar setups could test the injection quickly and see whether it holds in their domain.

I would send it to peer review. The idea is direct and targets a real pain point, so it deserves a full look even if the experiments need to be expanded to make the claims stick.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MPC-Injection, a low-overhead technique that inserts transitions generated by an MPC controller (solving the same MDP) into the replay buffer of an off-policy actor-critic RL algorithm for locomotion. This biases the policy toward the controller-induced behavior basin, enabling usable gaits with only one- or two-term task rewards. Qualitative results are reported on a 2D walker in simulation and sim-to-real transfer on a Unitree Go2 quadruped, with an accompanying analysis of how the injected transitions affect actor-critic updates; the method is contrasted with reward shaping (21 terms) and adversarial motion priors (discriminator + retargeting).

Significance. If the central claim is supported, the approach provides a practical alternative for steering RL locomotion policies without reward redesign or auxiliary objectives, potentially lowering engineering overhead while leveraging existing MPC solutions. The replay-buffer biasing perspective and analysis of state-distribution effects on updates contribute a distinct angle on navigating behavior basins in off-policy RL.

major comments (2)

Abstract: the central claim that MPC-Injection produces gaits 'qualitatively comparable' to reward shaping and AMP rests on qualitative descriptions alone; no quantitative metrics, ablation studies, success rates, or error bars are supplied, leaving the empirical support for the claim difficult to evaluate.
Method description (MPC transition insertion): the load-bearing assumption that MPC-generated transitions can be inserted directly into the off-policy replay buffer without harmful distribution shift is not addressed; the paper provides no analysis or experiments quantifying alignment between the MPC (approximate model, receding horizon) and true simulator dynamics distributions or demonstrating stability of the resulting Q-targets under the minimal reward.
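To make the second concern concrete, consider how an injected transition enters the critic update. The expression below is a generic one-step actor-critic target, written under the assumption of a deterministic-policy bootstrap in the style of TD3. The symbols $\theta^-$ (target critic parameters), $\phi$ (actor parameters), $d$ (termination flag), and the use of a reward $r$ rather than a cost are notation chosen for this sketch, not taken from the paper.

```latex
y = r + \gamma\,(1 - d)\, Q_{\theta^-}\!\bigl(s',\, \pi_\phi(s')\bigr),
\qquad
\mathcal{L}(\theta) = \bigl(Q_\theta(s, a_{\mathrm{MPC}}) - y\bigr)^2
```

The stored controller action $a_{\mathrm{MPC}}$ appears only in the regressed value $Q_\theta(s, a_{\mathrm{MPC}})$, while the bootstrap term evaluates the current policy at $s'$. The target is therefore only as reliable as the critic's estimate at controller-visited next states, and the tuple $(s, a_{\mathrm{MPC}}, r, s')$ is only a valid sample of the MDP if it was generated under the same dynamics the policy is trained on. Both are points the referee asks the authors to document.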

minor comments (1)

Abstract: the phrasing 'one to two-term task reward' would benefit from an explicit example of the reward terms employed in the reported experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of MPC-Injection as a low-overhead alternative. We address each major comment below, indicating planned revisions to strengthen the manuscript.

Referee: Abstract: the central claim that MPC-Injection produces gaits 'qualitatively comparable' to reward shaping and AMP rests on qualitative descriptions alone; no quantitative metrics, ablation studies, success rates, or error bars are supplied, leaving the empirical support for the claim difficult to evaluate.

Authors: We agree that the empirical support would be strengthened by quantitative evidence. In the revision we will add metrics including average episode return, forward velocity tracking error, and a gait stability score (e.g., torso height variance) computed over 5–10 random seeds with error bars. We will also include an ablation on the fraction of injected transitions and report success rates (fraction of runs that produce a stable forward gait) for all methods under the minimal reward.

Revision: yes

Referee: Method description (MPC transition insertion): the load-bearing assumption that MPC-generated transitions can be inserted directly into the off-policy replay buffer without harmful distribution shift is not addressed; the paper provides no analysis or experiments quantifying alignment between the MPC (approximate model, receding horizon) and true simulator dynamics distributions or demonstrating stability of the Q-targets under the minimal reward.

Authors: The manuscript already contains an analysis of how injected transitions bias the actor and critic updates toward controller-visited states. However, we acknowledge the need for explicit quantification of distribution shift and Q-target stability. In revision we will add (i) a comparison of state-visitation distributions (e.g., via Wasserstein distance or KL divergence on key state variables) between pure MPC rollouts and the mixed replay buffer, and (ii) plots of Q-target variance and temporal-difference error on held-out states when training with versus without injection under the one- or two-term reward.

Revision: yes
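As an illustration of the first proposed check, the Wasserstein-1 distance between two equally sized samples of a scalar state variable reduces to the mean absolute difference of the sorted samples. The sketch below is a plain-Python version under that equal-size assumption; the function name `wasserstein_1d`, the choice of torso height as the variable, and the sample values are hypothetical and are not results from the paper.

```python
from typing import Sequence


def wasserstein_1d(x: Sequence[float], y: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equally sized 1-D empirical samples."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("expected two non-empty samples of equal length")
    # In 1-D the optimal transport plan matches order statistics,
    # so pairing the sorted samples gives the exact distance.
    xs, ys = sorted(x), sorted(y)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


# Made-up torso heights (metres) from MPC rollouts and from the mixed buffer.
mpc_heights = [0.30, 0.31, 0.29]
buffer_heights = [0.12, 0.30, 0.28]
shift = wasserstein_1d(mpc_heights, buffer_heights)
```

For samples of unequal size or for multi-dimensional state vectors, a library routine or a different divergence would be needed; this sketch covers only the simplest per-variable comparison.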

Circularity Check

0 steps flagged

No significant circularity; empirical method with external MPC input

full rationale

The paper describes MPC-Injection as an empirical technique that inserts transitions generated by an external MPC controller (solving the same MDP) directly into the off-policy replay buffer to bias actor-critic updates. No derivation chain, equations, or self-citations are present in the provided text that reduce any claim to a self-definition, fitted input renamed as prediction, or load-bearing self-citation. The central claim is framed as an experimental outcome validated on 2D walker simulation and Go2 hardware, relying on standard RL components rather than internal equivalence by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no explicit free parameters, axioms, or invented entities beyond standard concepts in RL and MPC.

[1] D. Müller, E. Knoop, D. Mylonopoulos, A. Serifi, M. A. Hopkins, R. Grandia, and M. Bächer. Olaf: Bringing an Animated Character to Life in the Physical World, Dec. 2025. URL http://arxiv.org/abs/2512.16705. arXiv:2512.16705 [cs]

[2] Y. Liang, Z. Peng, Y. Zhao, and W. Yan. Learning robust bipedal running via structured gait and trajectory guidance. Robotica, pages 1–19, Jan. 2026. ISSN 0263-5747, 1469-8668. doi:10.1017/S0263574725103007. URL https://www.cambridge.org/core/product/identifier/S0263574725103007/type/journal_article

[3] H. Xue, T. He, Z. Wang, Q. Ben, W. Xiao, Z. Luo, X. Da, F. Castañeda, G. Shi, S. Sastry, L. J. Fan, and Y. Zhu. Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer, Nov. 2025. URL http://arxiv.org/abs/2512.01061. arXiv:2512.01061 [cs]

[4] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control. ACM Transactions on Graphics, 40(4):1–20, Aug. 2021. ISSN 0730-0301, 1557-7368. doi:10.1145/3450626.3459670. URL http://arxiv.org/abs/2104.02180. arXiv:2104.02180 [cs]

[5] C. Li, M. Vlastelica, S. Blaes, J. Frey, F. Grimminger, and G. Martius. Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations, Nov. 2022. URL http://arxiv.org/abs/2206.11693. arXiv:2206.11693 [cs]

[6] S. Zhao, Y. Ze, Y. Wang, C. K. Liu, P. Abbeel, G. Shi, and R. Duan. ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning, Oct. 2025. URL http://arxiv.org/abs/2510.05070. arXiv:2510.05070 [cs]

[8] Y. Ze, Z. Chen, J. P. Araújo, Z.-a. Cao, X. B. Peng, J. Wu, and C. K. Liu. TWIST: Teleoperated Whole-Body Imitation System, May 2025. URL http://arxiv.org/abs/2505.02833. arXiv:2505.02833 [cs]

[9] M. Dawood, N. Dengler, J. De Heuvel, and M. Bennewitz. Handling Sparse Rewards in Reinforcement Learning Using Model Predictive Control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 879–885, London, United Kingdom, May 2023. IEEE. ISBN 979-8-3503-2365-8. doi:10.1109/ICRA48891.2023.10161492. URL https://ieeexplore.ieee.org/document/10161492/

[11] J. Shin, A. Hakobyan, M. Park, Y. Kim, G. Kim, and I. Yang. Infusing model predictive control into meta-reinforcement learning for mobile robots in dynamic environments. IEEE Robotics and Automation Letters, 7(4):10065–10072, Oct. 2022. ISSN 2377-3766, 2377-3774. doi:10.1109/LRA.2022.3191234. URL http://arxiv.org/abs/2109.07120. arXiv:2109.07120 [cs]

[12] J. Brüdigam, A.-A. Abbas, M. Sorokin, K. Fang, B. Hung, M. Guru, S. Sosnowski, J. Wang, S. Hirche, and S. L. Cleac’h. Jacta: A Versatile Planner for Learning Dexterous and Whole-body Manipulation, Oct. 2024. URL http://arxiv.org/abs/2408.01258. arXiv:2408.01258 [cs]

[13] A. D. Laud. Theory and application of reward shaping in reinforcement learning. PhD Thesis, University of Illinois at Urbana-Champaign, USA, 2004

[14] A. Y. Ng, D. Harada, and S. J. Russell. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, pages 278–287, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2

[15] S. H. Jeon, S. Heim, C. Khazoom, and S. Kim. Benchmarking Potential Based Rewards for Learning Humanoid Locomotion. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9204–9210, May 2023. doi:10.1109/ICRA48891.2023.10160885. URL http://arxiv.org/abs/2307.10142. arXiv:2307.10142 [cs]

[16] A. Malysheva, D. Kudenko, and A. Shpilman. Learning to Run with Potential-Based Reward Shaping and Demonstrations from Video Data. In 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV), pages 286–291, Nov. 2018. doi:10.1109/ICARCV.2018.8581310. URL http://arxiv.org/abs/2012.08824. arXiv:2012.08824 [cs]

[17] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowe. Expressing Arbitrary Reward Functions as Potential-Based Advice. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), Feb. 2015. ISSN 2374-3468, 2159-5399. doi:10.1609/aaai.v29i1.9628. URL https://ojs.aaai.org/index.php/AAAI/article/view/9628

[18] T. Westenbroek, F. Castaneda, A. Agrawal, S. Sastry, and K. Sreenath. Lyapunov Design for Robust and Efficient Robotic Reinforcement Learning. In 6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, Nov. 2022. doi:10.48550/arXiv.2208.06721. URL http://arxiv.org/abs/2208.06721. arXiv:2208.06721 [cs]

[19] G. Kim, Y.-H. Lee, and H.-W. Park. A Learning Framework for Diverse Legged Robot Locomotion Using Barrier-Based Style Rewards. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 10004–10010, Atlanta, GA, USA, May 2025. IEEE. ISBN 979-8-3315-4139-2. doi:10.1109/ICRA55743.2025.11128517. URL https://ieeexplore.ieee.org/document/11128517/

[20] Y. Kim, H. Oh, J. Lee, J. Choi, G. Ji, M. Jung, D. Youm, and J. Hwangbo. Not Only Rewards but Also Constraints: Applications on Legged Robot Locomotion. IEEE Transactions on Robotics, 40:2984–3003, 2024. ISSN 1552-3098, 1941-0468. doi:10.1109/TRO.2024.3400935. URL https://ieeexplore.ieee.org/document/10530429/

[21] L. Yang, B. Werner, M. d. Sa, and A. D. Ames. CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions, Oct. 2025. URL http://arxiv.org/abs/2510.14959. arXiv:2510.14959 [cs]

[22] G. Ji, J. Mun, H. Kim, and J. Hwangbo. Concurrent Training of a Control Policy and a State Estimator for Dynamic and Robust Legged Locomotion. IEEE Robotics and Automation Letters, 7(2):4630–4637, Apr. 2022. ISSN 2377-3766, 2377-3774. doi:10.1109/LRA.2022.3151396. URL http://arxiv.org/abs/2202.05481. arXiv:2202.05481 [cs]

[23] N. Rudin, D. Hoeller, P. Reist, and M. Hutter. Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning. In Proceedings of the 5th Conference on Robot Learning, pages 91–100. PMLR, Jan. 2022. URL https://proceedings.mlr.press/v164/rudin22a.html. ISSN: 2640-3498

[24] J. Siekmann, Y. Godse, A. Fern, and J. Hurst. Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 7309–7315, Xi’an, China, May 2021. IEEE. ISBN 978-1-7281-9077-8. doi:10.1109/ICRA48506.2021.9561814. URL https://ieeexplore.ieee.org/document/9561814/

[25] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning Quadrupedal Locomotion over Challenging Terrain. Science Robotics, 5(47):eabc5986, Oct. 2020. ISSN 2470-9476. doi:10.1126/scirobotics.abc5986. URL http://arxiv.org/abs/2010.11251. arXiv:2010.11251 [cs]

[26] A. Escontrela, X. B. Peng, W. Yu, T. Zhang, A. Iscen, K. Goldberg, and P. Abbeel. Adversarial Motion Priors Make Good Substitutes for Complex Reward Functions. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 25–32, Kyoto, Japan, Oct. 2022. IEEE. ISBN 978-1-6654-7927-1. doi:10.1109/IROS47612.2022.9981973. URL https:...

[27] N. Pollard, J. Hodgins, M. Riley, and C. Atkeson. Adapting human motion for the control of a humanoid robot. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292), volume 2, pages 1390–1397 vol.2, May 2002. doi:10.1109/ROBOT.2002.1014737. URL https://ieeexplore.ieee.org/document/1014737/

[28] D. Grimes, R. Chalodhorn, and R. Rao. Dynamic Imitation in a Humanoid Robot through Nonparametric Probabilistic Inference. In Robotics: Science and Systems II. Robotics: Science and Systems Foundation, Aug. 2006. ISBN 978-0-262-69348-6. doi:10.15607/RSS.2006.II.026. URL http://www.roboticsproceedings.org/rss02/p26.pdf

[29] V. Kurtz and J. W. Burdick. Generative Predictive Control: Flow Matching Policies for Dynamic and Difficult-to-Demonstrate Tasks, May 2025. URL http://arxiv.org/abs/2502.13406. arXiv:2502.13406 [cs]

[30] Z. Chen, M. Ji, X. Cheng, X. Peng, X. B. Peng, and X. Wang. GMT: General Motion Tracking for Humanoid Whole-Body Control, Sept. 2025. URL http://arxiv.org/abs/2506.14770. arXiv:2506.14770 [cs]

[31] Y. Fuchioka, Z. Xie, and M. Van De Panne. OPT-Mimic: Imitation of Optimized Trajectories for Dynamic Quadruped Behaviors. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5092–5098, London, United Kingdom, May 2023. IEEE. ISBN 979-8-3503-2365-8. doi:10.1109/ICRA48891.2023.10160562. URL https://ieeexplore.ieee.org/document/10160562/

[32] A. George, A. Bartsch, and A. B. Farimani. Minimizing Human Assistance: Augmenting a Single Demonstration for Deep Reinforcement Learning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5027–5033, May 2023. doi:10.1109/ICRA48891.2023.10161119. URL https://ieeexplore.ieee.org/document/10161119/

[33] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller. Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards, Oct. 2018. URL http://arxiv.org/abs/1707.08817. arXiv:1707.08817 [cs]

[34] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel. Overcoming Exploration in Reinforcement Learning with Demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299, May 2018. doi:10.1109/ICRA.2018.8463162. URL https://ieeexplore.ieee.org/document/8463162/. ISSN: 2577-087X

[35] H. Zhou, X. Zhang, and V. Tzoumas. Adaptive Legged Locomotion via Online Learning for Model Predictive Control. IEEE Robotics and Automation Letters, 11(2):1778–1785, Feb. 2026. ISSN 2377-3766. doi:10.1109/LRA.2025.3644161. URL https://ieeexplore.ieee.org/document/11299577/

[36] J. Cheng, D. Kang, G. Fadini, G. Shi, and S. Coros. Rambo: RL-Augmented Model-Based Whole-Body Control for Loco-Manipulation. IEEE Robotics and Automation Letters, 10(9):9462–9469, Sept. 2025. ISSN 2377-3766, 2377-3774. doi:10.1109/LRA.2025.3594984. URL https://ieeexplore.ieee.org/document/11106746/

[37] S. H. Jeon, H. J. Lee, S. Hong, and S. Kim. Residual MPC: Blending Reinforcement Learning with GPU-Parallelized Model Predictive Control, Oct. 2025. URL http://arxiv.org/abs/2510.12717. arXiv:2510.12717 [cs]

[38] M. Bogdanovic, M. Khadiv, and L. Righetti. Model-free reinforcement learning for robust locomotion using demonstrations from trajectory optimization. Frontiers in Robotics and AI, 9:854212, Aug. 2022. ISSN 2296-9144. doi:10.3389/frobt.2022.854212. URL https://www.frontiersin.org/articles/10.3389/frobt.2022.854212/full

[39] J. Carius, F. Farshidian, and M. Hutter. MPC-Net: A First Principles Guided Policy Search. IEEE Robotics and Automation Letters, 5(2):2897–2904, Apr. 2020. ISSN 2377-3766, 2377-3774. doi:10.1109/LRA.2020.2974653. URL http://arxiv.org/abs/1909.05197. arXiv:1909.05197 [cs]

[40] S. Levine and V. Koltun. Guided policy search. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1–9, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/levine13.html

[41] B. Amos, I. D. J. Rodriguez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable MPC for End-to-end Planning and Control, Oct. 2019. URL http://arxiv.org/abs/1810.13400. arXiv:1810.13400 [cs]

[42] E. Adabag, M. Greiff, J. Subosits, and T. Lew. Differentiable Model Predictive Control on the GPU, Oct. 2025. URL http://arxiv.org/abs/2510.06179. arXiv:2510.06179 [math]

[43] R. Reiter, J. Hoffmann, D. Reinhardt, F. Messerer, K. Baumgärtner, S. Sawant, J. Bödecker, M. Diehl, and S. Gros. Synthesis of model predictive control and reinforcement learning: Survey and classification. Annual Reviews in Control, 61:101045, 2026. ISSN 13675788. doi:10.1016/j.arcontrol.2026.101045. URL https://linkinghub.elsevier.com/retrieve/pii/S...

[44] T. Howell, N. Gileadi, S. Tunyasuvunakool, K. Zakka, T. Erez, and Y. Tassa. Predictive Sampling: Real-time Behaviour Synthesis with MuJoCo. Dec 2022. doi:10.48550/arXiv.2212.00541. URL https://arxiv.org/abs/2212.00541

[45] L. Amatucci, J. Sousa-Pinto, G. Turrisi, D. Orban, V. Barasuol, and C. Semini. Primal-Dual iLQR for GPU-Accelerated Learning and Control in Legged Robots. IEEE Robotics and Automation Letters, 11(1):1010–1017, Jan. 2026. ISSN 2377-3766, 2377-3774. doi:10.1109/LRA.2025.3632610. URL https://ieeexplore.ieee.org/document/11248841/

[46] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870. PMLR, 2018. URL https://proceedings. ...

[47] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596. PMLR, 2018. URL https://proceedings.mlr.press/v80/fujimoto18a.html

[48] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller. Deepmind control suite, 2018. URL https://arxiv.org/abs/1801.00690

[49] Unitree Robotics. Unitree Go2. https://www.unitree.com/go2/. Product page. Accessed: 2026-05-17

[50] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012. doi:10.1109/IROS.2012.6386109

[52] Unitree Robotics. Unitree RL mjlab. https://github.com/unitreerobotics/unitree_rl_mjlab, 2026. GitHub repository. Accessed: 2026-05-15

[53] G. Kim, D. Kang, J.-H. Kim, S. Hong, and H.-W. Park. Contact-implicit model predictive control: Controlling diverse quadruped motions without pre-planned contact modes or trajectories. The International Journal of Robotics Research, 44(3):486–510, Mar. 2025. doi:10.1177/02783649241273645. URL https://doi.org/10.1177/02783649241273645

[54] A. Du, E. Adabag, G. Bravo Palacios, and B. Plancher. Gato: Gpu-accelerated and batched trajectory optimization for scalable edge model predictive control. In 2026 IEEE International Conference on Robotics and Automation (ICRA), 2026

[55] K. Nguyen, S. Schoedel, A. Alavilli, B. Plancher, and Z. Manchester. Tinympc: Model-predictive control on resource-constrained microcontrollers. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. doi:10.1109/ICRA57147.2024.10610987. URL https://doi.org/10.1109/ICRA57147.2024.10610987

[56] F. Schramm, P. Fabre, N. Perrin-Gilbert, and J. Carpentier. Reference-Free Sampling-Based Model Predictive Control, Nov. 2025. URL http://arxiv.org/abs/2511.19204. arXiv:2511.19204 [cs]

[57] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html

A Problem Statement

This section introduces the two main theoretical components of MPC-Injection, the MPC control...