Pith · machine review for the scientific record

arXiv: 2605.01978 · v2 · submitted 2026-05-03 · 📡 eess.SY · cs.RO · cs.SY

Recognition: 3 theorem links · Lean Theorem

Stability of Control Lyapunov Function Guided Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:30 UTC · model grok-4.3

classification 📡 eess.SY · cs.RO · cs.SY
keywords reinforcement learning · control Lyapunov function · exponential stability · reward shaping · humanoid robot · optimal control · locomotion

The pith

Reinforcement learning policies shaped by control Lyapunov functions achieve exponential stability in continuous and discrete time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that shaping reinforcement learning rewards with a control Lyapunov function produces optimal policies that exponentially stabilize the underlying dynamical system. This result is shown for both the core CLF reward components and the additional terms used in real implementations, holding in continuous-time and discrete-time settings. Numerical checks on simple systems like the double integrator and cart-pole match the derived bounds. The approach is then applied to generate stable walking on a humanoid robot. If correct, this provides theoretical backing for why CLF-guided RL succeeds in practice on complex robots.

Core claim

Viewing the RL problem as an optimal control problem, the authors prove that the optimal controller obtained from a CLF-shaped reward renders the closed-loop dynamics exponentially stable. The proof covers both continuous and discrete time and extends to the full reward functions employed in practice, which include extra terms beyond the basic CLF components.
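
For orientation, the shape of the statement can be sketched in generic continuous-time CLF notation; the symbols c_1, c_2, λ below are standard comparison constants and a decrease rate, not necessarily the paper's.

```latex
% Generic CLF conditions (standard notation, not necessarily the paper's):
c_1 \|x\|^2 \le V(x) \le c_2 \|x\|^2,
\qquad
\inf_{u}\, \nabla V(x)^{\top} f(x,u) \le -\lambda V(x).
% If the optimal policy realizes this decrease along closed-loop trajectories,
% the comparison lemma gives exponential stability of the origin:
V\bigl(x(t)\bigr) \le V\bigl(x(0)\bigr)\, e^{-\lambda t}
\;\Longrightarrow\;
\|x(t)\| \le \sqrt{c_2/c_1}\, \|x(0)\|\, e^{-\lambda t / 2}.
```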

What carries the argument

The control Lyapunov function (CLF) used to synthesize the reinforcement learning reward function, which encodes a stability condition into the optimization objective solved by RL.
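
A minimal sketch of what such a CLF-shaped reward term can look like in discrete time is below; the quadratic V, the decrease rate alpha, and the penalty form are illustrative assumptions, and the paper's actual reward includes further practical terms beyond this core component.

```python
import numpy as np

# Sketch of a core CLF-shaped reward term for discrete-time RL.
# P, alpha, and the penalty form are illustrative stand-ins, not the
# paper's exact construction (which adds further practical terms).

P = np.diag([2.0, 1.0])   # positive-definite matrix defining V(x) = x' P x
alpha = 0.1               # desired fractional decrease of V per step

def V(x):
    """Candidate control Lyapunov function."""
    return float(x @ P @ x)

def clf_reward(x, x_next):
    """Reward the CLF decrease condition V(x_next) <= (1 - alpha) * V(x);
    any violation of the condition is penalized."""
    violation = V(x_next) - (1.0 - alpha) * V(x)
    return -max(violation, 0.0)
```

In an environment step this term would simply be added to whatever task rewards an implementation already uses, e.g. reward = clf_reward(x, x_next) + task_reward.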

Load-bearing premise

A valid control Lyapunov function must exist for the system, and the reinforcement learning procedure must converge to the optimal policy induced by the shaped reward.

What would settle it

A counterexample would be a dynamical system possessing a known CLF where the policy learned from the CLF-shaped reward fails to produce exponential convergence to the equilibrium.
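
Such a check can be run numerically. The sketch below certifies a per-step contraction factor for a candidate quadratic CLF under a fixed linear policy on the double integrator and then tests a rollout against the implied exponential envelope; the gain K and matrix P are placeholders, and in practice the policy under scrutiny would be the learned RL controller.

```python
import numpy as np

# Certify a per-step contraction factor rho for V(x) = x' P x under the
# closed loop, then check that a rollout stays inside V0 * rho**k.
# A, B: discrete-time double integrator; K, P: illustrative placeholders.

dt = 0.05
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K = np.array([[1.0, 1.5]])                 # placeholder stabilizing feedback
P = np.array([[2.0, 0.5], [0.5, 1.0]])     # candidate CLF V(x) = x' P x

A_cl = A - B @ K
M = A_cl.T @ P @ A_cl
# smallest rho with A_cl' P A_cl <= rho * P (max generalized eigenvalue)
rho = float(np.max(np.linalg.eigvals(np.linalg.solve(P, M)).real))
print(f"certified per-step contraction: V(x+) <= {rho:.4f} * V(x)")

def V(x):
    return float(x @ P @ x)

x = np.array([1.0, 0.0])
V0 = V(x)
for k in range(1, 201):
    x = A_cl @ x
    assert V(x) <= rho**k * V0 + 1e-9, f"exponential envelope violated at step {k}"
print("rollout stayed inside the exponential envelope for 200 steps")
```

A genuine counterexample would be a system and learned policy for which no contraction factor rho < 1 holds, i.e. rollouts escape every exponential envelope despite the CLF-shaped training reward.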

Figures

Figures reproduced from arXiv: 2605.01978 by Aaron D. Ames, William D. Compton, Zachary Olkin.

Figure 1. The main ideas behind CLF guided RL. A CLF is designed offline … (view at source ↗)
Figure 2. Theoretical bounds plotted against the numerical solution to the optimal control problem for the continuous time double integrator with the … (view at source ↗)
Figure 3. The optimal control of the continuous time cart-pole system using the CLF-RL costs plotted with the theoretical bounds. The cart-pole system … (view at source ↗)
Figure 4. Theoretical bounds and the numerical solution to the discrete time double integrator system. The optimal control problem was solved numerically … (view at source ↗)
Figure 5. Results of the double integrator control policy trained with RL. (view at source ↗)
Figure 6. CLF-RL applied to a humanoid robot. Rather than tracking a single … (view at source ↗)
read the original abstract

Reinforcement learning (RL) has become the de facto method for achieving locomotion on humanoid robots in practice, yet stability analysis of the corresponding control policies is lacking. Recent work has attempted to merge control theoretic ideas with reinforcement learning through control guided learning. A notable example of this is the use of a control Lyapunov function (CLF) to synthesize the reinforcement learning rewards, a technique known as CLF-RL, which has shown practical success. This paper investigates the stability properties of optimal controllers using CLF-RL with the goal of bridging experimentally observed stability with theoretical guarantees. The RL problem is viewed as an optimal control problem and exponential stability is proven in both continuous and discrete time using both core CLF reward terms and the additional terms used in practice. The theoretical bounds are numerically verified on systems such as the double integrator and cart-pole. Finally, the CLF guided rewards are implemented for a walking humanoid robot to generate stable periodic orbits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper views CLF-RL as an optimal control problem whose reward is shaped by a control Lyapunov function. It claims to prove exponential stability of the resulting optimal policy in both continuous and discrete time, for both the core CLF reward terms and the additional practical terms. The theoretical bounds are numerically verified on the double integrator and cart-pole; the same reward construction is then implemented on a humanoid robot to produce stable periodic walking orbits.

Significance. If the stability proofs are correct, the work supplies a useful bridge between control-theoretic guarantees and the CLF-shaped rewards already used in practice for locomotion. The explicit reduction to optimal control and the use of standard CLF decrease conditions to obtain exponential bounds are strengths, as is the numerical check on standard benchmark systems. The practical humanoid demonstration shows the method is implementable, but the overall significance is tempered by the lack of any suboptimality or robustness margin for the approximate policies that RL actually produces.
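
For context, the kind of margin being asked for can be sketched in discrete time: if one assumes the learned policy violates the optimal decrease condition by at most some ε, a standard ISS-style argument yields practical (rather than exact) exponential stability. This is generic Lyapunov reasoning, not a result from the paper, and ε, λ are assumed quantities.

```latex
% Assumption (not from the paper): the learned policy satisfies the decrease
% condition only up to an approximation error \varepsilon \ge 0,
V(x_{k+1}) \le (1 - \lambda)\, V(x_k) + \varepsilon, \qquad 0 < \lambda < 1.
% Iterating and bounding the geometric series gives
V(x_k) \le (1 - \lambda)^{k}\, V(x_0) + \frac{\varepsilon}{\lambda},
% i.e. exponential convergence to a residual set whose size scales with \varepsilon.
```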

major comments (2)
  1. [stability proofs (continuous and discrete time)] The exponential-stability claims (both continuous- and discrete-time) are derived under the assumption that the learned policy exactly equals the optimal policy induced by the CLF-shaped reward. No suboptimality bound, robustness margin, or Lyapunov decrease condition under policy approximation error is provided. This is load-bearing for the central claim that CLF-RL itself is stable, because RL algorithms converge only approximately.
  2. [humanoid robot implementation] Humanoid walking example: only implementation and visual inspection of periodic orbits are reported. No verification is given that the learned policy remains inside the region of attraction or satisfies the Lyapunov decrease condition derived for the exact optimum.
minor comments (2)
  1. The assumptions required for the stability theorems (existence of a valid CLF, controllability, etc.) should be collected in a single, explicit list or theorem statement rather than scattered through the text.
  2. Notation: the value function induced by the CLF reward is easily confused with the standard RL value function; a short clarifying remark or distinct symbol would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, agreeing to revisions that clarify the scope of the results without altering the core claims.

read point-by-point responses
  1. Referee: The exponential-stability claims (both continuous- and discrete-time) are derived under the assumption that the learned policy exactly equals the optimal policy induced by the CLF-shaped reward. No suboptimality bound, robustness margin, or Lyapunov decrease condition under policy approximation error is provided. This is load-bearing for the central claim that CLF-RL itself is stable, because RL algorithms converge only approximately.

    Authors: The theoretical results establish exponential stability for the exact optimal policy induced by the CLF-shaped reward, with this assumption stated explicitly in the problem formulation and theorem statements. We do not derive suboptimality or robustness margins for approximate policies, as these would require additional assumptions on the RL algorithm and are beyond the paper's scope. We will revise the abstract, introduction, and conclusion to emphasize that the guarantees apply strictly to the optimal policy and to note the distinction from practical RL approximations. revision: yes

  2. Referee: Humanoid walking example: only implementation and visual inspection of periodic orbits are reported. No verification is given that the learned policy remains inside the region of attraction or satisfies the Lyapunov decrease condition derived for the exact optimum.

    Authors: The humanoid demonstration illustrates the practical use of the CLF-guided reward to produce stable periodic walking via visual confirmation of orbits. Because the exact optimal policy cannot be computed for this high-dimensional system, direct verification of the Lyapunov decrease condition or region of attraction on the hardware is not possible. We will revise the relevant section to clarify that the example is an empirical illustration of the reward design rather than a formal verification of the theoretical bounds. revision: yes

Circularity Check

0 steps flagged

No circularity: stability derived from standard CLF and optimal-control theorems

full rationale

The derivation views the CLF-shaped RL problem as an optimal control problem and invokes standard Lyapunov decrease conditions plus optimality to prove exponential stability for both core and practical reward terms in continuous and discrete time. This rests on external definitions of CLFs and value functions rather than any fitted quantity, self-defined term, or self-citation chain that reduces the claim to its own inputs. Benchmark simulations and the humanoid implementation serve as external checks. No load-bearing step matches the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of a CLF for the plant and on convergence of RL to the optimal policy; no new free parameters or invented entities are introduced.

axioms (2)
  • domain assumption A control Lyapunov function exists for the dynamical system under consideration
    Invoked to construct the reward that guides the RL policy toward stability.
  • domain assumption The reinforcement learning algorithm converges to the optimal policy defined by the CLF-shaped reward
    Required to equate the learned controller with the optimal controller whose stability is proven.

pith-pipeline@v0.9.0 · 5464 in / 1300 out tokens · 58724 ms · 2026-05-08T19:30:29.965944+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    The 3D linear inverted pendulum mode: a simple modeling for a biped walking pattern generation,

    S. Kajita, F. Kanehiro, K. Kaneko, K. Yokoi, and H. Hirukawa, “The 3D linear inverted pendulum mode: a simple modeling for a biped walking pattern generation,” in Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1. Maui, HI, USA: IEEE, 2001, pp. 239–246

  2. [2]

    Capture Point: A Step toward Humanoid Push Recovery,

    J. Pratt, J. Carff, S. Drakunov, and A. Goswami, “Capture Point: A Step toward Humanoid Push Recovery,” in 2006 6th IEEE-RAS International Conference on Humanoid Robots. University of Genova, Genova, Italy: IEEE, Dec. 2006, pp. 200–207

  3. [3]

    Optimization-based control for dynamic legged robots,

    P. M. Wensing, M. Posa, Y. Hu, A. Escande, N. Mansard, and A. D. Prete, “Optimization-based control for dynamic legged robots,” IEEE Transactions on Robotics, vol. 40, pp. 43–63, 2024

  4. [4]

    Rapidly Exponentially Stabilizing Control Lyapunov Functions and Hybrid Zero Dynamics,

    A. D. Ames, K. Galloway, K. Sreenath, and J. W. Grizzle, “Rapidly Exponentially Stabilizing Control Lyapunov Functions and Hybrid Zero Dynamics,” IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 876–891, Apr. 2014

  5. [5]

    Hybrid zero dynamics of planar biped walkers,

    E. Westervelt, J. Grizzle, and D. Koditschek, “Hybrid zero dynamics of planar biped walkers,” IEEE Transactions on Automatic Control, vol. 48, no. 1, pp. 42–56, Jan. 2003

  6. [6]

    E. R. Westervelt, J. W. Grizzle, C. Chevallereau, J. H. Choi, and B. Morris, Feedback Control of Dynamic Bipedal Robot Locomotion, 1st ed. CRC Press, Oct. 2018

  7. [7]

    Evolution of humanoid locomotion control,

    Y. Gu, G. Shi, F. Shi, I.-C. Chang, Y.-J. Wang, Q. Cheng, Z. Olkin, I. Lopez-Sanchez, Y. Feng, J. Zhang, A. Ames, H. Su, and K. Sreenath, “Evolution of humanoid locomotion control,” 2025, under review. [Online]. Available: https://www.thetracelab.com/uploads/1/1/3/0/113094493/evolution of humanoid locomotion control 1203.pdf

  8. [8]

    CLF-RL: Control Lyapunov Function Guided Reinforcement Learning,

    K. Li, Z. Olkin, Y. Yue, and A. D. Ames, “CLF-RL: Control Lyapunov Function Guided Reinforcement Learning,” IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 3230–3237, Mar. 2026

  9. [9]

    Chasing Stability: Humanoid Running via Control Lyapunov Function Guided Reinforcement Learning,

    Z. Olkin, K. Li, W. D. Compton, and A. D. Ames, “Chasing Stability: Humanoid Running via Control Lyapunov Function Guided Reinforcement Learning,” Sep. 2025

  10. [10]

    A ‘universal’ construction of Artstein’s theorem on nonlinear stabilization,

    E. D. Sontag, “A ‘universal’ construction of Artstein’s theorem on nonlinear stabilization,” Systems & Control Letters, vol. 13, no. 2, pp. 117–123, 1989

  11. [11]

    Torque Saturation in Bipedal Robotic Walking Through Control Lyapunov Function-Based Quadratic Programs,

    K. Galloway, K. Sreenath, A. D. Ames, and J. W. Grizzle, “Torque Saturation in Bipedal Robotic Walking Through Control Lyapunov Function-Based Quadratic Programs,” IEEE Access, vol. 3, 2015

  12. [12]

    ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control,

    J. P. Sleiman, H. Li, A. Adu-Bredu, R. Deits, A. Kumar et al., “ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control,” Jan. 2026

  13. [13]

    BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion,

    Q. Liao, T. E. Truong, X. Huang, G. Tevet, K. Sreenath, and C. K. Liu, “BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion,” Aug. 2025

  14. [14]

    Chasing autonomy: Dynamic retargeting and control guided rl for performant and controllable humanoid running,

    Z. Olkin, W. D. Compton, R. M. Bena, and A. D. Ames, “Chasing autonomy: Dynamic retargeting and control guided rl for performant and controllable humanoid running,” 2026. [Online]. Available: https://arxiv.org/abs/2603.25902

  15. [15]

    Principled reward shaping for reinforcement learning via Lyapunov stability theory,

    Y. Dong, X. Tang, and Y. Yuan, “Principled reward shaping for reinforcement learning via Lyapunov stability theory,” Neurocomputing, vol. 393, pp. 83–90, Jun. 2020

  16. [16]

    Lyapunov Design for Robust and Efficient Robotic Reinforcement Learning,

    T. Westenbroek, F. Castaneda, A. Agrawal, S. Sastry, and K. Sreenath, “Lyapunov Design for Robust and Efficient Robotic Reinforcement Learning,” Nov. 2022

  17. [17]

    Nonlinear Optimal Control: A Receding Horizon Approach,

    J. A. Primbs, “Nonlinear Optimal Control: A Receding Horizon Approach,” Ph.D. dissertation, California Institute of Technology, 1999

  18. [18]

    Semi-Lagrangian Approximation Schemes for Linear and Hamilton–Jacobi Equations

    M. Falcone and R. Ferretti, Semi-Lagrangian Approximation Schemes for Linear and Hamilton–Jacobi Equations. SIAM, 2014

  19. [19]

    Fast Direct Multiple Shooting Algorithms for Optimal Robot Control,

    M. Diehl, H. Bock, H. Diedam, and P.-B. Wieber, “Fast Direct Multiple Shooting Algorithms for Optimal Robot Control,” in Fast Motions in Biomechanics and Robotics, M. Diehl and K. Mombaur, Eds. Springer Berlin Heidelberg, 2006, vol. 340, pp. 65–93

  20. [20]

    Reinforcement learning,

    R. S. Sutton, A. G. Barto et al., “Reinforcement learning,” Journal of Cognitive Neuroscience, vol. 11, no. 1, pp. 126–134, 1999

  21. [21]

    Relaxed ISS small-gain theorems for discrete-time systems,

    R. Geiselhart and F. R. Wirth, “Relaxed ISS small-gain theorems for discrete-time systems,” SIAM Journal on Control and Optimization, vol. 54, no. 2, pp. 423–449, 2016

  22. [22]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    M. Mittal, P. Roth, J. Tigue, A. Richard et al., “Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning,” arXiv preprint arXiv:2511.04831, 2025

  23. [23]

    Learning to walk in minutes using massively parallel deep reinforcement learning,

    N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Conference on robot learning. PMLR, 2022, pp. 91–100