Pith · machine review for the scientific record

arXiv: 2605.01978 · v2 · submitted 2026-05-03 · 📡 eess.SY · cs.RO · cs.SY

Recognition: 3 theorem links · Lean Theorem

Stability of Control Lyapunov Function Guided Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:30 UTC · model grok-4.3

classification 📡 eess.SY · cs.RO · cs.SY
keywords reinforcement learning · control Lyapunov function · exponential stability · reward shaping · humanoid robot · optimal control · locomotion

The pith

Reinforcement learning policies shaped by control Lyapunov functions achieve exponential stability in continuous and discrete time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that shaping reinforcement learning rewards with a control Lyapunov function produces optimal policies that exponentially stabilize the underlying dynamical system. This result is shown for both the core CLF reward components and the additional terms used in real implementations, holding in continuous-time and discrete-time settings. Numerical checks on simple systems like the double integrator and cart-pole match the derived bounds. The approach is then applied to generate stable walking on a humanoid robot. If correct, this provides theoretical backing for why CLF-guided RL succeeds in practice on complex robots.

Core claim

Viewing the RL problem as an optimal control problem, the authors prove that the optimal controller obtained from a CLF-shaped reward renders the closed-loop dynamics exponentially stable. The proof covers both continuous and discrete time and extends to the full reward functions employed in practice, which include extra terms beyond the basic CLF components.
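
For orientation, the shape of the statement can be sketched in generic continuous-time CLF notation; the symbols c_1, c_2, λ below are standard comparison constants and a decrease rate, not necessarily the paper's.

```latex
% Generic CLF conditions (standard notation, not necessarily the paper's):
c_1 \|x\|^2 \le V(x) \le c_2 \|x\|^2,
\qquad
\inf_{u}\, \nabla V(x)^{\top} f(x,u) \le -\lambda V(x).
% If the optimal policy realizes this decrease along closed-loop trajectories,
% the comparison lemma gives exponential stability of the origin:
V\bigl(x(t)\bigr) \le V\bigl(x(0)\bigr)\, e^{-\lambda t}
\;\Longrightarrow\;
\|x(t)\| \le \sqrt{c_2/c_1}\, \|x(0)\|\, e^{-\lambda t / 2}.
```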

What carries the argument

The control Lyapunov function (CLF) used to synthesize the reinforcement learning reward function, which encodes a stability condition into the optimization objective solved by RL.
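
A minimal sketch of what such a CLF-shaped reward term can look like in discrete time is below; the quadratic V, the decrease rate alpha, and the penalty form are illustrative assumptions, and the paper's actual reward includes further practical terms beyond this core component.

```python
import numpy as np

# Sketch of a core CLF-shaped reward term for discrete-time RL.
# P, alpha, and the penalty form are illustrative stand-ins, not the
# paper's exact construction (which adds further practical terms).

P = np.diag([2.0, 1.0])   # positive-definite matrix defining V(x) = x' P x
alpha = 0.1               # desired fractional decrease of V per step

def V(x):
    """Candidate control Lyapunov function."""
    return float(x @ P @ x)

def clf_reward(x, x_next):
    """Reward the CLF decrease condition V(x_next) <= (1 - alpha) * V(x);
    any violation of the condition is penalized."""
    violation = V(x_next) - (1.0 - alpha) * V(x)
    return -max(violation, 0.0)
```

In an environment step this term would simply be added to whatever task rewards an implementation already uses, e.g. reward = clf_reward(x, x_next) + task_reward.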

Load-bearing premise

A valid control Lyapunov function must exist for the system, and the reinforcement learning procedure must converge to the optimal policy induced by the shaped reward.

What would settle it

A counterexample would be a dynamical system possessing a known CLF where the policy learned from the CLF-shaped reward fails to produce exponential convergence to the equilibrium.
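
Such a check can be run numerically. The sketch below certifies a per-step contraction factor for a candidate quadratic CLF under a fixed linear policy on the double integrator and then tests a rollout against the implied exponential envelope; the gain K and matrix P are placeholders, and in practice the policy under scrutiny would be the learned RL controller.

```python
import numpy as np

# Certify a per-step contraction factor rho for V(x) = x' P x under the
# closed loop, then check that a rollout stays inside V0 * rho**k.
# A, B: discrete-time double integrator; K, P: illustrative placeholders.

dt = 0.05
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K = np.array([[1.0, 1.5]])                 # placeholder stabilizing feedback
P = np.array([[2.0, 0.5], [0.5, 1.0]])     # candidate CLF V(x) = x' P x

A_cl = A - B @ K
M = A_cl.T @ P @ A_cl
# smallest rho with A_cl' P A_cl <= rho * P (max generalized eigenvalue)
rho = float(np.max(np.linalg.eigvals(np.linalg.solve(P, M)).real))
print(f"certified per-step contraction: V(x+) <= {rho:.4f} * V(x)")

def V(x):
    return float(x @ P @ x)

x = np.array([1.0, 0.0])
V0 = V(x)
for k in range(1, 201):
    x = A_cl @ x
    assert V(x) <= rho**k * V0 + 1e-9, f"exponential envelope violated at step {k}"
print("rollout stayed inside the exponential envelope for 200 steps")
```

A genuine counterexample would be a system and learned policy for which no contraction factor rho < 1 holds, i.e. rollouts escape every exponential envelope despite the CLF-shaped training reward.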

Figures

Figures reproduced from arXiv: 2605.01978 by Aaron D. Ames, William D. Compton, Zachary Olkin.

Figure 1. The main ideas behind CLF guided RL. A CLF is designed offline … (view at source ↗)
Figure 2. Theoretical bounds plotted against the numerical solution to the optimal control problem for the continuous time double integrator with the … (view at source ↗)
Figure 3. The optimal control of the continuous time cart-pole system using the CLF-RL costs plotted with the theoretical bounds. The cart-pole system … (view at source ↗)
Figure 4. Theoretical bounds and the numerical solution to the discrete time double integrator system. The optimal control problem was solved numerically … (view at source ↗)
Figure 5. Results of the double integrator control policy trained with RL. (view at source ↗)
Figure 6. CLF-RL applied to a humanoid robot. Rather than tracking a single … (view at source ↗)
read the original abstract

Reinforcement learning (RL) has become the de facto method for achieving locomotion on humanoid robots in practice, yet stability analysis of the corresponding control policies is lacking. Recent work has attempted to merge control theoretic ideas with reinforcement learning through control guided learning. A notable example of this is the use of a control Lyapunov function (CLF) to synthesize the reinforcement learning rewards, a technique known as CLF-RL, which has shown practical success. This paper investigates the stability properties of optimal controllers using CLF-RL with the goal of bridging experimentally observed stability with theoretical guarantees. The RL problem is viewed as an optimal control problem and exponential stability is proven in both continuous and discrete time using both core CLF reward terms and the additional terms used in practice. The theoretical bounds are numerically verified on systems such as the double integrator and cart-pole. Finally, the CLF guided rewards are implemented for a walking humanoid robot to generate stable periodic orbits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper views CLF-RL as an optimal control problem whose reward is shaped by a control Lyapunov function. It claims to prove exponential stability of the resulting optimal policy in both continuous and discrete time, for both the core CLF reward terms and the additional practical terms. The theoretical bounds are numerically verified on the double integrator and cart-pole; the same reward construction is then implemented on a humanoid robot to produce stable periodic walking orbits.

Significance. If the stability proofs are correct, the work supplies a useful bridge between control-theoretic guarantees and the CLF-shaped rewards already used in practice for locomotion. The explicit reduction to optimal control and the use of standard CLF decrease conditions to obtain exponential bounds are strengths, as is the numerical check on standard benchmark systems. The practical humanoid demonstration shows the method is implementable, but the overall significance is tempered by the lack of any suboptimality or robustness margin for the approximate policies that RL actually produces.
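
For context, the kind of margin being asked for can be sketched in discrete time: if one assumes the learned policy violates the optimal decrease condition by at most some ε, a standard ISS-style argument yields practical (rather than exact) exponential stability. This is generic Lyapunov reasoning, not a result from the paper, and ε, λ are assumed quantities.

```latex
% Assumption (not from the paper): the learned policy satisfies the decrease
% condition only up to an approximation error \varepsilon \ge 0,
V(x_{k+1}) \le (1 - \lambda)\, V(x_k) + \varepsilon, \qquad 0 < \lambda < 1.
% Iterating and bounding the geometric series gives
V(x_k) \le (1 - \lambda)^{k}\, V(x_0) + \frac{\varepsilon}{\lambda},
% i.e. exponential convergence to a residual set whose size scales with \varepsilon.
```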

major comments (2)
  1. [stability proofs (continuous and discrete time)] The exponential-stability claims (both continuous- and discrete-time) are derived under the assumption that the learned policy exactly equals the optimal policy induced by the CLF-shaped reward. No suboptimality bound, robustness margin, or Lyapunov decrease condition under policy approximation error is provided. This is load-bearing for the central claim that CLF-RL itself is stable, because RL algorithms converge only approximately.
  2. [humanoid robot implementation] Humanoid walking example: only implementation and visual inspection of periodic orbits are reported. No verification is given that the learned policy remains inside the region of attraction or satisfies the Lyapunov decrease condition derived for the exact optimum.
minor comments (2)
  1. The assumptions required for the stability theorems (existence of a valid CLF, controllability, etc.) should be collected in a single, explicit list or theorem statement rather than scattered through the text.
  2. Notation: the value function induced by the CLF reward is easily confused with the standard RL value function; a short clarifying remark or distinct symbol would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, agreeing to revisions that clarify the scope of the results without altering the core claims.

read point-by-point responses
  1. Referee: The exponential-stability claims (both continuous- and discrete-time) are derived under the assumption that the learned policy exactly equals the optimal policy induced by the CLF-shaped reward. No suboptimality bound, robustness margin, or Lyapunov decrease condition under policy approximation error is provided. This is load-bearing for the central claim that CLF-RL itself is stable, because RL algorithms converge only approximately.

    Authors: The theoretical results establish exponential stability for the exact optimal policy induced by the CLF-shaped reward, with this assumption stated explicitly in the problem formulation and theorem statements. We do not derive suboptimality or robustness margins for approximate policies, as these would require additional assumptions on the RL algorithm and are beyond the paper's scope. We will revise the abstract, introduction, and conclusion to emphasize that the guarantees apply strictly to the optimal policy and to note the distinction from practical RL approximations. revision: yes

  2. Referee: Humanoid walking example: only implementation and visual inspection of periodic orbits are reported. No verification is given that the learned policy remains inside the region of attraction or satisfies the Lyapunov decrease condition derived for the exact optimum.

    Authors: The humanoid demonstration illustrates the practical use of the CLF-guided reward to produce stable periodic walking via visual confirmation of orbits. Because the exact optimal policy cannot be computed for this high-dimensional system, direct verification of the Lyapunov decrease condition or region of attraction on the hardware is not possible. We will revise the relevant section to clarify that the example is an empirical illustration of the reward design rather than a formal verification of the theoretical bounds. revision: yes

Circularity Check

0 steps flagged

No circularity: stability derived from standard CLF and optimal-control theorems

full rationale

The derivation views the CLF-shaped RL problem as an optimal control problem and invokes standard Lyapunov decrease conditions plus optimality to prove exponential stability for both core and practical reward terms in continuous and discrete time. This rests on external definitions of CLFs and value functions rather than any fitted quantity, self-defined term, or self-citation chain that reduces the claim to its own inputs. Benchmark simulations and the humanoid implementation serve as external checks. No load-bearing step matches the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of a CLF for the plant and on convergence of RL to the optimal policy; no new free parameters or invented entities are introduced.

axioms (2)
  • domain assumption A control Lyapunov function exists for the dynamical system under consideration
    Invoked to construct the reward that guides the RL policy toward stability.
  • domain assumption The reinforcement learning algorithm converges to the optimal policy defined by the CLF-shaped reward
    Required to equate the learned controller with the optimal controller whose stability is proven.

pith-pipeline@v0.9.0 · 5464 in / 1300 out tokens · 58724 ms · 2026-05-08T19:30:29.965944+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    The 3D linear inverted pendulum mode: a simple modeling for a biped walking pattern generation,

    S. Kajita, F. Kanehiro, K. Kaneko, K. Yokoi, and H. Hirukawa, “The 3D linear inverted pendulum mode: a simple modeling for a biped walking pattern generation,” in Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1. Maui, HI, USA: IEEE, 2001, pp. 239–246

  2. [2]

    Capture Point: A Step toward Humanoid Push Recovery,

    J. Pratt, J. Carff, S. Drakunov, and A. Goswami, “Capture Point: A Step toward Humanoid Push Recovery,” in 2006 6th IEEE-RAS International Conference on Humanoid Robots. University of Genova, Genova, Italy: IEEE, Dec. 2006, pp. 200–207

  3. [3]

    Optimization-based control for dynamic legged robots,

    P. M. Wensing, M. Posa, Y. Hu, A. Escande, N. Mansard, and A. D. Prete, “Optimization-based control for dynamic legged robots,” IEEE Transactions on Robotics, vol. 40, pp. 43–63, 2024

  4. [4]

    Rapidly Exponentially Stabilizing Control Lyapunov Functions and Hybrid Zero Dynamics,

    A. D. Ames, K. Galloway, K. Sreenath, and J. W. Grizzle, “Rapidly Exponentially Stabilizing Control Lyapunov Functions and Hybrid Zero Dynamics,” IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 876–891, Apr. 2014

  5. [5]

    Hybrid zero dynamics of planar biped walkers,

    E. Westervelt, J. Grizzle, and D. Koditschek, “Hybrid zero dynamics of planar biped walkers,” IEEE Transactions on Automatic Control, vol. 48, no. 1, pp. 42–56, Jan. 2003

  6. [6]

    E. R. Westervelt, J. W. Grizzle, C. Chevallereau, J. H. Choi, and B. Morris, Feedback Control of Dynamic Bipedal Robot Locomotion, 1st ed. CRC Press, Oct. 2018

  7. [7]

    Evolution of humanoid locomotion control,

    Y. Gu, G. Shi, F. Shi, I.-C. Chang, Y.-J. Wang, Q. Cheng, Z. Olkin, I. Lopez-Sanchez, Y. Feng, J. Zhang, A. Ames, H. Su, and K. Sreenath, “Evolution of humanoid locomotion control,” 2025, under review. [Online]. Available: https://www.thetracelab.com/uploads/1/1/3/0/113094493/evolution of humanoid locomotion control 1203.pdf

  8. [8]

    CLF-RL: Control Lyapunov Function Guided Reinforcement Learning,

    K. Li, Z. Olkin, Y. Yue, and A. D. Ames, “CLF-RL: Control Lyapunov Function Guided Reinforcement Learning,” IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 3230–3237, Mar. 2026

  9. [9]

    Chasing Stability: Humanoid Running via Control Lyapunov Function Guided Reinforcement Learning,

    Z. Olkin, K. Li, W. D. Compton, and A. D. Ames, “Chasing Stability: Humanoid Running via Control Lyapunov Function Guided Reinforcement Learning,” Sep. 2025

  10. [10]

    A ‘universal’ construction of Artstein’s theorem on nonlinear stabilization,

    E. D. Sontag, “A ‘universal’ construction of Artstein’s theorem on nonlinear stabilization,” Systems & Control Letters, vol. 13, no. 2, pp. 117–123, 1989

  11. [11]

    Torque Saturation in Bipedal Robotic Walking Through Control Lyapunov Function-Based Quadratic Programs,

    K. Galloway, K. Sreenath, A. D. Ames, and J. W. Grizzle, “Torque Saturation in Bipedal Robotic Walking Through Control Lyapunov Function-Based Quadratic Programs,” IEEE Access, vol. 3, 2015

  12. [12]

    ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control,

    J. P. Sleiman, H. Li, A. Adu-Bredu, R. Deits, A. Kumar et al., “ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control,” Jan. 2026

  13. [13]

    BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion,

    Q. Liao, T. E. Truong, X. Huang, G. Tevet, K. Sreenath, and C. K. Liu, “BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion,” Aug. 2025

  14. [14]

    Chasing autonomy: Dynamic retargeting and control guided rl for performant and controllable humanoid running,

    Z. Olkin, W. D. Compton, R. M. Bena, and A. D. Ames, “Chasing autonomy: Dynamic retargeting and control guided rl for performant and controllable humanoid running,” 2026. [Online]. Available: https://arxiv.org/abs/2603.25902

  15. [15]

    Principled reward shaping for reinforcement learning via Lyapunov stability theory,

    Y. Dong, X. Tang, and Y. Yuan, “Principled reward shaping for reinforcement learning via Lyapunov stability theory,” Neurocomputing, vol. 393, pp. 83–90, Jun. 2020

  16. [16]

    Lyapunov Design for Robust and Efficient Robotic Reinforcement Learning,

    T. Westenbroek, F. Castaneda, A. Agrawal, S. Sastry, and K. Sreenath, “Lyapunov Design for Robust and Efficient Robotic Reinforcement Learning,” Nov. 2022

  17. [17]

    Nonlinear Optimal Control: A Receding Horizon Approach,

    J. A. Primbs, “Nonlinear Optimal Control: A Receding Horizon Approach,” Ph.D. dissertation, California Institute of Technology, 1999

  18. [18]

    Semi-Lagrangian Approximation Schemes for Linear and Hamilton–Jacobi Equations

    M. Falcone and R. Ferretti, Semi-Lagrangian Approximation Schemes for Linear and Hamilton–Jacobi Equations. SIAM, 2014

  19. [19]

    Fast Direct Multiple Shooting Algorithms for Optimal Robot Control,

    M. Diehl, H. Bock, H. Diedam, and P.-B. Wieber, “Fast Direct Multiple Shooting Algorithms for Optimal Robot Control,” in Fast Motions in Biomechanics and Robotics, M. Diehl and K. Mombaur, Eds. Springer Berlin Heidelberg, 2006, vol. 340, pp. 65–93

  20. [20]

    Reinforcement learning,

    R. S. Sutton, A. G. Barto et al., “Reinforcement learning,” Journal of Cognitive Neuroscience, vol. 11, no. 1, pp. 126–134, 1999

  21. [21]

    Relaxed ISS small-gain theorems for discrete-time systems,

    R. Geiselhart and F. R. Wirth, “Relaxed ISS small-gain theorems for discrete-time systems,” SIAM Journal on Control and Optimization, vol. 54, no. 2, pp. 423–449, 2016

  22. [22]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    M. Mittal, P. Roth, J. Tigue, A. Richard et al., “Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning,” arXiv preprint arXiv:2511.04831, 2025

  23. [23]

    Learning to walk in minutes using massively parallel deep reinforcement learning,

    N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Conference on robot learning. PMLR, 2022, pp. 91–100