Stability of Control Lyapunov Function Guided Reinforcement Learning
Recognition: 3 theorem links · Lean theorems
Pith reviewed 2026-05-08 19:30 UTC · model grok-4.3
The pith
Optimal policies induced by control Lyapunov function shaped rewards render the closed-loop dynamics exponentially stable in both continuous and discrete time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Viewing the RL problem as an optimal control problem, the authors prove that the optimal controller obtained from a CLF-shaped reward renders the closed-loop dynamics exponentially stable. The proof covers both continuous and discrete time and extends to the full reward functions employed in practice, which include extra terms beyond the basic CLF components.
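Concretely, exponential stability here amounts to bounds of the following standard form; the constants M ≥ 1, a > 0, and γ ∈ (0, 1) are generic placeholders, not values taken from the paper, and are unrelated to the reward weights that appear later:

```latex
% Continuous time: trajectories of the closed loop under the optimal policy satisfy
\|x(t)\| \le M\, e^{-a t}\, \|x(0)\|, \qquad t \ge 0.
% Discrete time: the closed-loop iterates satisfy
\|x_k\| \le M\, \gamma^{k}\, \|x_0\|, \qquad k \ge 0.
```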
What carries the argument
The control Lyapunov function (CLF) used to synthesize the reinforcement learning reward function, which encodes a stability condition into the optimization objective solved by RL.
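A minimal sketch of how such a reward can be constructed, assuming the stage cost ℓ(x, u) = βV(x) + ρ[V̇(x, u) + λV(x)]₊ quoted in the Lean-theorem section below; the weights beta, rho, lam, the control-affine dynamics f, g, and the double-integrator example are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def clf_shaped_reward(x, u, V, gradV, f, g, beta=1.0, rho=10.0, lam=0.5):
    """Illustrative CLF-shaped reward: r(x, u) = -(beta*V(x) + rho*[Vdot(x, u) + lam*V(x)]_+).

    V, gradV : callables returning the CLF value and its gradient at x.
    f, g     : control-affine dynamics, xdot = f(x) + g(x) @ u.
    The positive-part term penalizes violations of the decrease condition
    Vdot(x, u) <= -lam * V(x), so maximizing reward pushes the policy toward it.
    """
    Vx = V(x)
    Vdot = gradV(x) @ (f(x) + g(x) @ u)       # derivative of V along the dynamics
    violation = max(0.0, Vdot + lam * Vx)     # [.]_+ : positive part
    return -(beta * Vx + rho * violation)

# Example: double integrator xdot = (x2, u) with a quadratic CLF V(x) = x' P x.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
V = lambda x: x @ P @ x
gradV = lambda x: 2.0 * P @ x
f = lambda x: np.array([x[1], 0.0])
g = lambda x: np.array([[0.0], [1.0]])
print(clf_shaped_reward(np.array([1.0, -0.5]), np.array([0.2]), V, gradV, f, g))
```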
Load-bearing premise
A valid control Lyapunov function must exist for the system, and the reinforcement learning procedure must converge to the optimal policy induced by the shaped reward.
What would settle it
A counterexample would be a dynamical system possessing a known CLF where the policy learned from the CLF-shaped reward fails to produce exponential convergence to the equilibrium.
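What settling it could look like in practice: a hedged sketch that rolls out a candidate policy on the double integrator (one of the paper's benchmark systems) and reports the first time the trajectory escapes an exponential envelope ‖x(t)‖ ≤ M e^(−λt) ‖x(0)‖. The stand-in linear policy, the envelope constants, and the forward-Euler integration are assumptions for illustration, not the paper's experiment; finding no violation over one horizon is evidence, not a proof.

```python
import numpy as np

def first_envelope_violation(policy, x0, M=5.0, lam=0.3, dt=1e-3, T=20.0):
    """Roll out the double integrator xdot = (x2, policy(x)) with forward Euler and
    return the first time t at which ||x(t)|| exceeds M * exp(-lam * t) * ||x0||,
    or None if no violation is seen over the horizon."""
    x = np.asarray(x0, dtype=float)
    n0 = np.linalg.norm(x)
    for k in range(int(T / dt)):
        t = k * dt
        if np.linalg.norm(x) > M * np.exp(-lam * t) * n0 + 1e-12:
            return t                               # candidate counterexample time
        u = policy(x)
        x = x + dt * np.array([x[1], u])           # Euler step of xdot1 = x2, xdot2 = u
    return None

# Stand-in for a learned policy: a stabilizing linear feedback u = -K x.
K = np.array([1.0, 1.5])
policy = lambda x: float(-K @ x)
print(first_envelope_violation(policy, x0=[1.0, 0.0]))
```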
Original abstract
Reinforcement learning (RL) has become the de facto method for achieving locomotion on humanoid robots in practice, yet stability analysis of the corresponding control policies is lacking. Recent work has attempted to merge control theoretic ideas with reinforcement learning through control guided learning. A notable example of this is the use of a control Lyapunov function (CLF) to synthesize the reinforcement learning rewards, a technique known as CLF-RL, which has shown practical success. This paper investigates the stability properties of optimal controllers using CLF-RL with the goal of bridging experimentally observed stability with theoretical guarantees. The RL problem is viewed as an optimal control problem and exponential stability is proven in both continuous and discrete time using both core CLF reward terms and the additional terms used in practice. The theoretical bounds are numerically verified on systems such as the double integrator and cart-pole. Finally, the CLF guided rewards are implemented for a walking humanoid robot to generate stable periodic orbits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper views CLF-RL as an optimal control problem whose reward is shaped by a control Lyapunov function. It claims to prove exponential stability of the resulting optimal policy in both continuous and discrete time, for both the core CLF reward terms and the additional practical terms. The theoretical bounds are numerically verified on the double integrator and cart-pole; the same reward construction is then implemented on a humanoid robot to produce stable periodic walking orbits.
Significance. If the stability proofs are correct, the work supplies a useful bridge between control-theoretic guarantees and the CLF-shaped rewards already used in practice for locomotion. The explicit reduction to optimal control and the use of standard CLF decrease conditions to obtain exponential bounds are strengths, as is the numerical check on standard benchmark systems. The practical humanoid demonstration shows the method is implementable, but the overall significance is tempered by the lack of any suboptimality or robustness margin for the approximate policies that RL actually produces.
major comments (2)
- [stability proofs (continuous and discrete time)] The exponential-stability claims (both continuous- and discrete-time) are derived under the assumption that the learned policy exactly equals the optimal policy induced by the CLF-shaped reward. No suboptimality bound, robustness margin, or Lyapunov decrease condition under policy approximation error is provided (the kind of condition that is missing is sketched just after this list). This is load-bearing for the central claim that CLF-RL itself is stable, because RL algorithms converge only approximately.
- [humanoid robot implementation] Humanoid walking example: only implementation and visual inspection of periodic orbits are reported. No verification is given that the learned policy remains inside the region of attraction or satisfies the Lyapunov decrease condition derived for the exact optimum.
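For concreteness, a hedged sketch of the kind of condition the first comment asks for, not taken from the paper: assume the learned policy π̂ satisfies ‖π̂(x) − π*(x)‖ ≤ ε and ‖∇V(x)ᵀ g(x)‖ ≤ L on the region of interest; then along the closed loop under π̂,

```latex
\dot V(x)
  = \nabla V(x)^{\top}\bigl(f(x) + g(x)\hat{\pi}(x)\bigr)
  \le \underbrace{\nabla V(x)^{\top}\bigl(f(x) + g(x)\pi^{*}(x)\bigr)}_{\le\, -\alpha V(x)}
      + \bigl\|\nabla V(x)^{\top} g(x)\bigr\| \, \bigl\|\hat{\pi}(x) - \pi^{*}(x)\bigr\|
  \le -\alpha V(x) + L\,\varepsilon .
```

This yields convergence to a residual set whose size scales with ε (practical or ISS-style stability) rather than exponential convergence to the origin; making a bound of this form explicit is what the comment requests.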
minor comments (2)
- The assumptions required for the stability theorems (existence of a valid CLF, controllability, etc.) should be collected in a single, explicit list or theorem statement rather than scattered through the text.
- Notation: the value function induced by the CLF reward is easily confused with the standard RL value function; a short clarifying remark or distinct symbol would help.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, agreeing to revisions that clarify the scope of the results without altering the core claims.
Point-by-point responses
-
Referee: The exponential-stability claims (both continuous- and discrete-time) are derived under the assumption that the learned policy exactly equals the optimal policy induced by the CLF-shaped reward. No suboptimality bound, robustness margin, or Lyapunov decrease condition under policy approximation error is provided. This is load-bearing for the central claim that CLF-RL itself is stable, because RL algorithms converge only approximately.
Authors: The theoretical results establish exponential stability for the exact optimal policy induced by the CLF-shaped reward, with this assumption stated explicitly in the problem formulation and theorem statements. We do not derive suboptimality or robustness margins for approximate policies, as these would require additional assumptions on the RL algorithm and are beyond the paper's scope. We will revise the abstract, introduction, and conclusion to emphasize that the guarantees apply strictly to the optimal policy and to note the distinction from practical RL approximations. revision: yes
-
Referee: Humanoid walking example: only implementation and visual inspection of periodic orbits are reported. No verification is given that the learned policy remains inside the region of attraction or satisfies the Lyapunov decrease condition derived for the exact optimum.
Authors: The humanoid demonstration illustrates the practical use of the CLF-guided reward to produce stable periodic walking via visual confirmation of orbits. Because the exact optimal policy cannot be computed for this high-dimensional system, direct verification of the Lyapunov decrease condition or region of attraction on the hardware is not possible. We will revise the relevant section to clarify that the example is an empirical illustration of the reward design rather than a formal verification of the theoretical bounds. revision: yes
Circularity Check
No circularity: stability derived from standard CLF and optimal-control theorems
Full rationale
The derivation views the CLF-shaped RL problem as an optimal control problem and invokes standard Lyapunov decrease conditions plus optimality to prove exponential stability for both core and practical reward terms in continuous and discrete time. This rests on external definitions of CLFs and value functions rather than any fitted quantity, self-defined term, or self-citation chain that reduces the claim to its own inputs. Benchmark simulations and the humanoid implementation serve as external checks. No load-bearing step matches the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A control Lyapunov function exists for the dynamical system under consideration.
- domain assumption: The reinforcement learning algorithm converges to the optimal policy defined by the CLF-shaped reward.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost (J-cost) · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
ℓ(x, u) := βV(x) + ρ[V̇(x, u) + λV(x)]₊ ... J*(x) = inf_π J_π(x)
-
Foundation.LogicAsFunctionalEquation (no analogue: V is assumed, not forced) · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
c₁‖x‖² ≤ V(x) ≤ c₂‖x‖², ∇V(x)ᵀ(f(x)+g(x)μ(x)) ≤ −αV(x)
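For reference, the quoted conditions produce the exponential estimate by a routine comparison argument (standard Lyapunov reasoning, not a restatement of the paper's proof):

```latex
% From \nabla V(x)^\top (f(x) + g(x)\mu(x)) \le -\alpha V(x), Gronwall's inequality gives
V(x(t)) \le e^{-\alpha t}\, V(x(0)),
% and the quadratic bounds c_1\|x\|^2 \le V(x) \le c_2\|x\|^2 then yield
\|x(t)\| \le \sqrt{c_2 / c_1}\; e^{-\alpha t / 2}\, \|x(0)\| .
```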
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] S. Kajita, F. Kanehiro, K. Kaneko, K. Yokoi, and H. Hirukawa, "The 3D linear inverted pendulum mode: a simple modeling for a biped walking pattern generation," in Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1. Maui, HI, USA: IEEE, 2001, pp. 239–246.
- [2] J. Pratt, J. Carff, S. Drakunov, and A. Goswami, "Capture Point: A Step toward Humanoid Push Recovery," in 2006 6th IEEE-RAS International Conference on Humanoid Robots. Genova, Italy: IEEE, Dec. 2006, pp. 200–207.
- [3] P. M. Wensing, M. Posa, Y. Hu, A. Escande, N. Mansard, and A. D. Prete, "Optimization-based control for dynamic legged robots," IEEE Transactions on Robotics, vol. 40, pp. 43–63, 2024.
- [4] A. D. Ames, K. Galloway, K. Sreenath, and J. W. Grizzle, "Rapidly Exponentially Stabilizing Control Lyapunov Functions and Hybrid Zero Dynamics," IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 876–891, Apr. 2014.
- [5] E. Westervelt, J. Grizzle, and D. Koditschek, "Hybrid zero dynamics of planar biped walkers," IEEE Transactions on Automatic Control, vol. 48, no. 1, pp. 42–56, Jan. 2003.
- [6] E. R. Westervelt, J. W. Grizzle, C. Chevallereau, J. H. Choi, and B. Morris, Feedback Control of Dynamic Bipedal Robot Locomotion, 1st ed. CRC Press, Oct. 2018.
- [7] Y. Gu, G. Shi, F. Shi, I.-C. Chang, Y.-J. Wang, Q. Cheng, Z. Olkin, I. Lopez-Sanchez, Y. Feng, J. Zhang, A. Ames, H. Su, and K. Sreenath, "Evolution of humanoid locomotion control," 2025, under review. [Online]. Available: https://www.thetracelab.com/uploads/1/1/3/0/113094493/evolution of humanoid locomotion control 1203.pdf
- [8] K. Li, Z. Olkin, Y. Yue, and A. D. Ames, "CLF-RL: Control Lyapunov Function Guided Reinforcement Learning," IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 3230–3237, Mar. 2026.
- [9] Z. Olkin, K. Li, W. D. Compton, and A. D. Ames, "Chasing Stability: Humanoid Running via Control Lyapunov Function Guided Reinforcement Learning," Sep. 2025.
- [10] E. D. Sontag, "A 'universal' construction of Artstein's theorem on nonlinear stabilization," Systems & Control Letters, vol. 13, no. 2, pp. 117–123, 1989.
- [11] K. Galloway, K. Sreenath, A. D. Ames, and J. W. Grizzle, "Torque Saturation in Bipedal Robotic Walking Through Control Lyapunov Function-Based Quadratic Programs," IEEE Access, vol. 3, 2015.
- [12] J. P. Sleiman, H. Li, A. Adu-Bredu, R. Deits, A. Kumar et al., "ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control," Jan. 2026.
- [13] Q. Liao, T. E. Truong, X. Huang, G. Tevet, K. Sreenath, and C. K. Liu, "BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion," Aug. 2025.
- [14] Z. Olkin, W. D. Compton, R. M. Bena, and A. D. Ames, "Chasing autonomy: Dynamic retargeting and control guided RL for performant and controllable humanoid running," 2026. [Online]. Available: https://arxiv.org/abs/2603.25902
- [15] Y. Dong, X. Tang, and Y. Yuan, "Principled reward shaping for reinforcement learning via Lyapunov stability theory," Neurocomputing, vol. 393, pp. 83–90, Jun. 2020.
- [16] T. Westenbroek, F. Castaneda, A. Agrawal, S. Sastry, and K. Sreenath, "Lyapunov Design for Robust and Efficient Robotic Reinforcement Learning," Nov. 2022.
- [17] J. A. Primbs, "Nonlinear Optimal Control: A Receding Horizon Approach," Ph.D. dissertation, California Institute of Technology, 1999.
- [18] M. Falcone and R. Ferretti, Semi-Lagrangian Approximation Schemes for Linear and Hamilton–Jacobi Equations. SIAM, 2014.
- [19] M. Diehl, H. Bock, H. Diedam, and P.-B. Wieber, "Fast Direct Multiple Shooting Algorithms for Optimal Robot Control," in Fast Motions in Biomechanics and Robotics, M. Diehl and K. Mombaur, Eds. Springer Berlin Heidelberg, 2006, vol. 340, pp. 65–93.
- [20] R. S. Sutton, A. G. Barto et al., "Reinforcement learning," Journal of Cognitive Neuroscience, vol. 11, no. 1, pp. 126–134, 1999.
- [21] R. Geiselhart and F. R. Wirth, "Relaxed ISS small-gain theorems for discrete-time systems," SIAM Journal on Control and Optimization, vol. 54, no. 2, pp. 423–449, 2016.
- [22] M. Mittal, P. Roth, J. Tigue, A. Richard et al., "Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning," arXiv preprint arXiv:2511.04831, 2025.
- [23] N. Rudin, D. Hoeller, P. Reist, and M. Hutter, "Learning to walk in minutes using massively parallel deep reinforcement learning," in Conference on Robot Learning. PMLR, 2022, pp. 91–100.
discussion (0)