Lyapunov-Certified Direct Switching Theory for Q-Learning
Pith reviewed 2026-05-10 02:18 UTC · model grok-4.3
The pith
Q-learning error dynamics reduce to a switched linear system whose worst-case exponential rate is the joint spectral radius of the switching matrices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that standard Q-learning admits a direct stochastic switching-system representation in which the conditional-mean drift is a switched linear system, and that the leading exponential convergence rate is given exactly by the joint spectral radius (JSR) of that switching family. Finite-time bounds follow from a product-defined JSR-induced Lyapunov function that applies to the full family whenever the JSR is less than one; an optional common quadratic Lyapunov certificate, when feasible, yields simpler bounds via a single matrix inequality. The framework also extends to Markovian observation models.
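The JSR is defined through limits of matrix-product norms, so a brute-force two-sided estimate is easy to sketch. The illustration below is not from the paper; the diagonal 2×2 matrix family is hypothetical. The estimator enumerates all switching words up to a fixed length, using spectral radii of products as lower bounds and induced norms as upper bounds (both converge to the JSR as the word length grows).

```python
import itertools
import numpy as np

def jsr_estimate(matrices, max_len=6):
    """Crude two-sided estimate of the joint spectral radius.

    For words w of length k, max_w rho(A_w)^(1/k) lower-bounds the JSR
    and max_w ||A_w||_2^(1/k) upper-bounds it.
    """
    lo, hi = 0.0, np.inf
    for k in range(1, max_len + 1):
        best_rho, best_norm = 0.0, 0.0
        for word in itertools.product(matrices, repeat=k):
            prod = np.linalg.multi_dot(word) if k > 1 else word[0]
            best_rho = max(best_rho, float(np.max(np.abs(np.linalg.eigvals(prod)))))
            best_norm = max(best_norm, float(np.linalg.norm(prod, 2)))
        lo = max(lo, best_rho ** (1.0 / k))
        hi = min(hi, best_norm ** (1.0 / k))
    return lo, hi

# Hypothetical commuting family: here the JSR equals the largest
# spectral radius, 0.9, so the two bounds should pinch together.
A1 = 0.9 * np.diag([1.0, 0.5])
A2 = 0.9 * np.diag([0.5, 1.0])
lo, hi = jsr_estimate([A1, A2], max_len=6)
```

Enumeration is exponential in the word length, which is why the abstract's appeal to specialized approximation methods (e.g. reference [2]) matters for anything beyond toy families.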
What carries the argument
The direct switching-system representation of the Q-learning error, which encodes the Bellman maximization as an exact convex combination of action-wise errors under a stochastic policy and produces a switched linear drift with martingale noise.
If this is right
- The JSR supplies the exact worst-case exponential rate for the switched linear drift in this Q-learning representation.
- Finite-time bounds follow directly from the product-defined JSR-induced Lyapunov function whenever the JSR is below one.
- When a common quadratic Lyapunov certificate exists, it replaces product-based verification by a single matrix inequality and yields a simpler stochastic bound.
- The same switching representation and rate analysis carry over to Markovian observation models.
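As a concrete reading of the quadratic-certificate bullet: the certificate amounts to a single positive definite P with A_iᵀPA_i ⪯ ρ²P for every switching matrix, which certifies JSR ≤ ρ but can be infeasible even when the JSR is below ρ. A minimal numeric feasibility check, with a hypothetical matrix family and the candidate P = I (none taken from the paper):

```python
import numpy as np

def quadratic_certificate_holds(matrices, P, rho):
    """Check the common quadratic Lyapunov condition
    A_i^T P A_i - rho^2 P <= 0 (negative semidefinite) for every A_i,
    given a symmetric positive definite P. Sufficient only: failure
    does not disprove JSR < rho.
    """
    assert np.all(np.linalg.eigvalsh((P + P.T) / 2) > 0), "P must be positive definite"
    for A in matrices:
        M = A.T @ P @ A - rho**2 * P
        # symmetrize before the eigenvalue test, with a small tolerance
        if np.max(np.linalg.eigvalsh((M + M.T) / 2)) > 1e-10:
            return False
    return True

# Hypothetical family admitting the obvious certificate P = I at rho = 0.95
A1 = np.array([[0.8, 0.0], [0.1, 0.7]])
A2 = np.array([[0.7, 0.1], [0.0, 0.8]])
ok = quadratic_certificate_holds([A1, A2], P=np.eye(2), rho=0.95)
```

In practice P would be found by an LMI solver rather than guessed; this sketch only verifies a given candidate, which is the cheap direction of the certificate.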
Where Pith is reading between the lines
- If the JSR can be bounded or approximated for concrete MDPs, the bounds could be used to select exploration policies that keep the radius small and thereby speed up learning.
- The same direct-switching construction may extend to other value-iteration methods that rely on maximization, such as SARSA or actor-critic variants.
- Practical computation of the JSR via existing numerical tools would turn the theoretical rate into a verifiable certificate for moderate-sized problems.
Load-bearing premise
The Bellman maximization error can be expressed exactly as an average of action-wise Q-errors under a suitable stochastic policy.
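This premise can be probed numerically. The sketch below uses the deterministic greedy policy for the perturbed values Q* + e as a special case of the "suitable stochastic policy"; the vectors are hypothetical. It confirms the exact decomposition max(Q* + e) − max Q* = E_π[e] + (E_π[Q*] − max Q*) and shows the bias term is nonpositive, vanishing exactly when the perturbed and unperturbed argmax actions coincide.

```python
import numpy as np

def decompose_max_error(q_star, e):
    """Decompose max(Q* + e) - max(Q*) under the policy pi that is
    greedy for the perturbed values Q* + e:
        max(Q* + e) - max Q* = E_pi[e] + (E_pi[Q*] - max Q*).
    The second (bias) term is nonpositive and vanishes exactly when
    pi's action is also optimal for Q* alone.
    """
    a = int(np.argmax(q_star + e))       # greedy action under Q* + e
    lhs = np.max(q_star + e) - np.max(q_star)
    avg_error = e[a]                     # E_pi[e] for this deterministic pi
    bias = q_star[a] - np.max(q_star)    # E_pi[Q*] - max Q*  (<= 0)
    return lhs, avg_error, bias

# Hypothetical action values; the error vector flips the argmax,
# so the bias term is strictly negative.
lhs, avg, bias = decompose_max_error(np.array([1.0, 0.5]), np.array([0.0, 0.7]))
```

Here lhs = 0.2 while the averaged error alone is 0.7; the bias of −0.5 closes the gap, which is exactly the piecewise-affine correction the referee report flags below.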
What would settle it
Compute the joint spectral radius for the switching matrices arising from a simple finite MDP; if the radius is less than one yet repeated runs of standard Q-learning on that MDP fail to exhibit the predicted exponential decay rate, the representation or rate claim is refuted.
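A hedged sketch of the simulation half of that test, with a hypothetical 2-state, 2-action MDP, constant step size, and fixed seed (none taken from the paper); the JSR half would be supplied by the paper's construction of the switching matrices for the same MDP.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.8, 0.1

# Hypothetical MDP: transition probabilities P[s, a, s'] and rewards R[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 0.2]])

# Ground-truth Q* by value iteration on the known model
Q_star = np.zeros((2, 2))
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Synchronous sampled Q-learning: one sampled next state per (s, a) per sweep
Q = np.zeros((2, 2))
errors = []
for _ in range(5000):
    for s in range(2):
        for a in range(2):
            s_next = rng.choice(2, p=P[s, a])
            Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    errors.append(np.abs(Q - Q_star).max())
# errors[k] traces ||Q_k - Q*||_inf; comparing its early decay slope
# against the predicted JSR-based rate is the refutation test.
```

With a constant step size the error decays only down to a noise floor of order α, so the slope comparison must be restricted to the transient phase (or rerun with diminishing steps).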
original abstract
Q-learning is a fundamental algorithmic primitive in reinforcement learning. This paper develops a new framework for analyzing Q-learning from a switching-system viewpoint. In particular, we derive a direct stochastic switching-system representation of the Q-learning error. The key observation is that the Bellman maximization error can be expressed exactly as an average of action-wise Q-errors under a suitable stochastic policy. The resulting recursion has a switched linear conditional-mean drift and martingale-difference noise. To the best of our knowledge, this is the first convergence-rate analysis of standard Q-learning whose leading exponential rate is expressed through the joint spectral radius (JSR) of a direct switching family. Since the JSR is the exact worst-case exponential rate of the associated switched linear drift, the resulting rate is among the tightest drift-based rates that can be certified for this Q-learning representation. Building on this representation, we prove finite-time bounds based on a product-defined JSR-induced Lyapunov function and also give an optional common quadratic Lyapunov certificate. The quadratic certificate is only a sufficient condition and hence applies only to instances for which the certificate is feasible, whereas the JSR-induced Lyapunov construction applies to the full direct switching family whenever its JSR is below one. When feasible, the quadratic certificate replaces product-based verification by a computable matrix inequality and gives a simpler stochastic bound. We further extend the framework to Markovian observation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a switching-system framework for analyzing the convergence of standard Q-learning. It claims that the Q-learning error recursion admits a direct stochastic switched-linear representation whose conditional-mean drift is governed by a family of matrices whose joint spectral radius (JSR) supplies the leading exponential rate; finite-time bounds are then obtained from a product-defined JSR-induced Lyapunov function (with an optional common quadratic certificate when feasible). The framework is extended to Markovian observation models.
Significance. If the central representation holds exactly, the work would supply the first JSR-based rate for vanilla Q-learning and a Lyapunov certificate that is tight for the switched drift whenever the JSR is less than one. The explicit separation of the quadratic certificate (sufficient but computable) from the JSR construction (necessary and sufficient for the family) is a useful distinction. Reproducible code or machine-checked proofs are not mentioned.
major comments (2)
- [Abstract / §3 (key identity)] The claim that the Bellman maximization error equals exactly the average of action-wise Q-errors under a suitable stochastic policy π is load-bearing for the entire switched-linear argument. The identity max(Q* + e) − max Q* = E_π[e] + (E_π[Q*] − max Q*) contains a non-positive bias term E_π[Q*] − max Q* that vanishes only when π is supported on an optimal action for Q*. When the current greedy action differs from an optimal action, the bias is strictly negative and state-dependent, rendering the conditional drift piecewise affine rather than linear in e. Consequently the recursion cannot be written as a switched linear system driven by a fixed matrix family plus martingale-difference noise, and the JSR of any fixed family does not govern the exponential rate.
- [§4, §5] The finite-time bounds (§4) and the quadratic certificate (§5) both presuppose the switched-linear drift derived in §3. If the bias term identified above is present, the product-based bounds and the LMI feasibility condition no longer apply to the actual Q-learning process; the manuscript should either derive the exact (biased) recursion or restrict the claim to the special case in which the bias vanishes identically.
minor comments (2)
- [§3] Notation for the stochastic policy π that realizes the averaging should be introduced explicitly when the key identity is stated, together with the precise support condition that makes the bias zero.
- [§6] The extension to Markovian observations (§6) inherits the same representation; any correction to the core identity must be propagated there as well.
Simulated Author's Rebuttal
We thank the referee for the careful and insightful review. The comments correctly identify a subtlety in the key identity used to obtain the switched-linear representation. We address each major comment below and outline the revisions we will make.
point-by-point responses
-
Referee: Abstract / §3 (key identity): the claim that the Bellman maximization error equals exactly the average of action-wise Q-errors under a suitable stochastic policy π is load-bearing for the entire switched-linear claim. The identity max(Q* + e) − max Q* = E_π[e] + (E_π[Q*] − max Q*) contains a non-positive bias term E_π[Q*] − max Q* that vanishes only when π is supported on an optimal action for Q*. When the current greedy action differs from an optimal action, the bias is strictly negative and state-dependent, rendering the conditional drift piecewise affine rather than linear in e. Consequently the recursion cannot be written as a switched linear system driven by a fixed matrix family plus martingale-difference noise, and the JSR of any fixed family does not govern the exponential rate.
Authors: We thank the referee for this precise observation. The identity indeed includes the bias term (E_π[Q*] − max Q*) whenever π is supported on the argmax of the perturbed values Q* + e but not on the argmax of Q* alone. This renders the conditional drift piecewise affine. In the manuscript the “suitable stochastic policy” is chosen to realize the max operator exactly, so the representation is exact only when the argmax actions coincide. We will revise §3 to state the identity with the bias term made explicit, derive the resulting recursion, and show that the bias is a state-dependent but bounded perturbation whose effect on the leading exponential rate can be absorbed into the JSR analysis (the JSR of the linear part remains an upper bound on the contraction rate). The abstract will be updated to reflect the precise statement. This preserves the core JSR-based rate while correcting the claim of exact linearity. revision: partial
-
Referee: §4, §5: the JSR-induced Lyapunov function and the matrix-inequality certificate both presuppose the switched-linear drift derived in §3. If the bias term identified above is present, the product-based bounds and the LMI feasibility condition no longer apply to the actual Q-learning process; the manuscript should either derive the exact (biased) recursion or restrict the claim to the special case in which the bias vanishes identically.
Authors: We agree that the finite-time bounds and quadratic certificate in §§4–5 rest on the form of the drift. With the revised recursion in §3 that includes the bias, we will update the proofs in §4 to treat the bias as an additive perturbation whose magnitude is controlled by the current error norm; the product-defined JSR-induced Lyapunov function remains valid and yields finite-time bounds that differ from the original only by an additive term that vanishes exponentially. For the common quadratic certificate in §5 we will either restrict its applicability to the bias-free case or supply a modified LMI that accounts for the worst-case bias. These changes will be incorporated in the revised manuscript. revision: yes
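The structure of the promised revised bound can be sanity-checked on the scalar recursion itself: iterating a_{k+1} = β²a_k + c (constants here are hypothetical, standing in for the step-size and noise terms) decays geometrically at rate β² toward the additive floor c/(1 − β²), matching the "additive term" description in the response. A toy numeric check:

```python
# Iterate the perturbed scalar recursion a_{k+1} = beta^2 * a_k + c.
# Unrolling gives a_k = beta^(2k) * a_0 + c * (1 - beta^(2k)) / (1 - beta^2),
# so a_k converges geometrically to the floor c / (1 - beta^2).
beta, c, a = 0.9, 0.01, 10.0
for _ in range(2000):
    a = beta**2 * a + c
floor = c / (1 - beta**2)  # asymptotic value of the bound
```

The floor scales with c, so in the authors' setting it would shrink with the squared step size, consistent with the constant-step noise floor familiar from stochastic approximation.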
Circularity Check
No circularity: derivation applies JSR to explicitly constructed switched representation
full rationale
The paper constructs switching matrices directly from the Q-learning update rule and the claimed exact averaging of the Bellman error under a stochastic policy, then invokes the mathematical definition of the JSR as the worst-case growth rate of the resulting switched linear system. This is a standard application of an external tool to a derived model rather than a reduction of the claimed rate to itself by construction. No parameters are fitted to data and relabeled as predictions, no self-citations justify core uniqueness or ansatzes, and the central bound follows from the representation's validity (argued via the key observation) without tautological redefinition. The analysis remains anchored to an external benchmark, namely the standard definition of the JSR.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The MDP has finite state and action spaces and the Q-learning update uses a suitable stochastic policy that makes the Bellman error an exact average of action-wise errors.
Forward citations
Cited by 2 Pith papers
-
Switching-Geometry Analysis of Deflated Q-Value Iteration
Deflated Q-VI is algebraically equivalent to recentering standard Q-VI, yet its error dynamics are governed by the joint spectral radius of a projected switching system that can be strictly smaller than the discount factor γ.
-
A Switching System Theory of Q-Learning with Linear Function Approximation
Q-learning with linear function approximation is recast as a switched linear system whose mean dynamics converge precisely when the joint spectral radius of the switching matrices is less than one.
Reference graph
Works this paper leans on
- [1] Carolyn L. Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12):1203–1208, 2012.
- [2] Vincent D. Blondel and Yurii Nesterov. Computationally efficient approximations of the joint spectral radius. SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005.
- [3] Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
- [4] Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear Matrix Inequalities in System and Control Theory, volume 15 of Studies in Applied Mathematics. SIAM, Philadelphia, PA, 1994.
- [5] Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. arXiv preprint arXiv:2102.01567, 2021.
- [6] Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
- [7] Vineet Goyal and Julien Grand-Clement. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2023.
- [8] Jianghai Hu, Jinglai Shen, and Wei Zhang. Generating functions of switched linear systems: analysis, computation, and stability applications. IEEE Transactions on Automatic Control, 56(5):1059–1074, 2010.
- [9] Jianghai Hu, Jinglai Shen, and Donghwan Lee. Resilient stabilization of switched linear control systems against adversarial switching. IEEE Transactions on Automatic Control, 62(8):3820–3834, 2017.
- [10] Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. Advances in Neural Information Processing Systems, 6, 1993.
- [11] Raphaël Jungers. The Joint Spectral Radius: Theory and Applications, volume 385. Springer Science & Business Media, 2009.
- [12] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, 11, 1998.
- [13] Harold Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.
- [14] Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach. IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024.
- [15] Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In 34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020.
- [16] Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning. SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023.
- [17] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Transactions on Information Theory, 68(1):448–473, 2021.
- [18] Daniel Liberzon. Switching in Systems and Control. Springer Science & Business Media, 2003.
- [19] Han-Dong Lim and Donghwan Lee. Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view. IEEE Access, 12:149916–149939, 2024.
- [20] Hai Lin and Panos J. Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results. IEEE Transactions on Automatic Control, 54(2):308–322, 2009.
- [21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- [22] Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In Conference on Learning Theory, pages 3185–3205, 2020.
- [23] Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius. Indag. Math., 22(4):379–381, 1960.
- [24] Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulff, and Christopher King. Stability criteria for switched and hybrid systems. SIAM Review, 49(4):545–592, 2007.
- [25] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
- [26] Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pages 1064–1070, 1998.
- [27] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
- [28] John N. Tsitsiklis and Vincent D. Blondel. The Lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate. Mathematics of Control, Signals and Systems, 10(1):31–40, 1997.
- [29] Martin J. Wainwright. Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019.
- [30] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
discussion (0)