Recognition: 2 theorem links
· Lean Theorem
A Switching System Theory of Q-Learning with Linear Function Approximation
Pith reviewed 2026-05-13 06:10 UTC · model grok-4.3
The pith
The mean dynamics of Q-learning with linear function approximation are exactly equivalent to a linear switched system whose stability determines convergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive an exact linear switched model for the mean dynamics of Q-learning with linear function approximation. Convergence of the iterates is related to the stability of the corresponding switched system, characterized by the joint spectral radius of the switching modes. The construction extends to stochastic Q-learning with i.i.d. and Markovian data, as well as to regularized Q-learning.
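A hedged sketch of the object this claim describes (the notation here is illustrative, not the paper's): with step size α, a finite mode set {(A_m, b_m)}, and a mode selector σ(θ) induced by the greedy-action pattern at θ, the mean update takes a switched affine form:

```latex
% Sketch under assumed notation: \sigma(\theta) is the greedy-action pattern at \theta.
\theta_{t+1}
  = \theta_t + \alpha\bigl(A_{\sigma(\theta_t)}\,\theta_t + b_{\sigma(\theta_t)}\bigr)
  = \bigl(I + \alpha A_{\sigma(\theta_t)}\bigr)\theta_t + \alpha\, b_{\sigma(\theta_t)} .
```

Stability under arbitrary switching is then governed by the joint spectral radius of the matrix family {I + αA_m}.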
What carries the argument
The finite set of linear switching modes that exactly represent the mean update, with stability certified by their joint spectral radius.
If this is right
- Convergence holds if and only if the joint spectral radius of the mode set is strictly less than one.
- Certificates based on finite products of modes can be less conservative than one-step norm bounds.
- The same switched-system stability criterion applies to both i.i.d. and Markovian observation models.
- Regularized Q-learning with linear function approximation inherits the identical stability characterization.
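The second bullet can be illustrated numerically. Below is a minimal sketch with two hypothetical 2x2 modes (assumed values, not taken from the paper): each mode alone admits a spectral norm above one, yet every length-2 product contracts, so the finite-product certificate succeeds where the one-step norm bound fails.

```python
import numpy as np
from itertools import product

# Two hypothetical switching modes (illustrative only, not the paper's matrices).
A1 = np.array([[0.0, 1.2],
               [0.0, 0.0]])
A2 = np.array([[0.0, 0.0],
               [0.5, 0.0]])
modes = [A1, A2]

def jsr_upper_bound(modes, k):
    """Gelfand-style JSR upper bound: max spectral norm over all
    length-k mode products, taken to the power 1/k."""
    best = 0.0
    for word in product(modes, repeat=k):
        P = np.linalg.multi_dot(word) if k > 1 else word[0]
        best = max(best, np.linalg.norm(P, 2))
    return best ** (1.0 / k)

one_step = jsr_upper_bound(modes, 1)   # max single-mode norm: 1.2, inconclusive
two_step = jsr_upper_bound(modes, 2)   # sqrt(0.6) ~ 0.775, certifies stability
print(one_step, two_step)
```

Every such bound majorizes the true JSR; here the pair's JSR actually equals sqrt(0.6), since the product A1·A2 has spectral radius 0.6, so the length-2 certificate is tight in this toy.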
Where Pith is reading between the lines
- If the switched model is faithful, standard tools from switched-system theory could be repurposed to select function approximators that guarantee a joint spectral radius below one.
- The explicit link to finite-difference stochastic-policy switching suggests that abrupt policy changes during learning appear as discrete mode jumps in parameter space.
- The parameter-space formulation may extend to other temporal-difference algorithms whose mean updates can be expressed as convex combinations of linear operators.
Load-bearing premise
The mean dynamics of Q-learning with linear function approximation admit an exact representation as a finite collection of linear switching modes.
What would settle it
A numerical simulation in which the Q-learning parameter trajectory deviates from the trajectory generated by the switched linear model under identical initial conditions, step-size schedule, and sample sequence.
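The proposed falsification experiment can be sketched on a hypothetical one-state, two-action toy with scalar θ, i.i.d. uniform behavior, and zero rewards (all values are assumptions for illustration). In the i.i.d. case the two trajectories agree by construction, which is the exactness claim; the decisive experiment is the Markovian analogue, where a deviation would support the referee's objection.

```python
import numpy as np

# Hypothetical toy: one state, two actions with scalar features (assumed values).
phi = np.array([1.0, -2.0])   # feature of each action
mu = np.array([0.5, 0.5])     # i.i.d. behavior distribution over actions
gamma, alpha, T = 0.9, 0.01, 200

def mean_q_step(theta):
    """Mean Q-learning update: expected TD correction under the behavior (r = 0)."""
    target = gamma * np.max(phi * theta)           # max over next actions
    td = target - phi * theta                      # per-action TD error
    return theta + alpha * np.sum(mu * phi * td)

def switched_step(theta):
    """Switched-linear form: the active mode is the greedy-action index."""
    sigma = int(np.argmax(phi * theta))            # mode selection
    A = np.sum(mu * phi * (gamma * phi[sigma] - phi))  # scalar mode "matrix"
    return (1.0 + alpha * A) * theta

theta_q = theta_s = -2.0
max_dev = 0.0
for _ in range(T):
    theta_q, theta_s = mean_q_step(theta_q), switched_step(theta_s)
    max_dev = max(max_dev, abs(theta_q - theta_s))
print(max_dev)   # zero up to floating point: exact in the i.i.d. toy
```

The two steps coincide because max_b φ_b θ equals φ_σ θ once σ is the greedy index, so the nonlinearity collapses to mode selection; any persistent gap in a Markovian variant of this protocol would be the deviation the test asks for.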
read the original abstract
This paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). We derive an exact linear switched model for the mean dynamics and relate convergence to stability of the corresponding switched system. The same construction is then used for stochastic linear Q-learning with independent and identically distributed (i.i.d.) observations and with Markovian observations. Although exact JSR computation is difficult in general, the certificate captures products of switching modes and can be less conservative than one-step norm bounds. The framework also yields a JSR-based view of regularized Q-learning with LFA. The resulting analysis connects projected Bellman equations, finite-difference stochastic-policy switching, and switched-system stability in a single parameter-space formulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). It derives an exact linear switched model for the mean dynamics of the iterates and relates convergence to stability of the switched system. The construction is applied to stochastic linear Q-learning under both i.i.d. and Markovian observations, and is extended to a JSR-based view of regularized Q-learning with LFA, connecting projected Bellman equations, finite-difference policy switching, and switched-system stability in parameter space.
Significance. If the exact switched-linear mean dynamics hold, the framework supplies a parameter-space formulation that unifies several strands of RL theory and can yield less conservative stability certificates than one-step norm bounds by considering products of switching modes. The explicit link to JSR theory is a strength when the modes are few and the JSR is computable, offering a concrete route to convergence guarantees for approximate Q-learning.
major comments (1)
- [Markovian observations derivation] The central claim of an exact finite switched-linear representation for the mean dynamics under Markovian observations (abstract and the corresponding derivation section) requires that the conditional expectation E[TD error | theta_t] equals one of a finite collection of affine maps whose active mode depends only on the current theta_t. Because the underlying state distribution is generated by the Markov chain driven by the entire history of theta values, the effective linear map at step t is not in general a function of theta_t alone; this renders the switched system approximate rather than exact and undermines the direct use of its JSR as a convergence certificate for the true mean-field trajectory.
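The referee's point can be made concrete with a toy check (all quantities hypothetical): at the same parameter value, the expected TD update differs across current Markov states, so any map from theta_t alone to the mean update implicitly fixes a state distribution, which under Markovian sampling depends on the history.

```python
import numpy as np

# Hypothetical 2-state, 2-action toy with scalar features (illustrative values).
phi = np.array([[1.0, -1.0],    # features phi[s, a]
                [0.5,  2.0]])
gamma = 0.9

def expected_update(theta, s, P):
    """Expected TD update direction at parameter theta, given current state s.
    P[s][a] is an assumed deterministic next-state table; rewards are zero."""
    g = 0.0
    for a in range(2):           # uniform behavior over actions
        s_next = P[s][a]
        target = gamma * np.max(phi[s_next] * theta)
        g += 0.5 * phi[s, a] * (target - phi[s, a] * theta)
    return g

P = [[0, 1], [1, 0]]             # assumed transition table
theta = 1.0
g0 = expected_update(theta, 0, P)
g1 = expected_update(theta, 1, P)
print(g0, g1)                    # differ: the mean update is state-dependent
```

Since g0 and g1 differ at the same theta, conditioning on theta_t alone does not pin down the mean update unless the state distribution is itself a function of theta, which holds under i.i.d. sampling but not along a history-dependent Markov chain.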
minor comments (1)
- [JSR computation paragraph] The paper notes that exact JSR computation is difficult in general; a brief discussion of practical upper-bound methods (e.g., via semidefinite programming or finite products) would help readers assess the tightness of the resulting certificates.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The observation regarding the Markovian case identifies a genuine subtlety in the exactness of the switched-linear representation. We address this point directly below and outline the corresponding revisions.
read point-by-point responses
-
Referee: The central claim of an exact finite switched-linear representation for the mean dynamics under Markovian observations (abstract and the corresponding derivation section) requires that the conditional expectation E[TD error | theta_t] equals one of a finite collection of affine maps whose active mode depends only on the current theta_t. Because the underlying state distribution is generated by the Markov chain driven by the entire history of theta values, the effective linear map at step t is not in general a function of theta_t alone; this renders the switched system approximate rather than exact and undermines the direct use of its JSR as a convergence certificate for the true mean-field trajectory.
Authors: We agree with the referee that the conditional expectation E[TD error | theta_t] is not strictly a function of theta_t alone under Markovian observations, because the state distribution at time t depends on the full history of parameter values through the policy-induced chain. The manuscript's derivation conditions on theta_t while treating the observation distribution as determined by the current parameter; this step holds exactly only when observations are i.i.d. For the Markovian setting the resulting switched-linear model is therefore an approximation whose fidelity improves when parameter updates are slow relative to the mixing time of the underlying chain. We will revise the abstract, Section 4, and the discussion of convergence certificates to qualify the Markovian claim accordingly, stating that the JSR provides a useful but non-rigorous indicator rather than an exact certificate for the true mean-field trajectory. A brief remark on sufficient conditions for exactness (e.g., fixed theta or sufficiently slow variation) will also be added.
revision: partial
Circularity Check
No circularity: switched model derived directly from update rule; JSR stability is external
full rationale
The paper constructs the linear switched representation by taking the conditional expectation of the Q-learning TD update under linear function approximation, yielding a finite collection of affine maps whose selection depends on the current parameter vector. This algebraic step follows immediately from the definition of the mean-field dynamics and does not presuppose the joint spectral radius result. Convergence is then related to stability of the switched system via the standard JSR criterion, which is imported as an independent mathematical tool rather than being fitted or redefined inside the paper. The same construction is applied to both i.i.d. and Markovian sampling without any self-referential closure or reduction of the claimed exactness to a fitted quantity. No load-bearing premise relies on self-citation chains or ansatzes smuggled from prior work by the same authors.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math The joint spectral radius determines stability of a discrete-time linear switched system
- domain assumption The mean dynamics of Q-learning LFA can be written exactly as a finite collection of linear modes
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
We derive an exact linear switched model for the mean dynamics and relate convergence to stability of the corresponding switched system... the Bellman optimality error induces a stochastic-policy-based switching system, and the stability object is the joint spectral radius (JSR) of the associated switching matrix family
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
Lemma 3 (Stochastic-policy linearization)... V_θ − V_θ̄ = Π_μ Φ (θ − θ̄)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Error bounds for constant step-size Q-learning
Carolyn L. Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12):1203–1208, 2012
work page 2012
-
[2]
Dynamic Programming and Optimal Control, 4th edition, volume II
Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, 4th edition, volume II. Athena Scientific, 2015
work page 2015
-
[3]
The ODE method for convergence of stochastic approximation and reinforcement learning
Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000
work page 2000
-
[4]
Diogo Carvalho, Francisco S. Melo, and Pedro Santos. A new convergent variant of Q-learning with linear function approximation. In Advances in Neural Information Processing Systems, volume 33, pages 19412–19421, 2020
work page 2020
-
[5]
Fengdi Che, Chenjun Xiao, Jincheng Mei, Bo Dai, Ramki Gummadi, Oscar A. Ramirez, Christopher K. Harris, A. Rupam Mahmood, and Dale Schuurmans. Target networks and over-parameterization stabilize off-policy bootstrapping with function approximation. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of M...
work page 2024
-
[6]
A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants
Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. arXiv preprint arXiv:2102.01567, 2021
-
[7]
Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning
Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022. doi: 10.1016/j.automatica.2022.110623
-
[8]
Target network and truncation overcome the deadly triad in Q-learning
Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in Q-learning. SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023. doi: 10.1137/22M1499261
-
[9]
Learning rates for Q-learning
Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003
work page 2003
-
[10]
A first-order approach to accelerated value iteration
Vineet Goyal and Julien Grand-Clement. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2023
work page 2023
-
[11]
Generating functions of switched linear systems: analysis, computation, and stability applications
Jianghai Hu, Jinglai Shen, and Wei Zhang. Generating functions of switched linear systems: analysis, computation, and stability applications. IEEE Transactions on Automatic Control, 56(5):1059–1074, 2010
work page 2010
-
[12]
The joint spectral radius: Theory and applications, volume 385
Raphaël Jungers. The joint spectral radius: Theory and applications, volume 385. Springer Science & Business Media, 2009
work page 2009
-
[13]
Finite-sample convergence rates for Q-learning and indirect algorithms
Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, 11, 1998
work page 1998
-
[14]
Final iteration convergence bound of Q-learning: Switching system approach
Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach. IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024
work page 2024
-
[15]
Lyapunov-Certified Direct Switching Theory for Q-Learning
Donghwan Lee. Lyapunov-certified direct switching theory for Q-learning. arXiv preprint arXiv:2604.19569, 2026
work page 2026
-
[16]
A unified switching system perspective and convergence analysis of Q-learning algorithms
Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In 34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020
work page 2020
-
[17]
A discrete-time switching system analysis of Q-learning
Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning. SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023
work page 2023
-
[18]
Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction
Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Transactions on Information Theory, 68(1):448–473, 2021
work page 2021
-
[19]
Switching in systems and control
Daniel Liberzon. Switching in systems and control. Springer Science & Business Media, 2003
work page 2003
-
[20]
Han-Dong Lim and Donghwan Lee. Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view. IEEE Access, 12:149916–149939, 2024
work page 2024
-
[21]
Han-Dong Lim and Donghwan Lee. Regularized Q-learning. In Advances in Neural Information Processing Systems, volume 37, pages 129855–129887, 2024
work page 2024
-
[22]
Han-Dong Lim and Donghwan Lee. Understanding the theoretical properties of projected Bellman equation, linear Q-learning, and approximate value iteration. arXiv preprint arXiv:2504.10865, 2025
-
[23]
Stability and stabilizability of switched linear systems: A survey of recent results
Hai Lin and Panos J. Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results. IEEE Transactions on Automatic Control, 54(2):308–322, 2009
work page 2009
-
[24]
Linear Q-learning does not diverge: Convergence rates to a bounded set
Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Linear Q-learning does not diverge: Convergence rates to a bounded set. arXiv preprint arXiv:2501.19254, 2025
-
[25]
Francisco S. Melo and M. Isabel Ribeiro. Convergence of Q-learning with linear function approximation. In 2007 European Control Conference (ECC), pages 2671–2678. IEEE, 2007
work page 2007
-
[26]
Sean P. Meyn. The projected Bellman equation in reinforcement learning. IEEE Transactions on Automatic Control, 69(12):8323–8337, 2024. doi: 10.1109/TAC.2024.3409647
- [27]
-
[28]
Finite-time analysis of asynchronous stochastic approximation and Q-learning
Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In Conference on Learning Theory, pages 3185–3205, 2020
work page 2020
-
[29]
A note on the joint spectral radius
Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius. Indag. Math., 22(4):379–381, 1960
work page 1960
-
[30]
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998
work page 1998
-
[31]
The asymptotic convergence-rate of Q-learning
Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems , pages 1064–1070, 1998
work page 1998
-
[32]
Asynchronous stochastic approximation and Q-learning
John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994
work page 1994
-
[33]
John N. Tsitsiklis and Vincent D. Blondel. The Lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate. Mathematics of Control, Signals and Systems, 10(1):31–40, 1997
work page 1997
-
[34]
Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992
work page 1992
-
[35]
Breaking the deadly triad with a target network
Shangtong Zhang, Hengshuai Yao, and Shimon Whiteson. Breaking the deadly triad with a target network. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12621–12631. PMLR, 2021
work page 2021
discussion (0)