pith. machine review for the scientific record.

arxiv: 2605.11021 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

A Switching System Theory of Q-Learning with Linear Function Approximation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords q-learning · linear function approximation · switched systems · joint spectral radius · mean dynamics · convergence analysis · reinforcement learning

The pith

The mean dynamics of Q-learning with linear function approximation are exactly equivalent to a linear switched system whose stability determines convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the average behavior of Q-learning with linear function approximation can be rewritten as an exact linear dynamical system that switches among a finite set of modes. Convergence of the parameter iterates is then equivalent to stability of that switched system, certified by the joint spectral radius of the mode set being less than one. The same modeling step applies to stochastic Q-learning driven by i.i.d. or Markovian samples and to regularized versions. A reader would care because the construction supplies a single parameter-space object that simultaneously captures projected Bellman equations, policy-induced switching, and stability criteria.

Core claim

We derive an exact linear switched model for the mean dynamics of Q-learning with linear function approximation. Convergence of the iterates is related to the stability of the corresponding switched system, characterized by the joint spectral radius of the switching modes. The construction extends to stochastic Q-learning with i.i.d. and Markovian data, as well as to regularized Q-learning.

What carries the argument

The finite set of linear switching modes that exactly represent the mean update, with stability certified by their joint spectral radius.
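This construction can be sketched concretely. The following is a minimal sketch under assumptions of our own choosing: a toy 2-state, 2-action MDP with tabular features and a uniform sampling distribution, none of which come from the paper. Each deterministic greedy policy induces one affine mode, and the mean iterate evolves under whichever mode the current parameter selects.

```python
import numpy as np
from itertools import product

# Toy MDP (assumed example, not from the paper).
gamma, alpha, nS, nA = 0.9, 0.1, 2, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s'] transition kernel
r = rng.standard_normal((nS, nA))                # expected rewards
Phi = np.eye(nS * nA)                            # tabular features (exact representation)
D = np.eye(nS * nA) / (nS * nA)                  # uniform state-action distribution

def mode(policy):
    """Affine mode (A_pi, b) of the mean dynamics for one deterministic policy."""
    PPi = np.zeros((nS * nA, nS * nA))           # row (s,a): probability of (s', pi(s'))
    for s, a, s2 in product(range(nS), range(nA), range(nS)):
        PPi[s * nA + a, s2 * nA + policy[s2]] = P[s, a, s2]
    A = np.eye(nS * nA) + alpha * Phi.T @ D @ (gamma * PPi @ Phi - Phi)
    b = alpha * Phi.T @ D @ r.reshape(-1)
    return A, b

# Finite mode set: one affine map per deterministic policy.
modes = {pi: mode(pi) for pi in product(range(nA), repeat=nS)}

# Mean dynamics as a switched system: the greedy policy of theta_t selects the mode.
theta = rng.standard_normal(nS * nA)
for _ in range(5000):
    pi = tuple(int(a) for a in (Phi @ theta).reshape(nS, nA).argmax(axis=1))
    A, b = modes[pi]
    theta = A @ theta + b
```

With tabular features the fixed point of this switched recursion is the optimal Q-table, since the greedy-mode selection reproduces the Bellman max operator step by step.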

If this is right

  • Convergence follows whenever the joint spectral radius of the mode set is strictly less than one.
  • Certificates based on finite products of modes can be less conservative than one-step norm bounds.
  • The same switched-system stability criterion applies to both i.i.d. and Markovian observation models.
  • Regularized Q-learning with linear function approximation inherits the identical stability characterization.
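The finite-product certificate in the second bullet can be illustrated with a hypothetical pair of 2x2 modes of our own construction (not from the paper): each one-step norm exceeds one, so the crude bound fails, yet products of length eight certify that the joint spectral radius is below one.

```python
import numpy as np
from itertools import product

# Two hypothetical switching modes: one-step spectral norms exceed 1,
# but the joint spectral radius is 0.5 (shared diagonal dominates long products).
M1 = np.array([[0.5, 2.0], [0.0, 0.5]])
M2 = np.array([[0.5, 0.0], [0.0, 0.5]])
modes = [M1, M2]

def jsr_upper_bound(mats, k):
    """Brute-force JSR upper bound: max over all length-k products of ||P||^(1/k)."""
    best = 0.0
    for seq in product(mats, repeat=k):
        prod_mat = np.linalg.multi_dot(seq) if k > 1 else seq[0]
        best = max(best, np.linalg.norm(prod_mat, 2) ** (1.0 / k))
    return best

for k in (1, 2, 4, 8):
    print(k, jsr_upper_bound(modes, k))   # bound tightens as k grows
```

The length-1 bound is roughly 2.1 here, while the length-8 bound drops below one, which is the sense in which finite products can be strictly less conservative than one-step norm bounds.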

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the switched model is faithful, standard tools from switched-system theory could be repurposed to select function approximators that guarantee a joint spectral radius below one.
  • The explicit link to finite-difference stochastic-policy switching suggests that abrupt policy changes during learning appear as discrete mode jumps in parameter space.
  • The parameter-space formulation may extend to other temporal-difference algorithms whose mean updates can be expressed as convex combinations of linear operators.

Load-bearing premise

The mean dynamics of Q-learning with linear function approximation admit an exact representation as a finite collection of linear switching modes.

What would settle it

A numerical simulation in which the Q-learning parameter trajectory deviates from the trajectory generated by the switched linear model under identical initial conditions, step-size schedule, and sample sequence.
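In the i.i.d. mean setting the proposed test can be sketched directly: evolve the iterate once with the max-form mean update and once with the greedy-mode form, under identical conditions, and measure the gap. The toy MDP below is an assumption of ours, not the paper's example; in this setting the two trajectories coincide exactly, so any observed deviation would falsify the representation.

```python
import numpy as np

# Toy 2-state, 2-action MDP with tabular parameters (assumed example).
gamma, alpha, nS, nA = 0.9, 0.1, 2, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.standard_normal((nS, nA))

def mean_update(q):
    """Exact mean Q-learning update, written with the max (Bellman) operator."""
    target = r + gamma * np.einsum('sap,p->sa', P, q.max(axis=1))
    return q + alpha * (target - q) / (nS * nA)   # uniform sampling distribution

def switched_update(q):
    """Same update via the greedy-policy mode: max replaced by the active policy."""
    pi = q.argmax(axis=1)                          # switching signal sigma(theta_t)
    target = r + gamma * np.einsum('sap,p->sa', P, q[np.arange(nS), pi])
    return q + alpha * (target - q) / (nS * nA)

q1 = q2 = rng.standard_normal((nS, nA))
gap = 0.0
for _ in range(500):
    q1, q2 = mean_update(q1), switched_update(q2)
    gap = max(gap, float(np.abs(q1 - q2).max()))
print("max trajectory deviation:", gap)   # 0.0: the switched model is exact here
```

The deviation is zero to the bit, because the greedy-selected entry is the max by definition; the open question raised later in the referee report is whether the same identity survives Markovian sampling, where the state distribution depends on the parameter history.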

Figures

Figures reproduced from arXiv: 2605.11021 by Donghwan Lee, Han-Dong Lim.

Figure 1. Truncated Theorem 1 norm ball for the MDP in … (figures/full_fig_p011_1.png)
Figure 2. Baird’s seven-star Q-learning example. Solid arro… (figures/full_fig_p034_2.png)
Original abstract

This paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). We derive an exact linear switched model for the mean dynamics and relate convergence to stability of the corresponding switched system. The same construction is then used for stochastic linear Q-learning with independent and identically distributed (i.i.d.) observations and with Markovian observations. Although exact JSR computation is difficult in general, the certificate captures products of switching modes and can be less conservative than one-step norm bounds. The framework also yields a JSR-based view of regularized Q-learning with LFA. The resulting analysis connects projected Bellman equations, finite-difference stochastic-policy switching, and switched-system stability in a single parameter-space formulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). It derives an exact linear switched model for the mean dynamics of the iterates and relates convergence to stability of the switched system. The construction is applied to stochastic linear Q-learning under both i.i.d. and Markovian observations, and is extended to a JSR-based view of regularized Q-learning with LFA, connecting projected Bellman equations, finite-difference policy switching, and switched-system stability in parameter space.

Significance. If the exact switched-linear mean dynamics hold, the framework supplies a parameter-space formulation that unifies several strands of RL theory and can yield less conservative stability certificates than one-step norm bounds by considering products of switching modes. The explicit link to JSR theory is a strength when the modes are few and the JSR is computable, offering a concrete route to convergence guarantees for approximate Q-learning.

major comments (1)
  1. [Markovian observations derivation] The central claim of an exact finite switched-linear representation for the mean dynamics under Markovian observations (abstract and the corresponding derivation section) requires that the conditional expectation E[TD error | theta_t] equals one of a finite collection of affine maps whose active mode depends only on the current theta_t. Because the underlying state distribution is generated by the Markov chain driven by the entire history of theta values, the effective linear map at step t is not in general a function of theta_t alone; this renders the switched system approximate rather than exact and undermines the direct use of its JSR as a convergence certificate for the true mean-field trajectory.
minor comments (1)
  1. [JSR computation paragraph] The paper notes that exact JSR computation is difficult in general; a brief discussion of practical upper-bound methods (e.g., via semidefinite programming or finite products) would help readers assess the tightness of the resulting certificates.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The observation regarding the Markovian case identifies a genuine subtlety in the exactness of the switched-linear representation. We address this point directly below and outline the corresponding revisions.

point-by-point responses
  1. Referee: The central claim of an exact finite switched-linear representation for the mean dynamics under Markovian observations (abstract and the corresponding derivation section) requires that the conditional expectation E[TD error | theta_t] equals one of a finite collection of affine maps whose active mode depends only on the current theta_t. Because the underlying state distribution is generated by the Markov chain driven by the entire history of theta values, the effective linear map at step t is not in general a function of theta_t alone; this renders the switched system approximate rather than exact and undermines the direct use of its JSR as a convergence certificate for the true mean-field trajectory.

    Authors: We agree with the referee that the conditional expectation E[TD error | theta_t] is not strictly a function of theta_t alone under Markovian observations, because the state distribution at time t depends on the full history of parameter values through the policy-induced chain. The manuscript's derivation conditions on theta_t while treating the observation distribution as determined by the current parameter; this step holds exactly only when observations are i.i.d. For the Markovian setting the resulting switched-linear model is therefore an approximation whose fidelity improves when parameter updates are slow relative to the mixing time of the underlying chain. We will revise the abstract, Section 4, and the discussion of convergence certificates to qualify the Markovian claim accordingly, stating that the JSR provides a useful but non-rigorous indicator rather than an exact certificate for the true mean-field trajectory. A brief remark on sufficient conditions for exactness (e.g., fixed theta or sufficiently slow variation) will also be added. revision: partial

Circularity Check

0 steps flagged

No circularity: switched model derived directly from update rule; JSR stability is external

full rationale

The paper constructs the linear switched representation by taking the conditional expectation of the Q-learning TD update under linear function approximation, yielding a finite collection of affine maps whose selection depends on the current parameter vector. This algebraic step follows immediately from the definition of the mean-field dynamics and does not presuppose the joint spectral radius result. Convergence is then related to stability of the switched system via the standard JSR criterion, which is imported as an independent mathematical tool rather than being fitted or redefined inside the paper. The same construction is applied to both i.i.d. and Markovian sampling without any self-referential closure or reduction of the claimed exactness to a fitted quantity. No load-bearing premise relies on self-citation chains or ansatzes smuggled from prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard properties of the joint spectral radius for discrete-time switched linear systems and on the standard projected Bellman operator for linear function approximation; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • standard math The joint spectral radius determines stability of a discrete-time linear switched system
    Invoked to equate convergence of the mean dynamics with stability of the switched system.
  • domain assumption The mean dynamics of Q-learning LFA can be written exactly as a finite collection of linear modes
    Central modeling step stated in the abstract.
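For reference, the standard object behind the first axiom, as in Rota and Strang [29] and Jungers [12] (textbook material, not paper-specific): for a finite mode set the joint spectral radius is

```latex
% Joint spectral radius of a finite mode set \mathcal{A} = \{A_1, \dots, A_M\}:
\rho(\mathcal{A}) \;=\; \lim_{k \to \infty}\;
  \max_{i_1, \dots, i_k \in \{1, \dots, M\}}
  \bigl\| A_{i_1} A_{i_2} \cdots A_{i_k} \bigr\|^{1/k}
% The switched system x_{t+1} = A_{\sigma(t)} x_t is asymptotically stable
% under arbitrary switching if and only if \rho(\mathcal{A}) < 1.
```

The limit is independent of the matrix norm, which is why finite products of modes, raised to the power 1/k, give valid and progressively tighter upper bounds.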

pith-pipeline@v0.9.0 · 5419 in / 1305 out tokens · 58022 ms · 2026-05-13T06:10:24.513147+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

  1. [1] Carolyn L. Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12):1203–1208, 2012.
  2. [2] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, 4th edition, volume II. Athena Scientific, 2015.
  3. [3] Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
  4. [4] Diogo Carvalho, Francisco S. Melo, and Pedro Santos. A new convergent variant of Q-learning with linear function approximation. In Advances in Neural Information Processing Systems, volume 33, pages 19412–19421, 2020.
  5. [5] Fengdi Che, Chenjun Xiao, Jincheng Mei, Bo Dai, Ramki Gummadi, Oscar A. Ramirez, Christopher K. Harris, A. Rupam Mahmood, and Dale Schuurmans. Target networks and over-parameterization stabilize off-policy bootstrapping with function approximation. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research.
  6. [6] Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. arXiv preprint arXiv:2102.01567, 2021.
  7. [7] Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022. doi: 10.1016/j.automatica.2022.110623.
  8. [8] Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in Q-learning. SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023. doi: 10.1137/22M1499261.
  9. [9] Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
  10. [10] Vineet Goyal and Julien Grand-Clement. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2023.
  11. [11] Jianghai Hu, Jinglai Shen, and Wei Zhang. Generating functions of switched linear systems: analysis, computation, and stability applications. IEEE Transactions on Automatic Control, 56(5):1059–1074, 2010.
  12. [12] Raphaël Jungers. The Joint Spectral Radius: Theory and Applications, volume 385. Springer Science & Business Media, 2009.
  13. [13] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, 11, 1998.
  14. [14] Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach. IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024.
  15. [15] Donghwan Lee. Lyapunov-certified direct switching theory for Q-learning. arXiv preprint arXiv:2604.19569, 2026.
  16. [16] Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In 34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020.
  17. [17] Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning. SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023.
  18. [18] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Transactions on Information Theory, 68(1):448–473, 2021.
  19. [19] Daniel Liberzon. Switching in Systems and Control. Springer Science & Business Media, 2003.
  20. [20] Han-Dong Lim and Donghwan Lee. Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view. IEEE Access, 12:149916–149939, 2024.
  21. [21] Han-Dong Lim and Donghwan Lee. Regularized Q-learning. In Advances in Neural Information Processing Systems, volume 37, pages 129855–129887, 2024.
  22. [22] Han-Dong Lim and Donghwan Lee. Understanding the theoretical properties of projected Bellman equation, linear Q-learning, and approximate value iteration. arXiv preprint arXiv:2504.10865, 2025.
  23. [23] Hai Lin and Panos J. Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results. IEEE Transactions on Automatic Control, 54(2):308–322, 2009.
  24. [24] Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Linear Q-learning does not diverge: Convergence rates to a bounded set. arXiv preprint arXiv:2501.19254, 2025.
  25. [25] Francisco S. Melo and M. Isabel Ribeiro. Convergence of Q-learning with linear function approximation. In 2007 European Control Conference (ECC), pages 2671–2678. IEEE, 2007.
  26. [26] Sean P. Meyn. The projected Bellman equation in reinforcement learning. IEEE Transactions on Automatic Control, 69(12):8323–8337, 2024. doi: 10.1109/TAC.2024.3409647.
  27. [27] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  28. [28] Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In Conference on Learning Theory, pages 3185–3205, 2020.
  29. [29] Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius. Indag. Math., 22(4):379–381, 1960.
  30. [30] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  31. [31] Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pages 1064–1070, 1998.
  32. [32] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
  33. [33] John N. Tsitsiklis and Vincent D. Blondel. The Lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate. Mathematics of Control, Signals and Systems, 10(1):31–40, 1997.
  34. [34] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
  35. [35] Shangtong Zhang, Hengshuai Yao, and Shimon Whiteson. Breaking the deadly triad with a target network. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12621–12631. PMLR, 2021.
