pith. machine review for the scientific record.

arxiv: 2605.11021 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

A Switching System Theory of Q-Learning with Linear Function Approximation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords q-learning · linear function approximation · switched systems · joint spectral radius · mean dynamics · convergence analysis · reinforcement learning

The pith

The mean dynamics of Q-learning with linear function approximation are exactly equivalent to a linear switched system whose stability determines convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the average behavior of Q-learning with linear function approximation can be rewritten as an exact linear dynamical system that switches among a finite set of modes. Convergence of the parameter iterates is then equivalent to stability of that switched system, certified by the joint spectral radius of the mode set being less than one. The same modeling step applies to stochastic Q-learning driven by i.i.d. or Markovian samples and to regularized versions. A reader would care because the construction supplies a single parameter-space object that simultaneously captures projected Bellman equations, policy-induced switching, and stability criteria.

Core claim

We derive an exact linear switched model for the mean dynamics of Q-learning with linear function approximation. Convergence of the iterates is related to the stability of the corresponding switched system, characterized by the joint spectral radius of the switching modes. The construction extends to stochastic Q-learning with i.i.d. and Markovian data, as well as to regularized Q-learning.

What carries the argument

The finite set of linear switching modes that exactly represent the mean update, with stability certified by their joint spectral radius.
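This construction can be sketched concretely. The following is a minimal sketch under assumptions of our own choosing: a toy 2-state, 2-action MDP with tabular features and a uniform sampling distribution, none of which come from the paper. Each deterministic greedy policy induces one affine mode, and the mean iterate evolves under whichever mode the current parameter selects.

```python
import numpy as np
from itertools import product

# Toy MDP (assumed example, not from the paper).
gamma, alpha, nS, nA = 0.9, 0.1, 2, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s'] transition kernel
r = rng.standard_normal((nS, nA))                # expected rewards
Phi = np.eye(nS * nA)                            # tabular features (exact representation)
D = np.eye(nS * nA) / (nS * nA)                  # uniform state-action distribution

def mode(policy):
    """Affine mode (A_pi, b) of the mean dynamics for one deterministic policy."""
    PPi = np.zeros((nS * nA, nS * nA))           # row (s,a): probability of (s', pi(s'))
    for s, a, s2 in product(range(nS), range(nA), range(nS)):
        PPi[s * nA + a, s2 * nA + policy[s2]] = P[s, a, s2]
    A = np.eye(nS * nA) + alpha * Phi.T @ D @ (gamma * PPi @ Phi - Phi)
    b = alpha * Phi.T @ D @ r.reshape(-1)
    return A, b

# Finite mode set: one affine map per deterministic policy.
modes = {pi: mode(pi) for pi in product(range(nA), repeat=nS)}

# Mean dynamics as a switched system: the greedy policy of theta_t selects the mode.
theta = rng.standard_normal(nS * nA)
for _ in range(5000):
    pi = tuple(int(a) for a in (Phi @ theta).reshape(nS, nA).argmax(axis=1))
    A, b = modes[pi]
    theta = A @ theta + b
```

With tabular features the fixed point of this switched recursion is the optimal Q-table, since the greedy-mode selection reproduces the Bellman max operator step by step.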

If this is right

  • Convergence follows whenever the joint spectral radius of the mode set is strictly less than one.
  • Certificates based on finite products of modes can be less conservative than one-step norm bounds.
  • The same switched-system stability criterion applies to both i.i.d. and Markovian observation models.
  • Regularized Q-learning with linear function approximation inherits the identical stability characterization.
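The finite-product certificate in the second bullet can be illustrated with a hypothetical pair of 2x2 modes of our own construction (not from the paper): each one-step norm exceeds one, so the crude bound fails, yet products of length eight certify that the joint spectral radius is below one.

```python
import numpy as np
from itertools import product

# Two hypothetical switching modes: one-step spectral norms exceed 1,
# but the joint spectral radius is 0.5 (shared diagonal dominates long products).
M1 = np.array([[0.5, 2.0], [0.0, 0.5]])
M2 = np.array([[0.5, 0.0], [0.0, 0.5]])
modes = [M1, M2]

def jsr_upper_bound(mats, k):
    """Brute-force JSR upper bound: max over all length-k products of ||P||^(1/k)."""
    best = 0.0
    for seq in product(mats, repeat=k):
        prod_mat = np.linalg.multi_dot(seq) if k > 1 else seq[0]
        best = max(best, np.linalg.norm(prod_mat, 2) ** (1.0 / k))
    return best

for k in (1, 2, 4, 8):
    print(k, jsr_upper_bound(modes, k))   # bound tightens as k grows
```

The length-1 bound is roughly 2.1 here, while the length-8 bound drops below one, which is the sense in which finite products can be strictly less conservative than one-step norm bounds.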

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the switched model is faithful, standard tools from switched-system theory could be repurposed to select function approximators that guarantee a joint spectral radius below one.
  • The explicit link to finite-difference stochastic-policy switching suggests that abrupt policy changes during learning appear as discrete mode jumps in parameter space.
  • The parameter-space formulation may extend to other temporal-difference algorithms whose mean updates can be expressed as convex combinations of linear operators.

Load-bearing premise

The mean dynamics of Q-learning with linear function approximation admit an exact representation as a finite collection of linear switching modes.

What would settle it

A numerical simulation in which the Q-learning parameter trajectory deviates from the trajectory generated by the switched linear model under identical initial conditions, step-size schedule, and sample sequence.
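In the i.i.d. mean setting the proposed test can be sketched directly: evolve the iterate once with the max-form mean update and once with the greedy-mode form, under identical conditions, and measure the gap. The toy MDP below is an assumption of ours, not the paper's example; in this setting the two trajectories coincide exactly, so any observed deviation would falsify the representation.

```python
import numpy as np

# Toy 2-state, 2-action MDP with tabular parameters (assumed example).
gamma, alpha, nS, nA = 0.9, 0.1, 2, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
r = rng.standard_normal((nS, nA))

def mean_update(q):
    """Exact mean Q-learning update, written with the max (Bellman) operator."""
    target = r + gamma * np.einsum('sap,p->sa', P, q.max(axis=1))
    return q + alpha * (target - q) / (nS * nA)   # uniform sampling distribution

def switched_update(q):
    """Same update via the greedy-policy mode: max replaced by the active policy."""
    pi = q.argmax(axis=1)                          # switching signal sigma(theta_t)
    target = r + gamma * np.einsum('sap,p->sa', P, q[np.arange(nS), pi])
    return q + alpha * (target - q) / (nS * nA)

q1 = q2 = rng.standard_normal((nS, nA))
gap = 0.0
for _ in range(500):
    q1, q2 = mean_update(q1), switched_update(q2)
    gap = max(gap, float(np.abs(q1 - q2).max()))
print("max trajectory deviation:", gap)   # 0.0: the switched model is exact here
```

The deviation is zero to the bit, because the greedy-selected entry is the max by definition; the open question raised later in the referee report is whether the same identity survives Markovian sampling, where the state distribution depends on the parameter history.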

Figures

Figures reproduced from arXiv: 2605.11021 by Donghwan Lee, Han-Dong Lim.

Figure 1. Truncated Theorem 1 norm ball for the MDP in … (figures/full_fig_p011_1.png)
Figure 2. Baird’s seven-star Q-learning example. Solid arro… (figures/full_fig_p034_2.png)
Original abstract

This paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). We derive an exact linear switched model for the mean dynamics and relate convergence to stability of the corresponding switched system. The same construction is then used for stochastic linear Q-learning with independent and identically distributed (i.i.d.) observations and with Markovian observations. Although exact JSR computation is difficult in general, the certificate captures products of switching modes and can be less conservative than one-step norm bounds. The framework also yields a JSR-based view of regularized Q-learning with LFA. The resulting analysis connects projected Bellman equations, finite-difference stochastic-policy switching, and switched-system stability in a single parameter-space formulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). It derives an exact linear switched model for the mean dynamics of the iterates and relates convergence to stability of the switched system. The construction is applied to stochastic linear Q-learning under both i.i.d. and Markovian observations, and is extended to a JSR-based view of regularized Q-learning with LFA, connecting projected Bellman equations, finite-difference policy switching, and switched-system stability in parameter space.

Significance. If the exact switched-linear mean dynamics hold, the framework supplies a parameter-space formulation that unifies several strands of RL theory and can yield less conservative stability certificates than one-step norm bounds by considering products of switching modes. The explicit link to JSR theory is a strength when the modes are few and the JSR is computable, offering a concrete route to convergence guarantees for approximate Q-learning.

major comments (1)
  1. [Markovian observations derivation] The central claim of an exact finite switched-linear representation for the mean dynamics under Markovian observations (abstract and the corresponding derivation section) requires that the conditional expectation E[TD error | theta_t] equals one of a finite collection of affine maps whose active mode depends only on the current theta_t. Because the underlying state distribution is generated by the Markov chain driven by the entire history of theta values, the effective linear map at step t is not in general a function of theta_t alone; this renders the switched system approximate rather than exact and undermines the direct use of its JSR as a convergence certificate for the true mean-field trajectory.
minor comments (1)
  1. [JSR computation paragraph] The paper notes that exact JSR computation is difficult in general; a brief discussion of practical upper-bound methods (e.g., via semidefinite programming or finite products) would help readers assess the tightness of the resulting certificates.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The observation regarding the Markovian case identifies a genuine subtlety in the exactness of the switched-linear representation. We address this point directly below and outline the corresponding revisions.

point-by-point responses
  1. Referee: The central claim of an exact finite switched-linear representation for the mean dynamics under Markovian observations (abstract and the corresponding derivation section) requires that the conditional expectation E[TD error | theta_t] equals one of a finite collection of affine maps whose active mode depends only on the current theta_t. Because the underlying state distribution is generated by the Markov chain driven by the entire history of theta values, the effective linear map at step t is not in general a function of theta_t alone; this renders the switched system approximate rather than exact and undermines the direct use of its JSR as a convergence certificate for the true mean-field trajectory.

    Authors: We agree with the referee that the conditional expectation E[TD error | theta_t] is not strictly a function of theta_t alone under Markovian observations, because the state distribution at time t depends on the full history of parameter values through the policy-induced chain. The manuscript's derivation conditions on theta_t while treating the observation distribution as determined by the current parameter; this step holds exactly only when observations are i.i.d. For the Markovian setting the resulting switched-linear model is therefore an approximation whose fidelity improves when parameter updates are slow relative to the mixing time of the underlying chain. We will revise the abstract, Section 4, and the discussion of convergence certificates to qualify the Markovian claim accordingly, stating that the JSR provides a useful but non-rigorous indicator rather than an exact certificate for the true mean-field trajectory. A brief remark on sufficient conditions for exactness (e.g., fixed theta or sufficiently slow variation) will also be added. revision: partial

Circularity Check

0 steps flagged

No circularity: switched model derived directly from update rule; JSR stability is external

full rationale

The paper constructs the linear switched representation by taking the conditional expectation of the Q-learning TD update under linear function approximation, yielding a finite collection of affine maps whose selection depends on the current parameter vector. This algebraic step follows immediately from the definition of the mean-field dynamics and does not presuppose the joint spectral radius result. Convergence is then related to stability of the switched system via the standard JSR criterion, which is imported as an independent mathematical tool rather than being fitted or redefined inside the paper. The same construction is applied to both i.i.d. and Markovian sampling without any self-referential closure or reduction of the claimed exactness to a fitted quantity. No load-bearing premise relies on self-citation chains or ansatzes smuggled from prior work by the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard properties of the joint spectral radius for discrete-time switched linear systems and on the standard projected Bellman operator for linear function approximation; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • standard math The joint spectral radius determines stability of a discrete-time linear switched system
    Invoked to equate convergence of the mean dynamics with stability of the switched system.
  • domain assumption The mean dynamics of Q-learning LFA can be written exactly as a finite collection of linear modes
    Central modeling step stated in the abstract.
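For reference, the standard object behind the first axiom, as in Rota and Strang [29] and Jungers [12] (textbook material, not paper-specific): for a finite mode set the joint spectral radius is

```latex
% Joint spectral radius of a finite mode set \mathcal{A} = \{A_1, \dots, A_M\}:
\rho(\mathcal{A}) \;=\; \lim_{k \to \infty}\;
  \max_{i_1, \dots, i_k \in \{1, \dots, M\}}
  \bigl\| A_{i_1} A_{i_2} \cdots A_{i_k} \bigr\|^{1/k}
% The switched system x_{t+1} = A_{\sigma(t)} x_t is asymptotically stable
% under arbitrary switching if and only if \rho(\mathcal{A}) < 1.
```

The limit is independent of the matrix norm, which is why finite products of modes, raised to the power 1/k, give valid and progressively tighter upper bounds.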

pith-pipeline@v0.9.0 · 5419 in / 1305 out tokens · 58022 ms · 2026-05-13T06:10:24.513147+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

  1. [1] Carolyn L. Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12):1203–1208, 2012.
  2. [2] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, 4th edition, volume II. Athena Scientific, 2015.
  3. [3] Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
  4. [4] Diogo Carvalho, Francisco S. Melo, and Pedro Santos. A new convergent variant of Q-learning with linear function approximation. In Advances in Neural Information Processing Systems, volume 33, pages 19412–19421, 2020.
  5. [5] Fengdi Che, Chenjun Xiao, Jincheng Mei, Bo Dai, Ramki Gummadi, Oscar A. Ramirez, Christopher K. Harris, A. Rupam Mahmood, and Dale Schuurmans. Target networks and over-parameterization stabilize off-policy bootstrapping with function approximation. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research.
  6. [6] Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. arXiv preprint arXiv:2102.01567, 2021.
  7. [7] Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022. doi: 10.1016/j.automatica.2022.110623.
  8. [8] Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in Q-learning. SIAM Journal on Mathematics of Data Science, 5(4):1078–1101, 2023. doi: 10.1137/22M1499261.
  9. [9] Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
  10. [10] Vineet Goyal and Julien Grand-Clement. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2023.
  11. [11] Jianghai Hu, Jinglai Shen, and Wei Zhang. Generating functions of switched linear systems: analysis, computation, and stability applications. IEEE Transactions on Automatic Control, 56(5):1059–1074, 2010.
  12. [12] Raphaël Jungers. The Joint Spectral Radius: Theory and Applications, volume 385. Springer Science & Business Media, 2009.
  13. [13] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, 11, 1998.
  14. [14] Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach. IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024.
  15. [15] Donghwan Lee. Lyapunov-certified direct switching theory for Q-learning. arXiv preprint arXiv:2604.19569, 2026.
  16. [16] Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In 34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020.
  17. [17] Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning. SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023.
  18. [18] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Transactions on Information Theory, 68(1):448–473, 2021.
  19. [19] Daniel Liberzon. Switching in Systems and Control. Springer Science & Business Media, 2003.
  20. [20] Han-Dong Lim and Donghwan Lee. Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view. IEEE Access, 12:149916–149939, 2024.
  21. [21] Han-Dong Lim and Donghwan Lee. Regularized Q-learning. In Advances in Neural Information Processing Systems, volume 37, pages 129855–129887, 2024.
  22. [22] Han-Dong Lim and Donghwan Lee. Understanding the theoretical properties of projected Bellman equation, linear Q-learning, and approximate value iteration. arXiv preprint arXiv:2504.10865, 2025.
  23. [23] Hai Lin and Panos J. Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results. IEEE Transactions on Automatic Control, 54(2):308–322, 2009.
  24. [24] Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Linear Q-learning does not diverge: Convergence rates to a bounded set. arXiv preprint arXiv:2501.19254, 2025.
  25. [25] Francisco S. Melo and M. Isabel Ribeiro. Convergence of Q-learning with linear function approximation. In 2007 European Control Conference (ECC), pages 2671–2678. IEEE, 2007.
  26. [26] Sean P. Meyn. The projected Bellman equation in reinforcement learning. IEEE Transactions on Automatic Control, 69(12):8323–8337, 2024. doi: 10.1109/TAC.2024.3409647.
  27. [27] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  28. [28] Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In Conference on Learning Theory, pages 3185–3205, 2020.
  29. [29] Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius. Indag. Math., 22(4):379–381, 1960.
  30. [30] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  31. [31] Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pages 1064–1070, 1998.
  32. [32] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
  33. [33] John N. Tsitsiklis and Vincent D. Blondel. The Lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate. Mathematics of Control, Signals and Systems, 10(1):31–40, 1997.
  34. [34] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
  35. [35] Shangtong Zhang, Hengshuai Yao, and Shimon Whiteson. Breaking the deadly triad with a target network. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12621–12631. PMLR, 2021.
