Lyapunov-Certified Direct Switching Theory for Q-Learning
Pith reviewed 2026-05-10 02:18 UTC · model grok-4.3
The pith
Q-learning error dynamics reduce to a switched linear system whose worst-case exponential rate is the joint spectral radius of the switching matrices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that standard Q-learning admits a direct stochastic switching-system representation in which the conditional-mean drift is a switched linear system, and that the leading exponential convergence rate is given exactly by the joint spectral radius (JSR) of that switching family. Finite-time bounds follow from a product-defined JSR-induced Lyapunov function that applies to the full family whenever the JSR is less than one; an optional common quadratic Lyapunov certificate, when feasible, yields simpler bounds via a single matrix inequality. The framework also extends to Markovian observation models.
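The JSR is defined through limits of matrix-product norms, so a brute-force two-sided estimate is easy to sketch. The illustration below is not from the paper; the diagonal 2×2 matrix family is hypothetical. The estimator enumerates all switching words up to a fixed length, using spectral radii of products as lower bounds and induced norms as upper bounds (both converge to the JSR as the word length grows).

```python
import itertools
import numpy as np

def jsr_estimate(matrices, max_len=6):
    """Crude two-sided estimate of the joint spectral radius.

    For words w of length k, max_w rho(A_w)^(1/k) lower-bounds the JSR
    and max_w ||A_w||_2^(1/k) upper-bounds it.
    """
    lo, hi = 0.0, np.inf
    for k in range(1, max_len + 1):
        best_rho, best_norm = 0.0, 0.0
        for word in itertools.product(matrices, repeat=k):
            prod = np.linalg.multi_dot(word) if k > 1 else word[0]
            best_rho = max(best_rho, float(np.max(np.abs(np.linalg.eigvals(prod)))))
            best_norm = max(best_norm, float(np.linalg.norm(prod, 2)))
        lo = max(lo, best_rho ** (1.0 / k))
        hi = min(hi, best_norm ** (1.0 / k))
    return lo, hi

# Hypothetical commuting family: here the JSR equals the largest
# spectral radius, 0.9, so the two bounds should pinch together.
A1 = 0.9 * np.diag([1.0, 0.5])
A2 = 0.9 * np.diag([0.5, 1.0])
lo, hi = jsr_estimate([A1, A2], max_len=6)
```

Enumeration is exponential in the word length, which is why the abstract's appeal to specialized approximation methods (e.g. reference [2]) matters for anything beyond toy families.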
What carries the argument
The direct switching-system representation of the Q-learning error, which encodes the Bellman maximization as an exact convex combination of action-wise errors under a stochastic policy and produces a switched linear drift with martingale noise.
If this is right
- The JSR supplies the exact worst-case exponential rate for the switched linear drift in this Q-learning representation.
- Finite-time bounds follow directly from the product-defined JSR-induced Lyapunov function whenever the JSR is below one.
- When a common quadratic Lyapunov certificate exists, it replaces product-based verification by a single matrix inequality and yields a simpler stochastic bound.
- The same switching representation and rate analysis carry over to Markovian observation models.
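As a concrete reading of the quadratic-certificate bullet: the certificate amounts to a single positive definite P with A_iᵀPA_i ⪯ ρ²P for every switching matrix, which certifies JSR ≤ ρ but can be infeasible even when the JSR is below ρ. A minimal numeric feasibility check, with a hypothetical matrix family and the candidate P = I (none taken from the paper):

```python
import numpy as np

def quadratic_certificate_holds(matrices, P, rho):
    """Check the common quadratic Lyapunov condition
    A_i^T P A_i - rho^2 P <= 0 (negative semidefinite) for every A_i,
    given a symmetric positive definite P. Sufficient only: failure
    does not disprove JSR < rho.
    """
    assert np.all(np.linalg.eigvalsh((P + P.T) / 2) > 0), "P must be positive definite"
    for A in matrices:
        M = A.T @ P @ A - rho**2 * P
        # symmetrize before the eigenvalue test, with a small tolerance
        if np.max(np.linalg.eigvalsh((M + M.T) / 2)) > 1e-10:
            return False
    return True

# Hypothetical family admitting the obvious certificate P = I at rho = 0.95
A1 = np.array([[0.8, 0.0], [0.1, 0.7]])
A2 = np.array([[0.7, 0.1], [0.0, 0.8]])
ok = quadratic_certificate_holds([A1, A2], P=np.eye(2), rho=0.95)
```

In practice P would be found by an LMI solver rather than guessed; this sketch only verifies a given candidate, which is the cheap direction of the certificate.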
Where Pith is reading between the lines
- If the JSR can be bounded or approximated for concrete MDPs, the bounds could be used to select exploration policies that keep the radius small and thereby speed up learning.
- The same direct-switching construction may extend to other value-iteration methods that rely on maximization, such as SARSA or actor-critic variants.
- Practical computation of the JSR via existing numerical tools would turn the theoretical rate into a verifiable certificate for moderate-sized problems.
Load-bearing premise
The Bellman maximization error can be expressed exactly as an average of action-wise Q-errors under a suitable stochastic policy.
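This premise can be probed numerically. The sketch below uses the deterministic greedy policy for the perturbed values Q* + e as a special case of the "suitable stochastic policy"; the vectors are hypothetical. It confirms the exact decomposition max(Q* + e) − max Q* = E_π[e] + (E_π[Q*] − max Q*) and shows the bias term is nonpositive, vanishing exactly when the perturbed and unperturbed argmax actions coincide.

```python
import numpy as np

def decompose_max_error(q_star, e):
    """Decompose max(Q* + e) - max(Q*) under the policy pi that is
    greedy for the perturbed values Q* + e:
        max(Q* + e) - max Q* = E_pi[e] + (E_pi[Q*] - max Q*).
    The second (bias) term is nonpositive and vanishes exactly when
    pi's action is also optimal for Q* alone.
    """
    a = int(np.argmax(q_star + e))       # greedy action under Q* + e
    lhs = np.max(q_star + e) - np.max(q_star)
    avg_error = e[a]                     # E_pi[e] for this deterministic pi
    bias = q_star[a] - np.max(q_star)    # E_pi[Q*] - max Q*  (<= 0)
    return lhs, avg_error, bias

# Hypothetical action values; the error vector flips the argmax,
# so the bias term is strictly negative.
lhs, avg, bias = decompose_max_error(np.array([1.0, 0.5]), np.array([0.0, 0.7]))
```

Here lhs = 0.2 while the averaged error alone is 0.7; the bias of −0.5 closes the gap, which is exactly the piecewise-affine correction the referee report flags below.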
What would settle it
Compute the joint spectral radius for the switching matrices arising from a simple finite MDP; if the radius is less than one yet repeated runs of standard Q-learning on that MDP fail to exhibit the predicted exponential decay rate, the representation or rate claim is refuted.
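A hedged sketch of the simulation half of that test, with a hypothetical 2-state, 2-action MDP, constant step size, and fixed seed (none taken from the paper); the JSR half would be supplied by the paper's construction of the switching matrices for the same MDP.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.8, 0.1

# Hypothetical MDP: transition probabilities P[s, a, s'] and rewards R[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 0.2]])

# Ground-truth Q* by value iteration on the known model
Q_star = np.zeros((2, 2))
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Synchronous sampled Q-learning: one sampled next state per (s, a) per sweep
Q = np.zeros((2, 2))
errors = []
for _ in range(5000):
    for s in range(2):
        for a in range(2):
            s_next = rng.choice(2, p=P[s, a])
            Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    errors.append(np.abs(Q - Q_star).max())
# errors[k] traces ||Q_k - Q*||_inf; comparing its early decay slope
# against the predicted JSR-based rate is the refutation test.
```

With a constant step size the error decays only down to a noise floor of order α, so the slope comparison must be restricted to the transient phase (or rerun with diminishing steps).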
original abstract
Q-learning is a fundamental algorithmic primitive in reinforcement learning. This paper develops a new framework for analyzing Q-learning from a switching-system viewpoint. In particular, we derive a direct stochastic switching-system representation of the Q-learning error. The key observation is that the Bellman maximization error can be expressed exactly as an average of action-wise Q-errors under a suitable stochastic policy. The resulting recursion has a switched linear conditional-mean drift and martingale-difference noise. To the best of our knowledge, this is the first convergence-rate analysis of standard Q-learning whose leading exponential rate is expressed through the joint spectral radius (JSR) of a direct switching family. Since the JSR is the exact worst-case exponential rate of the associated switched linear drift, the resulting rate is among the tightest drift-based rates that can be certified for this Q-learning representation. Building on this representation, we prove finite-time bounds based on a product-defined JSR-induced Lyapunov function and also give an optional common quadratic Lyapunov certificate. The quadratic certificate is only a sufficient condition and hence applies only to instances for which the certificate is feasible, whereas the JSR-induced Lyapunov construction applies to the full direct switching family whenever its JSR is below one. When feasible, the quadratic certificate replaces product-based verification by a computable matrix inequality and gives a simpler stochastic bound. We further extend the framework to Markovian observation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a switching-system framework for analyzing the convergence of standard Q-learning. It claims that the Q-learning error recursion admits a direct stochastic switched-linear representation whose conditional-mean drift is governed by a family of matrices whose joint spectral radius (JSR) supplies the leading exponential rate; finite-time bounds are then obtained from a product-defined JSR-induced Lyapunov function (with an optional common quadratic certificate when feasible). The framework is extended to Markovian observation models.
Significance. If the central representation holds exactly, the work would supply the first JSR-based rate for vanilla Q-learning and a Lyapunov certificate that is tight for the switched drift whenever the JSR is less than one. The explicit separation of the quadratic certificate (sufficient but computable) from the JSR construction (necessary and sufficient for the family) is a useful distinction. Reproducible code or machine-checked proofs are not mentioned.
major comments (2)
- [Abstract / §3 (key identity)] The claim that the Bellman maximization error equals exactly the average of action-wise Q-errors under a suitable stochastic policy π is load-bearing for the entire switched-linear argument. The identity max(Q* + e) − max Q* = E_π[e] + (E_π[Q*] − max Q*) contains a non-positive bias term E_π[Q*] − max Q* that vanishes only when π is supported on an optimal action for Q*. When the current greedy action differs from an optimal action, the bias is strictly negative and state-dependent, rendering the conditional drift piecewise affine rather than linear in e. Consequently the recursion cannot be written as a switched linear system driven by a fixed matrix family plus martingale-difference noise, and the JSR of any fixed family does not govern the exponential rate.
- [§4, §5] The finite-time bounds (§4) and the quadratic certificate (§5) both presuppose the switched-linear drift derived in §3. If the bias term identified above is present, the product-based bounds and the LMI feasibility condition no longer apply to the actual Q-learning process; the manuscript should either derive the exact (biased) recursion or restrict the claim to the special case in which the bias vanishes identically.
minor comments (2)
- [§3] Notation for the stochastic policy π that realizes the averaging should be introduced explicitly when the key identity is stated, together with the precise support condition that makes the bias zero.
- [§6] The extension to Markovian observations (§6) inherits the same representation; any correction to the core identity must be propagated there as well.
Simulated Author's Rebuttal
We thank the referee for the careful and insightful review. The comments correctly identify a subtlety in the key identity used to obtain the switched-linear representation. We address each major comment below and outline the revisions we will make.
point-by-point responses
-
Referee: Abstract / §3 (key identity): the claim that the Bellman maximization error equals exactly the average of action-wise Q-errors under a suitable stochastic policy π is load-bearing for the entire switched-linear claim. The identity max(Q* + e) − max Q* = E_π[e] + (E_π[Q*] − max Q*) contains a non-positive bias term E_π[Q*] − max Q* that vanishes only when π is supported on an optimal action for Q*. When the current greedy action differs from an optimal action, the bias is strictly negative and state-dependent, rendering the conditional drift piecewise affine rather than linear in e. Consequently the recursion cannot be written as a switched linear system driven by a fixed matrix family plus martingale-difference noise, and the JSR of any fixed family does not govern the exponential rate.
Authors: We thank the referee for this precise observation. The identity indeed includes the bias term (E_π[Q*] − max Q*) whenever π is supported on the argmax of the perturbed values Q* + e but not on the argmax of Q* alone. This renders the conditional drift piecewise affine. In the manuscript the “suitable stochastic policy” is chosen to realize the max operator exactly, so the representation is exact only when the argmax actions coincide. We will revise §3 to state the identity with the bias term made explicit, derive the resulting recursion, and show that the bias is a state-dependent but bounded perturbation whose effect on the leading exponential rate can be absorbed into the JSR analysis (the JSR of the linear part remains an upper bound on the contraction rate). The abstract will be updated to reflect the precise statement. This preserves the core JSR-based rate while correcting the claim of exact linearity. revision: partial
-
Referee: §4, §5: the JSR-induced Lyapunov function and the matrix-inequality certificate both presuppose the switched-linear drift derived in §3. If the bias term identified above is present, the product-based bounds and the LMI feasibility condition no longer apply to the actual Q-learning process; the manuscript should either derive the exact (biased) recursion or restrict the claim to the special case in which the bias vanishes identically.
Authors: We agree that the finite-time bounds and quadratic certificate in §§4–5 rest on the form of the drift. With the revised recursion in §3 that includes the bias, we will update the proofs in §4 to treat the bias as an additive perturbation whose magnitude is controlled by the current error norm; the product-defined JSR-induced Lyapunov function remains valid and yields finite-time bounds that differ from the original only by an additive term that vanishes exponentially. For the common quadratic certificate in §5 we will either restrict its applicability to the bias-free case or supply a modified LMI that accounts for the worst-case bias. These changes will be incorporated in the revised manuscript. revision: yes
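The structure of the promised revised bound can be sanity-checked on the scalar recursion itself: iterating a_{k+1} = β²a_k + c (constants here are hypothetical, standing in for the step-size and noise terms) decays geometrically at rate β² toward the additive floor c/(1 − β²), matching the "additive term" description in the response. A toy numeric check:

```python
# Iterate the perturbed scalar recursion a_{k+1} = beta^2 * a_k + c.
# Unrolling gives a_k = beta^(2k) * a_0 + c * (1 - beta^(2k)) / (1 - beta^2),
# so a_k converges geometrically to the floor c / (1 - beta^2).
beta, c, a = 0.9, 0.01, 10.0
for _ in range(2000):
    a = beta**2 * a + c
floor = c / (1 - beta**2)  # asymptotic value of the bound
```

The floor scales with c, so in the authors' setting it would shrink with the squared step size, consistent with the constant-step noise floor familiar from stochastic approximation.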
Circularity Check
No circularity: derivation applies JSR to explicitly constructed switched representation
full rationale
The paper constructs switching matrices directly from the Q-learning update rule and the claimed exact averaging of the Bellman error under a stochastic policy, then invokes the mathematical definition of the JSR as the worst-case growth rate of the resulting switched linear system. This is a standard application of an external tool to a derived model rather than a reduction of the claimed rate to itself by construction. No parameters are fitted to data and relabeled as predictions, no self-citations justify core uniqueness or ansatzes, and the central bound follows from the representation's validity (argued via the key observation) without tautological redefinition. The analysis remains anchored to an external benchmark, namely the standard definition of the JSR.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The MDP has finite state and action spaces and the Q-learning update uses a suitable stochastic policy that makes the Bellman error an exact average of action-wise errors.
Forward citations
Cited by 2 Pith papers
-
Switching-Geometry Analysis of Deflated Q-Value Iteration
Deflated Q-VI is algebraically equivalent to recentering standard Q-VI, yet its error dynamics are governed by the joint spectral radius of a projected switching system that can be strictly smaller than the discount factor γ.
-
A Switching System Theory of Q-Learning with Linear Function Approximation
Q-learning with linear function approximation is recast as a switched linear system whose mean dynamics converge precisely when the joint spectral radius of the switching matrices is less than one.
Reference graph
Works this paper leans on
- [1] Carolyn L. Beck and Rayadurgam Srikant. Error bounds for constant step-size Q-learning. Systems & Control Letters, 61(12):1203–1208, 2012.
- [2] Vincent D. Blondel and Yurii Nesterov. Computationally efficient approximations of the joint spectral radius. SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005.
- [3] Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
- [4] Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear Matrix Inequalities in System and Control Theory, volume 15 of Studies in Applied Mathematics. SIAM, Philadelphia, PA, 1994.
- [5] Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. arXiv preprint arXiv:2102.01567, 2021.
- [6] Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
- [7] Vineet Goyal and Julien Grand-Clement. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2023.
- [8] Jianghai Hu, Jinglai Shen, and Wei Zhang. Generating functions of switched linear systems: analysis, computation, and stability applications. IEEE Transactions on Automatic Control, 56(5):1059–1074, 2010.
- [9] Jianghai Hu, Jinglai Shen, and Donghwan Lee. Resilient stabilization of switched linear control systems against adversarial switching. IEEE Transactions on Automatic Control, 62(8):3820–3834, 2017.
- [10] Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. Advances in Neural Information Processing Systems, 6, 1993.
- [11] Raphaël Jungers. The Joint Spectral Radius: Theory and Applications, volume 385. Springer Science & Business Media, 2009.
- [12] Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, 11, 1998.
- [13] Harold Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.
- [14] Donghwan Lee. Final iteration convergence bound of Q-learning: Switching system approach. IEEE Transactions on Automatic Control, 69(7):4765–4772, 2024.
- [15] Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In 34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020.
- [16] Donghwan Lee, Jianghai Hu, and Niao He. A discrete-time switching system analysis of Q-learning. SIAM Journal on Control and Optimization, 61(3):1861–1880, 2023.
- [17] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Transactions on Information Theory, 68(1):448–473, 2021.
- [18] Daniel Liberzon. Switching in Systems and Control. Springer Science & Business Media, 2003.
- [19] Han-Dong Lim and Donghwan Lee. Finite-time analysis of asynchronous Q-learning under diminishing step-size from control-theoretic view. IEEE Access, 12:149916–149939, 2024.
- [20] Hai Lin and Panos J. Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results. IEEE Transactions on Automatic Control, 54(2):308–322, 2009.
- [21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- [22] Guannan Qu and Adam Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In Conference on Learning Theory, pages 3185–3205, 2020.
- [23] Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius. Indag. Math., 22(4):379–381, 1960.
- [24] Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulff, and Christopher King. Stability criteria for switched and hybrid systems. SIAM Review, 49(4):545–592, 2007.
- [25] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
- [26] Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pages 1064–1070, 1998.
- [27] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
- [28] John N. Tsitsiklis and Vincent D. Blondel. The Lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate. Mathematics of Control, Signals and Systems, 10(1):31–40, 1997.
- [29] Martin J. Wainwright. Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019.
- [30] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
discussion (0)