arxiv: 2605.10811 · v1 · submitted 2026-05-11 · 🧮 math.OC · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Switching-Geometry Analysis of Deflated Q-Value Iteration

Donghwan Lee

Pith reviewed 2026-05-12 03:14 UTC · model grok-4.3

classification 🧮 math.OC cs.AI

keywords joint spectral radiusdeflated Q-value iterationMarkov decision processesswitching systemsconvergence analysisquotient spacepolicy optimization

0 comments

The pith

Deflated Q-value iteration obtains a convergence rate from a projected joint spectral radius that can be strictly smaller than the discount factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper interprets rank-one deflated Q-value iteration as a switching system whose error dynamics are analyzed via the joint spectral radius. It first shows that the standard Q-VI model always has JSR exactly equal to the discount factor γ because every admissible subsystem leaves the all-ones vector invariant. Passing to the quotient space that removes this direction produces a projected switching model whose JSR governs the relevant contraction and may be smaller than γ. The deflation step is shown to be equivalent to a scalar recentering of the standard iterates, so the sequence of greedy policies is identical to that of ordinary Q-VI started from the same point. A reader would care because the tighter geometric description certifies faster convergence without changing the decision-making behavior of the algorithm.

Core claim

The standard Q-VI switching system model has JSR exactly γ since all admissible subsystems share the all-ones vector as an invariant direction. By passing to the quotient space that removes this direction, we obtain a projected switching system model whose JSR governs the relevant error dynamics and may be strictly smaller than γ. The correction is equivalent to a scalar recentering of standard Q-VI. Hence, the projected trajectory, and therefore the greedy-policy sequence, is unchanged relative to standard Q-VI initialized from the same point. The benefit of deflation is not a change in the induced decision-making problem, but a more precise JSR-based description of the convergence geometry

What carries the argument

The projected switching system model obtained by quotienting out the all-ones invariant direction from the standard Q-VI switching model, whose joint spectral radius governs the deflated error dynamics.

If this is right

The convergence rate of deflated Q-VI is governed by the JSR of the projected system rather than by γ.
This projected JSR can be strictly smaller than γ for some MDPs, yielding a sharper rate certificate.
The sequence of greedy policies generated by deflated Q-VI is identical to that of standard Q-VI from the same initialization.
Deflation functions purely as a geometric recentering that does not alter the underlying decision-making problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same quotient-space reduction could be applied to other iterative schemes in reinforcement learning that possess redundant invariant directions.
Explicit computation of the projected JSR for given MDPs might serve as a diagnostic for convergence speed in practice.
The framework suggests that analogous geometric projections could tighten rate analyses for undiscounted or average-reward value iteration.

Load-bearing premise

Every admissible subsystem in the standard Q-VI switching model shares the all-ones vector as an invariant direction.

What would settle it

A concrete MDP together with its admissible policies for which at least one subsystem matrix fails to map the all-ones vector to itself; the projection would then be invalid and the claimed possibility of a strictly smaller JSR would not hold.

Figures

Figures reproduced from arXiv: 2605.10811 by Donghwan Lee.

**Figure 2.** Figure 2: Full Q-function error for standard Q-VI and deflated Q-VI with uniform and nonuniform [PITH_FULL_IMAGE:figures/full_fig_p018_2.png] view at source ↗

**Figure 3.** Figure 3: Convergence of the full Q-function error [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗

read the original abstract

This paper develops a joint spectral radius (JSR) framework for analyzing rank-one deflated Q-value iteration (Q-VI) in discounted Markov decision process control. Focusing on an all-ones residual correction, we interpret the resulting algorithm through the geometry of switching systems and, to the best of our knowledge, give the first JSR-based convergence analysis of deflated Q-VI for policy optimization problems. Our analysis reveals that the standard Q-VI switching system model has JSR exactly the discount factor $\gamma\in (0,1)$, since all admissible subsystems share the all-ones vector as an invariant direction. By passing to the quotient space that removes this direction, we obtain a projected switching system model whose JSR governs the relevant error dynamics and may be strictly smaller than $\gamma$. Therefore, the deflated Q-VI admits a potentially sharper convergence-rate characterization than the ambient-space $\gamma$-bound. Finally, we prove that the correction is equivalent to a scalar recentering of standard Q-VI. Hence, the projected trajectory, and therefore the greedy-policy sequence, is unchanged relative to standard Q-VI initialized from the same point. The benefit of deflation is not a change in the induced decision-making problem, but a more precise JSR-based description of the convergence geometry after the redundant all-ones component is removed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows deflated Q-VI is algebraically identical to recentering standard Q-VI and gives a projected JSR that can be strictly smaller than gamma for the policy-relevant error.

read the letter

The main thing to know is that this work uses the quotient space to strip out the all-ones invariant direction from the switching system that models Q-VI. Because every fixed-policy matrix leaves that vector invariant (from the row-stochastic transitions), the ambient JSR is exactly gamma, but the projected version that governs the error component driving policy changes can be smaller. They also show by direct algebra that the rank-one deflation is equivalent to subtracting a scalar from the Q-values, so the greedy policy sequence stays exactly the same as in ordinary Q-VI started from the same point. The benefit is therefore only in the analysis, not in the induced decisions.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a joint spectral radius (JSR) framework for rank-one deflated Q-value iteration in discounted MDPs. It shows that every admissible subsystem in the standard Q-VI switching model shares the all-ones vector as an invariant direction (by row-stochasticity of the embedded kernels), yielding ambient JSR exactly equal to the discount factor γ. Passing to the quotient space orthogonal to this direction produces a projected switching system whose JSR governs the relevant error dynamics and may be strictly smaller than γ. The deflation step is proved algebraically equivalent to a scalar recentering of standard Q-VI, so that the projected trajectory and the induced greedy-policy sequence remain identical to those of standard Q-VI started from the same initial point. The claimed benefit is therefore a more precise geometric description of convergence rather than any change in the decision-making problem.

Significance. If the central claims hold, the work supplies the first JSR-based convergence analysis of deflated Q-VI and a clean geometric explanation for why removing the redundant all-ones component can tighten the rate bound. The algebraic equivalence result is a strength, as is the observation that the invariance property holds by construction for any row-stochastic transition kernel. The contribution is primarily conceptual and analytic rather than algorithmic; it refines the understanding of existing methods without altering their output.

major comments (2)

[main theorem / projected-system section] The claim that the projected JSR 'may be strictly smaller than γ' (abstract and main theorem) is load-bearing for the asserted sharper characterization, yet the manuscript provides neither an explicit construction of the projected switching matrices nor a concrete numerical example in which the reduction is computed and verified. Without such evidence the potential improvement remains unquantified.
[switching-system model definition] The derivation that the ambient JSR equals γ rests on the joint spectral radius of the family of matrices that all fix the all-ones vector; the manuscript should supply the precise definition of the admissible set of matrices (or policies) and the explicit matrix representation of the switching system before the quotient projection is taken, so that the invariance and the subsequent reduction can be checked directly.

minor comments (2)

[quotient-space paragraph] Notation for the quotient-space projection operator and the reduced state space should be introduced once and used consistently; currently the transition between ambient and projected quantities is described only informally.
[equivalence proof] The statement that 'the projected trajectory, and therefore the greedy-policy sequence, is unchanged' would benefit from an explicit one-line reference to the algebraic identity that maps the deflated update onto the recentered standard update.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and valuable feedback on our manuscript. The suggestions to enhance the clarity of the switching system model and to provide concrete evidence for the potential improvement in the JSR bound are well-taken. We will incorporate revisions to address these points directly. Our point-by-point responses follow.

read point-by-point responses

Referee: [main theorem / projected-system section] The claim that the projected JSR 'may be strictly smaller than γ' (abstract and main theorem) is load-bearing for the asserted sharper characterization, yet the manuscript provides neither an explicit construction of the projected switching matrices nor a concrete numerical example in which the reduction is computed and verified. Without such evidence the potential improvement remains unquantified.

Authors: We concur that an explicit construction and numerical verification would strengthen the presentation of the main result. Accordingly, we will add to the revised manuscript an explicit description of the projected switching matrices obtained by quotienting out the all-ones direction. Furthermore, we will include a concrete numerical example based on a small MDP instance, where we explicitly compute the family of projected matrices, determine their joint spectral radius, and confirm that it is strictly less than the discount factor γ. This example will illustrate the potential for a sharper convergence characterization. revision: yes
Referee: [switching-system model definition] The derivation that the ambient JSR equals γ rests on the joint spectral radius of the family of matrices that all fix the all-ones vector; the manuscript should supply the precise definition of the admissible set of matrices (or policies) and the explicit matrix representation of the switching system before the quotient projection is taken, so that the invariance and the subsequent reduction can be checked directly.

Authors: We will expand the section on the switching-system model to provide the precise definition of the admissible set of matrices. Specifically, we will define the set as the collection of all possible Q-Bellman operators corresponding to deterministic stationary policies, expressed in matrix form. We will also present the explicit matrix representation of these operators prior to the projection step, emphasizing the property that each matrix is row-stochastic (up to the discount factor) and thus leaves the all-ones vector invariant. This addition will enable readers to directly verify the ambient JSR equaling γ and the validity of the subsequent reduction to the quotient space. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's core derivation begins from the standard MDP fact that each fixed-policy operator A_π satisfies A_π 1 = γ 1 because the transition kernel is row-stochastic; this is an external modeling assumption, not an internal definition. The claim that the ambient switching system's JSR equals γ follows immediately from the definition of joint spectral radius once all subsystems share the same eigenvector with eigenvalue γ. The subsequent quotient-space projection, the possibility that the projected JSR is strictly smaller than γ, and the algebraic equivalence between deflation and scalar recentering are obtained by direct linear-algebraic manipulation of the error recursion without parameter fitting, without renaming known results, and without load-bearing self-citations. All steps remain self-contained against ordinary MDP theory and the definition of JSR.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard MDP assumptions and the modeling of Q-VI as a switching system; no free parameters are introduced and no new entities are postulated.

axioms (2)

domain assumption Q-value iteration updates can be represented as a finite set of linear subsystems whose selection depends on the current greedy policy.
This modeling choice enables the application of joint spectral radius theory.
domain assumption Every admissible subsystem leaves the all-ones vector invariant.
Invoked to conclude that the ambient JSR equals γ.

pith-pipeline@v0.9.0 · 5529 in / 1450 out tokens · 56493 ms · 2026-05-12T03:14:08.604390+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 2 internal anchors

[1]

Bertsekas

Dimitri P. Bertsekas. Generic rank-one corrections for value iteration in markovian decision problems.Operations Research Letters, 17(3):111–119, 1995

work page 1995
[2]

Dynamic programming and optimal control 4th edition, volume ii.Athena Scientific, 2015

Dimitri P Bertsekas. Dynamic programming and optimal control 4th edition, volume ii.Athena Scientific, 2015

work page 2015
[3]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis.Neuro-dynamic programming. Athena Scientific Belmont, MA, 1996

work page 1996
[4]

Computationally efficient approximations of the joint spectral radius.SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005

Vincent D Blondel and Yurii Nesterov. Computationally efficient approximations of the joint spectral radius.SIAM Journal on Matrix Analysis and Applications, 27(1):256–272, 2005

work page 2005
[5]

A. Cicone. A note on the joint spectral radius.arXiv preprint arXiv:1502.01506, 2015

work page arXiv 2015
[6]

Goyal and J

V. Goyal and J. Grand-Clément. A first-order approach to accelerated value iteration.Operations Research, 71(2):517–535, 2023

work page 2023
[7]

J. P. Hespanha and A. S. Morse. Stability of switched systems with average dwell-time. In Proceedings of the 38th IEEE Conference on Decision and Control, pages 2655–2660, 1999

work page 1999
[8]

Springer Science & Business Media, 2009

Raphaël Jungers.The joint spectral radius: Theory and applications, volume 385. Springer Science & Business Media, 2009

work page 2009
[9]

A. S. Kolarijani, T. Ok, P. Mohajerin Esfahani, and M. A. S. Kolarijani. Rank-one modified value iteration. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 31182–31201, 2025

work page 2025
[10]

Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

Donghwan Lee. Beyond the bellman fixed point: geometry and fast policy identification in value iteration.arXiv preprint arXiv:2604.17457v4, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Lyapunov-Certified Direct Switching Theory for Q-Learning

Donghwan Lee. Lyapunov-certified direct switching theory for q-learning.arXiv preprint arXiv:2604.19569v2, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

J. Lee, A. Rakhsha, E. K. Ryu, and A.-m. Farahmand. Deflated dynamics value iteration. Transactions on Machine Learning Research, 2025

work page 2025
[13]

Springer Science & Business Media, 2003

Daniel Liberzon.Switching in systems and control. Springer Science & Business Media, 2003

work page 2003
[14]

Stability and stabilizability of switched linear systems: A survey of recent results.IEEE Transactions on Automatic Control, 54(2):308–322, 2009

Hai Lin and Panos J Antsaklis. Stability and stabilizability of switched linear systems: A survey of recent results.IEEE Transactions on Automatic Control, 54(2):308–322, 2009. 35

work page 2009
[15]

Meyer.Matrix Analysis and Applied Linear Algebra

Carl D. Meyer.Matrix Analysis and Applied Linear Algebra. SIAM, 2000

work page 2000
[16]

Puterman.Markov decision processes: Discrete stochastic dynamic programming

Martin L. Puterman.Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014
[17]

A note on the joint spectral radius.Indag

Gian-Carlo Rota and Gilbert Strang. A note on the joint spectral radius.Indag. Math, 22(4): 379–381, 1960

work page 1960
[18]

Stability criteria for switched and hybrid systems.SIAM Review, 49(4):545–592, 2007

Robert Shorten, Fabian Wirth, Oliver Mason, Kai Wulff, and Christopher King. Stability criteria for switched and hybrid systems.SIAM Review, 49(4):545–592, 2007

work page 2007
[19]

John N Tsitsiklis and Vincent D Blondel. The lyapunov exponent and joint spectral radius of pairs of matrices are hard—when not impossible—to compute and to approximate.Mathematics of Control, Signals and Systems, 10(1):31–40, 1997. 36

work page 1997