Geometrically Averaged Hard Target Updates for Linear Q-Learning

Donghwan Lee

arxiv: 2606.10835 · v1 · pith:BRMBSMR4new · submitted 2026-06-09 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-27 13:40 UTC · model grok-4.3


keywords: target updates, Q-learning, linear function approximation, stability analysis, switching systems, reinforcement learning, geometric averaging


The pith

The λ-target update stabilizes linear Q-learning by geometrically averaging periodic target maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the λ-target update for linear Q-learning, formed by averaging m-periodic target update maps with geometric weights (1-λ)λ^{m-1}. This family of updates is studied through a switching-system model in the deterministic case. The rule recovers the standard one-period target update when λ=0 and projected Q-value iteration as λ approaches 1. The analysis indicates that suitable λ values can improve stability compared to fixed periodic updates. The deterministic formulation is presented with the claim that it extends to stochastic reinforcement-learning settings.

Core claim

The λ-target update, obtained by averaging the m-periodic target update maps with λ-geometric weights (1-λ)λ^{m-1} where λ ∈ [0,1], improves stability in linear Q-learning as analyzed via a switching-system model. The endpoint λ=0 recovers the one-period target update while λ approaching 1 recovers projected Q-value iteration. The paper treats the deterministic version for clarity while stating that the formulation extends to stochastic settings.
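Written out, the averaging in the core claim takes the following form. This is a sketch under assumed notation: the symbols $T_m$ (the $m$-periodic target update map) and $T_\lambda$ (the λ-target update) are introduced here for illustration and are not taken from the paper.

```latex
% Assumed notation: T_m is the m-periodic target update map, T_\lambda the lambda-target update.
T_\lambda(\theta) \;=\; \sum_{m=1}^{\infty} (1-\lambda)\,\lambda^{m-1}\, T_m(\theta),
\qquad 0 \le \lambda < 1 .

% The weights form a probability distribution over periods m = 1, 2, \dots
\sum_{m=1}^{\infty} (1-\lambda)\,\lambda^{m-1}
  \;=\; (1-\lambda)\cdot\frac{1}{1-\lambda} \;=\; 1 .

% Mean period under these weights:
\sum_{m=1}^{\infty} m\,(1-\lambda)\,\lambda^{m-1} \;=\; \frac{1}{1-\lambda} .
```

At $\lambda = 0$ only the $m = 1$ term survives, so $T_0 = T_1$. As $\lambda \uparrow 1$ the mean period $1/(1-\lambda)$ grows without bound, which is consistent with the stated limit being an infinite-period object; at $\lambda = 1$ itself every weight vanishes, so that endpoint is presumably meant as a limit rather than a literal evaluation.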

What carries the argument

The λ-target update, a geometrically weighted average of m-periodic hard target update maps.
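To make the object concrete, here is a minimal numerical sketch. It assumes, without confirmation from this page, that the m-periodic map freezes the target at the current parameters and runs m deterministic linear Q-learning steps against it. The toy MDP, the step size, the truncation of the infinite sum, and every function name are hypothetical.

```python
import numpy as np

# Toy problem (all values hypothetical): 2 states x 2 actions, 2 features.
N_ACTIONS = 2
PHI = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])  # row index = s * N_ACTIONS + a
D = np.full(4, 0.25)                                       # state-action weighting
R = np.array([1., 0., 0., 2.])                             # expected rewards
P = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])     # next-state probabilities
GAMMA, ALPHA = 0.9, 1.0


def backup(theta_target):
    """Greedy Bellman backup evaluated at the frozen target parameters."""
    q = (PHI @ theta_target).reshape(-1, N_ACTIONS)  # rows: states, cols: actions
    return R + GAMMA * P @ q.max(axis=1)


def periodic_iterates(theta, m_max):
    """Row m-1 holds the assumed m-periodic map applied to theta."""
    y = backup(theta)            # target stays frozen for the whole period
    x = theta.copy()
    iterates = []
    for _ in range(m_max):
        x = x + ALPHA * PHI.T @ (D * (y - PHI @ x))
        iterates.append(x)
    return np.stack(iterates)


def lambda_target_map(theta, lam, m_max=200):
    """Geometric average of the periodic maps, truncated at m_max."""
    iterates = periodic_iterates(theta, m_max)
    weights = (1.0 - lam) * lam ** np.arange(m_max)  # (1 - lam) * lam^(m-1)
    weights[-1] += lam ** m_max                      # lump the truncated tail on the last term
    return weights @ iterates


def projected_backup(theta):
    """One projected Q-value iteration step (weighted least-squares fit of the backup)."""
    gram = PHI.T @ (D[:, None] * PHI)
    return np.linalg.solve(gram, PHI.T @ (D * backup(theta)))
```

In this toy setup, `lambda_target_map(theta, 0.0)` coincides with the one-period map, and values of `lam` close to 1 move the result toward `projected_backup(theta)`, mirroring the two endpoints described above. The tail-lumping in the truncation is a convenience of this sketch, not something the page attributes to the paper.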

If this is right

Different choices of λ produce different stability margins under the switching-system analysis.
The method continuously interpolates between common target-update heuristics and projected value iteration.
The deterministic analysis supplies a foundation for applying the update in stochastic linear Q-learning.
Periodic hard target updates can be replaced by this parameterized geometric average without changing the endpoint behaviors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometric averaging idea could be tested on nonlinear function approximators to check whether the stability pattern persists.
The switching-system viewpoint might be applied to analyze other stabilization devices such as soft target updates or replay buffer modifications.
Hyperparameter schedules that vary λ over training could be explored as a direct extension of the fixed-λ family.

Load-bearing premise

The deterministic switching-system model and its stability conclusions carry over when transitions and rewards are stochastic.

What would settle it

A linear Q-learning run in a stochastic environment where an intermediate λ produces divergence or oscillation while the switching-system analysis for the corresponding deterministic system predicts convergence.

Figures

Figures reproduced from arXiv: 2606.10835 by Donghwan Lee.

**Figure 1.** λ-DLQL target parameterization of the hard-target endpoints. The parameter λ = 0 recovers the period-one DLQL boundary update, while the continuous endpoint λ → 1 corresponds to the infinite-period PQVI limit. The interpolation in ...

Original abstract

Periodic hard target updates are among the most common stabilization devices in modern deep Q-learning. Recent studies suggest that target updates can improve stability in Q-learning with function approximation, including linear function approximation. We introduce and analyze the so-called $\lambda$-target update, obtained by averaging the $m$-periodic target update maps with $\lambda$-geometric weights $(1-\lambda)\lambda^{m-1}$, $\lambda \in [0,1]$. The endpoint $\lambda=0$ recovers the one-period target update, while the continuous endpoint $\lambda\uparrow1$ recovers projected Q-value iteration. We study this mechanism for Q-learning with linear function approximation, namely linear Q-learning, using a switching-system model and related tools. For clarity, the paper treats a deterministic version; the formulation extends to stochastic reinforcement-learning settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New λ-geometric averaging of hard targets for linear Q-learning, but the stability claim for stochastic RL rests on an unshown transfer from the deterministic switching-system analysis.


The paper introduces the λ-target update: a geometric average of m-periodic hard target maps with weights (1-λ)λ^{m-1}. This gives a continuous parameter that recovers the one-period hard target update at λ=0 and projected Q-value iteration as λ approaches 1. That interpolation is the concrete new object.

The analysis uses a switching-system model on the deterministic linear Q-learning dynamics. The setup is standard for this corner of the literature and lets them treat the target update as a switched linear system whose stability can be checked with existing tools. That part is cleanly executed on its own terms.
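As an illustration of what "existing tools" can look like for a switched linear system, the sketch below brackets the joint spectral radius of a finite set of matrices using all products of a fixed length. The page does not say which tools the paper actually uses, so the choice of technique, the function name `jsr_bounds`, and the example matrices are assumptions made for illustration only.

```python
import itertools
from functools import reduce

import numpy as np


def jsr_bounds(mats, k):
    """Bracket the joint spectral radius using all products of length k.

    Lower bound: max spectral radius of a product, to the power 1/k.
    Upper bound: max spectral norm of a product, to the power 1/k.
    """
    lower, upper = 0.0, 0.0
    for combo in itertools.product(mats, repeat=k):
        prod = reduce(np.matmul, combo)
        rho = max(abs(np.linalg.eigvals(prod)))
        nrm = np.linalg.norm(prod, 2)
        lower = max(lower, rho ** (1.0 / k))
        upper = max(upper, nrm ** (1.0 / k))
    return lower, upper


# Hypothetical mode matrices of a two-mode switched system x_{t+1} = A_{sigma(t)} x_t.
modes = [np.diag([0.5, 0.2]), np.diag([0.3, 0.6])]
lo, hi = jsr_bounds(modes, k=3)
```

An upper bound below 1 certifies that every switching sequence drives the state to zero. The cost grows as the number of modes raised to the power `k`, so brute-force enumeration like this is only practical for small examples.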

The soft spot is the move to stochastic settings. The abstract states that the deterministic treatment is for clarity and that the formulation extends to stochastic reinforcement learning, yet supplies no contraction argument, Lyapunov function, or even a sketch showing why the stability conclusions survive when the Bellman operator is replaced by a random operator driven by stochastic transitions and rewards. Linear Q-learning in practice is stochastic, so this missing step limits how far the headline stability claim travels.
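The gap can be made concrete with a small sketch of how a sampled update relates to its deterministic counterpart. It assumes a standard sample-based linear Q-learning increment against a frozen target; the toy MDP and all names are hypothetical and not taken from the paper. The code checks only that the deterministic increment is the expectation of the sampled one. It says nothing about whether stability survives the sampling noise, which is exactly the step flagged above.

```python
import numpy as np

# Toy problem (all values hypothetical): 2 states x 2 actions, 2 features.
N_ACTIONS = 2
PHI = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])  # row index = s * N_ACTIONS + a
D = np.full(4, 0.25)                                       # sampling distribution over (s, a)
R = np.array([1., 0., 0., 2.])                             # rewards (deterministic here)
P = np.array([[0.5, 0.5], [0., 1.], [1., 0.], [0.2, 0.8]])  # next-state probabilities
GAMMA = 0.9


def sampled_increment(i, s_next, x, theta_target):
    """TD increment from one transition: pair i, observed next state s_next."""
    q_next = PHI[s_next * N_ACTIONS:(s_next + 1) * N_ACTIONS] @ theta_target
    td_error = R[i] + GAMMA * q_next.max() - PHI[i] @ x
    return PHI[i] * td_error


def draw_increment(rng, x, theta_target):
    """Draw one random increment: (s, a) from D, next state from P."""
    i = rng.choice(len(D), p=D)
    s_next = rng.choice(P.shape[1], p=P[i])
    return sampled_increment(i, s_next, x, theta_target)


def mean_increment(x, theta_target):
    """Deterministic increment used in the switching-system style analysis."""
    v = (PHI @ theta_target).reshape(-1, N_ACTIONS).max(axis=1)
    return PHI.T @ (D * (R + GAMMA * P @ v - PHI @ x))


def expected_sampled_increment(x, theta_target):
    """Exact expectation of the sampled increment, by enumeration."""
    total = np.zeros_like(x)
    for i in range(len(D)):
        for s_next in range(P.shape[1]):
            total += D[i] * P[i, s_next] * sampled_increment(i, s_next, x, theta_target)
    return total
```

Matching means is the easy part. A stability transfer would additionally need control of the noise term, for example through a stochastic-approximation or perturbation argument of the kind the referee report asks for.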

No experiments or numerical checks appear in the abstract, which keeps the work purely theoretical. The scope is deliberately narrow—linear approximation only—so it does not claim to speak to deep RL or general function approximation.

This is for readers already working on target networks, periodic updates, or switched-system analyses inside linear RL. A specialist in approximate dynamic programming might find the interpolation device worth testing in their own proofs.

The deterministic analysis looks like it could stand up to referee scrutiny, so the paper deserves a serious review to verify the switching-system results and to press on whether the stochastic extension can be made rigorous or needs to be qualified. I would send it out rather than desk-reject, with a clear request that the authors address the transfer step.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces the λ-target update for linear Q-learning, obtained by averaging m-periodic hard target update maps with λ-geometric weights (1-λ)λ^{m-1}. It analyzes stability via a switching-system model in the deterministic setting (recovering one-period updates at λ=0 and projected Q-value iteration as λ↑1) and asserts that the formulation extends to stochastic RL.

Significance. If the switching-system stability analysis is sound and the deterministic-to-stochastic transfer holds, the λ-parameterization would supply a continuous bridge between common hard-target heuristics and value iteration, offering a tunable stabilization device for linear function approximation in Q-learning.

major comments (1)

[Abstract / Stochastic Extension] Abstract (final sentence) and any section presenting the stochastic extension: the claim that 'the formulation extends to stochastic reinforcement-learning settings' is unsupported; no Lyapunov argument, contraction mapping, or perturbation analysis is supplied showing that the deterministic switching-system stability conclusions survive replacement of the Bellman operator by a random operator driven by stochastic transitions and rewards. Because the headline stability claim targets practical (stochastic) linear Q-learning, this missing transfer step is load-bearing.

minor comments (1)

[Abstract] Abstract: the phrase 'using a switching-system model and related tools' is vague; naming the specific tools (e.g., joint spectral radius, common Lyapunov functions) would improve immediate readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review. We agree that the unsupported claim regarding stochastic extension must be addressed and will revise the manuscript accordingly.


Referee: [Abstract / Stochastic Extension] Abstract (final sentence) and any section presenting the stochastic extension: the claim that 'the formulation extends to stochastic reinforcement-learning settings' is unsupported; no Lyapunov argument, contraction mapping, or perturbation analysis is supplied showing that the deterministic switching-system stability conclusions survive replacement of the Bellman operator by a random operator driven by stochastic transitions and rewards. Because the headline stability claim targets practical (stochastic) linear Q-learning, this missing transfer step is load-bearing.

Authors: We agree that the manuscript provides no analysis (Lyapunov, contraction, or perturbation) transferring the deterministic switching-system stability results to the stochastic case. The paper explicitly states it treats the deterministic version 'for clarity,' and the final abstract sentence is an unsupported assertion. We will revise the abstract (and remove any similar phrasing elsewhere) to state only that the update rule itself is well-defined for stochastic settings, while the stability analysis is restricted to the deterministic case. No claim of stability transfer will remain.

Revision: yes

Circularity Check

0 steps flagged

No circularity: new λ-target definition and switching-system analysis are independent of inputs


The paper defines the λ-target update explicitly as the geometric average of m-periodic hard target maps and then applies a switching-system stability analysis to the resulting deterministic linear Q-learning dynamics. No equation reduces to a prior fitted parameter or self-referential definition, no prediction is obtained by refitting a subset of the same data, and no load-bearing step invokes a self-citation whose content is itself unverified. The deterministic-to-stochastic extension is asserted rather than derived, but this is a gap in justification, not a circular reduction of the claimed result to its own inputs. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the switching-system model is invoked but its assumptions are not detailed.

pith-pipeline@v0.9.1-grok · 5658 in / 1003 out tokens · 20305 ms · 2026-06-27T13:40:28.338725+00:00 · methodology


