Geometrically Averaged Hard Target Updates for Linear Q-Learning
Pith reviewed 2026-06-27 13:40 UTC · model grok-4.3
The pith
The λ-target update stabilizes linear Q-learning by geometrically averaging periodic target maps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The λ-target update, obtained by averaging the m-periodic target update maps with λ-geometric weights (1-λ)λ^{m-1} where λ ∈ [0,1], improves stability in linear Q-learning as analyzed via a switching-system model. The endpoint λ=0 recovers the one-period target update while λ approaching 1 recovers projected Q-value iteration. The paper treats the deterministic version for clarity while stating that the formulation extends to stochastic settings.
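Written out, the averaging in the core claim can be read as follows. This is a sketch under stated assumptions: the symbols T_m (the m-periodic target update map) and T_λ (the averaged map) are hypothetical notation, and the sum over m = 1, 2, ... is inferred from the form of the weights rather than quoted from the paper.

```latex
\begin{align*}
T_\lambda &:= (1-\lambda)\sum_{m=1}^{\infty} \lambda^{m-1}\, T_m,
  \qquad 0 \le \lambda < 1, \\
\sum_{m=1}^{\infty} (1-\lambda)\lambda^{m-1} &= (1-\lambda)\cdot\frac{1}{1-\lambda} = 1, \\
T_0 &= T_1 \quad \text{(at } \lambda = 0 \text{ only the } m = 1 \text{ term survives)}, \\
\sum_{m=1}^{\infty} m\,(1-\lambda)\lambda^{m-1} &= \frac{1}{1-\lambda}.
\end{align*}
```

The last identity gives the average period under these weights. It grows without bound as λ↑1, which is consistent with the stated limit in which projected Q-value iteration is recovered.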
What carries the argument
The λ-target update, a geometrically weighted average of m-periodic hard target update maps.
If this is right
- Different choices of λ produce different stability margins under the switching-system analysis.
- The method continuously interpolates between common target-update heuristics and projected value iteration.
- The deterministic analysis supplies a foundation for applying the update in stochastic linear Q-learning.
- Periodic hard target updates can be replaced by this parameterized geometric average without changing the endpoint behaviors.
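The interpolation described in these points can be illustrated numerically. The following is a minimal sketch, not the paper's algorithm: it assumes that an m-period map means m gradient-style inner steps against a target frozen at the start of the period, it truncates the geometric series at a finite period, and every name and number in it (PHI, D, R, P, GAMMA, ALPHA, lambda_target_map, projected_step) is hypothetical.

```python
import numpy as np

# Toy deterministic problem (all values hypothetical): 2 states x 2 actions,
# rows ordered (s0,a0), (s0,a1), (s1,a0), (s1,a1).
PHI = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])  # features
D = np.eye(4) / 4.0                                   # uniform weighting
R = np.array([1.0, 0.0, 0.0, 1.0])                    # rewards
P = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # next state
GAMMA, ALPHA, N_ACTIONS = 0.9, 0.5, 2


def frozen_target(theta_bar):
    """Greedy bootstrapped target computed from the frozen parameter only."""
    q_bar = (PHI @ theta_bar).reshape(-1, N_ACTIONS)  # (n_states, n_actions)
    return R + GAMMA * P @ q_bar.max(axis=1)


def lambda_target_map(theta_bar, lam, max_period=500):
    """Truncated geometric average of m-period maps, m = 1..max_period."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1); lam -> 1 is a limit")
    y = frozen_target(theta_bar)
    theta = theta_bar.copy()
    avg = np.zeros_like(theta_bar)
    total = 0.0
    for m in range(1, max_period + 1):
        # m-th inner step against the frozen target y.
        theta = theta + ALPHA * PHI.T @ D @ (y - PHI @ theta)
        w = (1.0 - lam) * lam ** (m - 1)  # geometric weight; 0.0 ** 0 == 1.0
        avg += w * theta
        total += w
    return avg / total  # renormalize the truncated weights


def projected_step(theta_bar):
    """Weighted least-squares fit to the frozen target (projected step)."""
    G = PHI.T @ D @ PHI
    return np.linalg.solve(G, PHI.T @ D @ frozen_target(theta_bar))
```

Under these assumptions, calling lambda_target_map(theta, 0.0) returns a single inner step, and the distance to projected_step(theta) shrinks as lam moves toward 1. The truncation and renormalization are conveniences of the sketch; an exact treatment would use the infinite series.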
Where Pith is reading between the lines
- The same geometric averaging idea could be tested on nonlinear function approximators to check whether the stability pattern persists.
- The switching-system viewpoint might be applied to analyze other stabilization devices such as soft target updates or replay buffer modifications.
- Hyperparameter schedules that vary λ over training could be explored as a direct extension of the fixed-λ family.
Load-bearing premise
The deterministic switching-system model and its stability conclusions carry over when transitions and rewards are stochastic.
What would settle it
A linear Q-learning run in a stochastic environment where an intermediate λ produces divergence or oscillation while the switching-system analysis for the corresponding deterministic system predicts convergence.
Original abstract
Periodic hard target updates are among the most common stabilization devices in modern deep Q-learning. Recent studies suggest that target updates can improve stability in Q-learning with function approximation, including linear function approximation. We introduce and analyze the so-called $\lambda$-target update, obtained by averaging the $m$-periodic target update maps with $\lambda$-geometric weights $(1-\lambda)\lambda^{m-1}$, $\lambda \in [0,1]$. The endpoint $\lambda=0$ recovers the one-period target update, while the continuous endpoint $\lambda\uparrow1$ recovers projected Q-value iteration. We study this mechanism for Q-learning with linear function approximation, namely linear Q-learning, using a switching-system model and related tools. For clarity, the paper treats a deterministic version; the formulation extends to stochastic reinforcement-learning settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the λ-target update for linear Q-learning, obtained by averaging m-periodic hard target update maps with λ-geometric weights (1-λ)λ^{m-1}. It analyzes stability via a switching-system model in the deterministic setting (recovering one-period updates at λ=0 and projected Q-value iteration as λ↑1) and asserts that the formulation extends to stochastic RL.
Significance. If the switching-system stability analysis is sound and the deterministic-to-stochastic transfer holds, the λ-parameterization would supply a continuous bridge between common hard-target heuristics and value iteration, offering a tunable stabilization device for linear function approximation in Q-learning.
major comments (1)
- [Abstract / Stochastic Extension] Abstract (final sentence) and any section presenting the stochastic extension: the claim that 'the formulation extends to stochastic reinforcement-learning settings' is unsupported; no Lyapunov argument, contraction mapping, or perturbation analysis is supplied showing that the deterministic switching-system stability conclusions survive replacement of the Bellman operator by a random operator driven by stochastic transitions and rewards. Because the headline stability claim targets practical (stochastic) linear Q-learning, this missing transfer step is load-bearing.
minor comments (1)
- [Abstract] Abstract: the phrase 'using a switching-system model and related tools' is vague; naming the specific tools (e.g., joint spectral radius, common Lyapunov functions) would improve immediate readability.
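To make one of the tools named in this comment concrete, here is a minimal sketch of a brute-force bracket on the joint spectral radius of a finite matrix set. It relies on the standard fact that, for products of a fixed length k, the largest spectral radius raised to the power 1/k is a lower bound and the largest induced norm raised to the power 1/k is an upper bound. Whether the paper uses this tool is not stated in the source; the function name jsr_bounds and the example matrices are hypothetical.

```python
import functools
import itertools

import numpy as np


def jsr_bounds(matrices, length):
    """Bracket the joint spectral radius using all products of one length.

    Lower bound: largest spectral radius of a product, to the power 1/length.
    Upper bound: largest spectral norm of a product, to the power 1/length.
    The number of products grows as len(matrices) ** length.
    """
    lower, upper = 0.0, 0.0
    for word in itertools.product(matrices, repeat=length):
        prod = functools.reduce(np.matmul, word)
        rho = np.max(np.abs(np.linalg.eigvals(prod)))  # spectral radius
        norm = np.linalg.norm(prod, 2)                 # induced 2-norm
        lower = max(lower, rho ** (1.0 / length))
        upper = max(upper, norm ** (1.0 / length))
    return lower, upper


# Hypothetical example: two modes of a switched linear system.
A1 = np.array([[0.6, 0.4], [0.0, 0.5]])
A2 = np.array([[0.5, 0.0], [0.4, 0.6]])
low, high = jsr_bounds([A1, A2], length=4)
```

An upper bound below 1 certifies stability of the switched system under arbitrary switching, while a lower bound above 1 rules it out. When the bracket straddles 1, longer products or a different certificate, such as a common Lyapunov function, would be needed.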
Simulated Author's Rebuttal
We thank the referee for the detailed review. We agree that the unsupported claim regarding stochastic extension must be addressed and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract / Stochastic Extension] Abstract (final sentence) and any section presenting the stochastic extension: the claim that 'the formulation extends to stochastic reinforcement-learning settings' is unsupported; no Lyapunov argument, contraction mapping, or perturbation analysis is supplied showing that the deterministic switching-system stability conclusions survive replacement of the Bellman operator by a random operator driven by stochastic transitions and rewards. Because the headline stability claim targets practical (stochastic) linear Q-learning, this missing transfer step is load-bearing.
Authors: We agree that the manuscript provides no analysis (Lyapunov, contraction, or perturbation) transferring the deterministic switching-system stability results to the stochastic case. The paper explicitly states it treats the deterministic version 'for clarity,' and the final abstract sentence is an unsupported assertion. We will revise the abstract (and remove any similar phrasing elsewhere) to state only that the update rule itself is well-defined for stochastic settings, while the stability analysis is restricted to the deterministic case. No claim of stability transfer will remain.
Revision: yes
Circularity Check
No circularity: new λ-target definition and switching-system analysis are independent of inputs
The paper defines the λ-target update explicitly as the geometric average of m-periodic hard target maps and then applies a switching-system stability analysis to the resulting deterministic linear Q-learning dynamics. No equation reduces to a prior fitted parameter or self-referential definition, no prediction is obtained by refitting a subset of the same data, and no load-bearing step invokes a self-citation whose content is itself unverified. The deterministic-to-stochastic extension is asserted rather than derived, but this is a gap in justification, not a circular reduction of the claimed result to its own inputs. The derivation chain therefore remains self-contained.