arxiv: 2606.10835 · v1 · pith:BRMBSMR4new · submitted 2026-06-09 · 💻 cs.LG · cs.AI

Geometrically Averaged Hard Target Updates for Linear Q-Learning

Pith reviewed 2026-06-27 13:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: target updates, Q-learning, linear function approximation, stability analysis, switching systems, reinforcement learning, geometric averaging

The pith

The λ-target update stabilizes linear Q-learning by geometrically averaging periodic target maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the λ-target update for linear Q-learning, formed by averaging m-periodic target update maps with geometric weights (1-λ)λ^{m-1}. This family of updates is studied through a switching-system model in the deterministic case. The rule recovers the standard one-period target update when λ=0 and projected Q-value iteration as λ approaches 1. The analysis indicates that suitable λ values can improve stability compared to fixed periodic updates. The deterministic formulation is presented with the claim that it extends to stochastic reinforcement-learning settings.
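The averaging described above can be written compactly. The following is a sketch in notation assumed for this note, not necessarily the paper's: $T_m$ stands for the $m$-periodic target update map and $T_\lambda$ for the resulting λ-target map.

```latex
\[
T_\lambda \;=\; \sum_{m=1}^{\infty} (1-\lambda)\,\lambda^{m-1}\, T_m ,
\qquad \lambda \in [0,1),
\]
\[
\sum_{m=1}^{\infty} (1-\lambda)\,\lambda^{m-1}
\;=\; (1-\lambda)\cdot\frac{1}{1-\lambda} \;=\; 1 ,
\qquad
T_\lambda\big|_{\lambda=0} \;=\; T_1
\quad (\text{using } 0^{0}=1).
\]
```

Since the weights are nonnegative and sum to one, $T_\lambda$ is a convex combination of the periodic maps. At $\lambda = 1$ exactly every weight is zero, which is why the second endpoint is stated as a limit rather than a value.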

Core claim

The λ-target update, obtained by averaging the m-periodic target update maps with λ-geometric weights (1-λ)λ^{m-1} where λ ∈ [0,1], improves stability in linear Q-learning as analyzed via a switching-system model. The endpoint λ=0 recovers the one-period target update while λ approaching 1 recovers projected Q-value iteration. The paper treats the deterministic version for clarity while stating that the formulation extends to stochastic settings.

What carries the argument

The λ-target update, a geometrically weighted average of m-periodic hard target update maps.
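To make the weighting concrete, here is a minimal Python sketch that tabulates the geometric weights (1-λ)λ^{m-1} over a truncated range of periods. The function name `geometric_weights` and the truncation parameter `max_period` are illustrative choices, not taken from the paper.

```python
import numpy as np

def geometric_weights(lam, max_period):
    """Return periods m = 1..max_period and weights (1 - lam) * lam**(m - 1).

    Truncating at max_period is an assumption of this sketch; the exact
    average runs over all m >= 1.
    """
    periods = np.arange(1, max_period + 1)
    weights = (1.0 - lam) * lam ** (periods - 1)
    return periods, weights

# Example: lam = 0.9 spreads the weight over many periods.
periods, weights = geometric_weights(0.9, 500)
total_mass = weights.sum()                 # close to 1 when the truncation is long enough
mean_period = (periods * weights).sum()    # close to 1 / (1 - lam)

# Example: lam = 0 puts all weight on the one-period map.
_, weights_zero = geometric_weights(0.0, 5)
```

The mean of this geometric distribution is 1/(1-λ), so λ can be read loosely as setting an average target-update period: λ = 0 gives period one, and the average period grows without bound as λ approaches 1.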

If this is right

  • Different choices of λ produce different stability margins under the switching-system analysis.
  • The method continuously interpolates between common target-update heuristics and projected value iteration.
  • The deterministic analysis supplies a foundation for applying the update in stochastic linear Q-learning.
  • Periodic hard target updates can be replaced by this parameterized geometric average without changing the endpoint behaviors.
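The interpolation claimed in the points above can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes that an m-periodic map starts the online parameter at the target parameter and runs m deterministic linear Q-learning steps against the frozen target, and it uses a hypothetical 2-state, 2-action problem with made-up features `Phi`, weighting `D`, transitions `P`, rewards `R`, and step size `alpha`. The infinite sum is truncated at `max_period` and renormalized, so `lam` must be strictly below 1.

```python
import numpy as np

# Hypothetical toy problem: 2 states x 2 actions (4 state-action pairs), 2 features.
gamma = 0.9
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
D = np.diag([0.25, 0.25, 0.25, 0.25])      # state-action weighting
P = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # next-state probabilities
R = np.array([1.0, 0.0, 0.0, 2.0])
alpha = 0.5                                  # step size (assumed)

def bellman_target(theta_target):
    """Bellman optimality backup evaluated at the frozen target parameter."""
    q = (Phi @ theta_target).reshape(2, 2)   # rows: states, columns: actions
    v = q.max(axis=1)
    return R + gamma * P @ v

def inner_step(theta, theta_target):
    """One deterministic linear Q-learning step against a frozen target."""
    td = bellman_target(theta_target) - Phi @ theta
    return theta + alpha * Phi.T @ D @ td

def periodic_map(theta_target, m):
    """Assumed m-periodic map: m inner steps started from the target."""
    theta = theta_target.copy()
    for _ in range(m):
        theta = inner_step(theta, theta_target)
    return theta

def lambda_target_map(theta_target, lam, max_period=2000):
    """Truncated, renormalized geometric average of the periodic maps (lam < 1)."""
    weights = (1.0 - lam) * lam ** np.arange(max_period)
    weights /= weights.sum()
    theta = theta_target.copy()
    avg = np.zeros_like(theta)
    for w in weights:
        theta = inner_step(theta, theta_target)  # theta now equals periodic_map(., m)
        avg += w * theta
    return avg

def pqvi_step(theta_target):
    """One projected Q-value iteration step (weighted least-squares fit)."""
    A = Phi.T @ D @ Phi
    b = Phi.T @ D @ bellman_target(theta_target)
    return np.linalg.solve(A, b)

theta0 = np.zeros(2)
gaps = [np.linalg.norm(lambda_target_map(theta0, lam) - pqvi_step(theta0))
        for lam in (0.0, 0.5, 0.99)]
```

Under these assumptions, `lambda_target_map(theta0, 0.0)` coincides with `periodic_map(theta0, 1)`, and the entries of `gaps` shrink as `lam` increases, which matches the stated endpoints on this toy instance. It says nothing about stability in general.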

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometric averaging idea could be tested on nonlinear function approximators to check whether the stability pattern persists.
  • The switching-system viewpoint might be applied to analyze other stabilization devices such as soft target updates or replay buffer modifications.
  • Hyperparameter schedules that vary λ over training could be explored as a direct extension of the fixed-λ family.

Load-bearing premise

The deterministic switching-system model and its stability conclusions carry over when transitions and rewards are stochastic.

What would settle it

A linear Q-learning run in a stochastic environment where an intermediate λ produces divergence or oscillation while the switching-system analysis for the corresponding deterministic system predicts convergence.

Figures

Figures reproduced from arXiv: 2606.10835 by Donghwan Lee.

Figure 1: λ-DLQL target parameterization of the hard-target endpoints. The parameter λ = 0 recovers the period-one DLQL boundary update, while the continuous endpoint λ → 1 corresponds to the infinite-period PQVI limit. The interpolation in ...

Original abstract

Periodic hard target updates are among the most common stabilization devices in modern deep Q-learning. Recent studies suggest that target updates can improve stability in Q-learning with function approximation, including linear function approximation. We introduce and analyze the so-called $\lambda$-target update, obtained by averaging the $m$-periodic target update maps with $\lambda$-geometric weights $(1-\lambda)\lambda^{m-1}$, $\lambda \in [0,1]$. The endpoint $\lambda=0$ recovers the one-period target update, while the continuous endpoint $\lambda\uparrow1$ recovers projected Q-value iteration. We study this mechanism for Q-learning with linear function approximation, namely linear Q-learning, using a switching-system model and related tools. For clarity, the paper treats a deterministic version; the formulation extends to stochastic reinforcement-learning settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces the λ-target update for linear Q-learning, obtained by averaging m-periodic hard target update maps with λ-geometric weights (1-λ)λ^{m-1}. It analyzes stability via a switching-system model in the deterministic setting (recovering one-period updates at λ=0 and projected Q-value iteration as λ↑1) and asserts that the formulation extends to stochastic RL.

Significance. If the switching-system stability analysis is sound and the deterministic-to-stochastic transfer holds, the λ-parameterization would supply a continuous bridge between common hard-target heuristics and value iteration, offering a tunable stabilization device for linear function approximation in Q-learning.

major comments (1)
  1. [Abstract / Stochastic Extension] Abstract (final sentence) and any section presenting the stochastic extension: the claim that 'the formulation extends to stochastic reinforcement-learning settings' is unsupported; no Lyapunov argument, contraction mapping, or perturbation analysis is supplied showing that the deterministic switching-system stability conclusions survive replacement of the Bellman operator by a random operator driven by stochastic transitions and rewards. Because the headline stability claim targets practical (stochastic) linear Q-learning, this missing transfer step is load-bearing.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'using a switching-system model and related tools' is vague; naming the specific tools (e.g., joint spectral radius, common Lyapunov functions) would improve immediate readability.
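For readers unfamiliar with the joint spectral radius named in the minor comment, the following Python sketch computes the standard brute-force bounds for a finite set of matrices: for any product length k, the largest spectral radius over all length-k products, raised to the power 1/k, is a lower bound, and the largest spectral norm, raised to the same power, is an upper bound. The function name `jsr_bounds` and the example matrices are hypothetical and are not derived from the paper's switching-system model.

```python
import functools
import itertools
import numpy as np

def jsr_bounds(matrices, length):
    """Brute-force lower and upper bounds on the joint spectral radius.

    Enumerates every product of `length` matrices drawn from `matrices`,
    so the cost grows as len(matrices) ** length.
    """
    lower, upper = 0.0, 0.0
    for combo in itertools.product(matrices, repeat=length):
        prod = functools.reduce(np.matmul, combo)
        rho = np.max(np.abs(np.linalg.eigvals(prod)))   # spectral radius of the product
        norm = np.linalg.norm(prod, 2)                  # spectral norm of the product
        lower = max(lower, rho ** (1.0 / length))
        upper = max(upper, norm ** (1.0 / length))
    return lower, upper

# Hypothetical two-mode switching system.
modes = [0.5 * np.eye(2), np.array([[0.0, 0.5], [0.5, 0.0]])]
lower, upper = jsr_bounds(modes, 3)
```

An upper bound strictly below 1 certifies that the switched linear system is stable under arbitrary switching among the given modes; a lower bound at or above 1 rules that out.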

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review. We agree that the unsupported claim regarding stochastic extension must be addressed and will revise the manuscript accordingly.

  1. Referee: [Abstract / Stochastic Extension] Abstract (final sentence) and any section presenting the stochastic extension: the claim that 'the formulation extends to stochastic reinforcement-learning settings' is unsupported; no Lyapunov argument, contraction mapping, or perturbation analysis is supplied showing that the deterministic switching-system stability conclusions survive replacement of the Bellman operator by a random operator driven by stochastic transitions and rewards. Because the headline stability claim targets practical (stochastic) linear Q-learning, this missing transfer step is load-bearing.

    Authors: We agree that the manuscript provides no analysis (Lyapunov, contraction, or perturbation) transferring the deterministic switching-system stability results to the stochastic case. The paper explicitly states it treats the deterministic version 'for clarity,' and the final abstract sentence is an unsupported assertion. We will revise the abstract (and remove any similar phrasing elsewhere) to state only that the update rule itself is well-defined for stochastic settings, while the stability analysis is restricted to the deterministic case. No claim of stability transfer will remain. revision: yes

Circularity Check

0 steps flagged

No circularity: new λ-target definition and switching-system analysis are independent of inputs

The paper defines the λ-target update explicitly as the geometric average of m-periodic hard target maps and then applies a switching-system stability analysis to the resulting deterministic linear Q-learning dynamics. No equation reduces to a prior fitted parameter or self-referential definition, no prediction is obtained by refitting a subset of the same data, and no load-bearing step invokes a self-citation whose content is itself unverified. The deterministic-to-stochastic extension is asserted rather than derived, but this is a gap in justification, not a circular reduction of the claimed result to its own inputs. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the switching-system model is invoked but its assumptions are not detailed.
