pith. sign in

arxiv: 2606.24981 · v1 · pith:ZDBMHHLAnew · submitted 2026-06-23 · 💻 cs.LG · math.OC· stat.ML

A Single Stepsize Suffices for Unprojected Linear TD(0): Simultaneous Robust and Fast Rates via Polyak--Ruppert Averaging

Pith reviewed 2026-06-26 00:37 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML
keywords linear TD(0)Polyak-Ruppert averagingMarkovian samplinghigh-probability boundsmixing timetemporal difference learningstochastic approximation
0
0 comments X

The pith

A single stepsize schedule depending only on mixing time suffices for unprojected linear TD(0) with Polyak-Ruppert averaging to remain bounded and converge at the minimum of robust and fast rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that linear TD(0) under Markovian sampling can use one fixed stepsize rule that depends on the chain's mixing time but needs no knowledge of the curvature parameter. This rule makes the raw iterates bounded with high probability without any projection step or separate stability proof. The same rule then gives the Polyak-Ruppert average a high-probability error bound that automatically equals the smaller of a curvature-free robust rate and a curvature-dependent fast rate. A reader would care because it removes two common practical obstacles: choosing a projection radius and tuning a stepsize to an unknown curvature value.

Core claim

With the stepsize schedule η_t proportional to 1 over (τ_mix log(t) square-root t), the unprojected TD(0) iterates stay uniformly bounded with high probability on a single trajectory; the Polyak-Ruppert average then converges at the minimum of the curvature-free rate Õ(τ_mix over square-root T) and the curvature-dependent rate Õ(τ_mix squared over (ω T)).

What carries the argument

The Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes the Markov noise into a martingale term plus a controlled remainder and supports a self-bounding inductive argument for pathwise stability.

If this is right

  • No projection step is required for stability.
  • The curvature parameter ω need not be known in advance.
  • The same stepsize simultaneously delivers both the robust and the fast rate, taking the better of the two.
  • All guarantees are high-probability and hold uniformly over the trajectory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-bounding technique may transfer to other stochastic approximation schemes that use dependent samples.
  • Practical implementations could drop projection operators entirely when the mixing time is known or estimated.
  • The result suggests testing whether similar single-stepsize rules work for nonlinear or deep TD variants.

Load-bearing premise

The data are generated by a geometrically mixing Markov chain.

What would settle it

A geometrically mixing chain on which the proposed stepsize produces iterates that escape any fixed ball with probability bounded away from zero, or on which the averaged error fails to meet the minimum of the two stated rates.

read the original abstract

We study linear TD(0) under Markovian sampling, where data are generated along a single trajectory. We provide high-probability guarantees for a plain unprojected TD(0) algorithm with Polyak-Ruppert (PR) averaging, using a single stepsize schedule $\eta_t \propto \frac{1}{\tau_{\mathrm{mix}}\log(t)\sqrt{t}}$ that depends on the mixing time but requires no prior knowledge of the curvature parameter $\omega$. Our first result shows that such a choice of the stepsize guarantees that the TD(0) iterates are automatically and uniformly bounded with high probability, without projections and without any stability argument based on $\omega$. Building on this result, we establish a simultaneous high-probability convergence guarantee for the PR average: the same stepsize yields both a robust curvature-free $\widetilde{\mathcal{O}}\!\left(\frac{\tau_{\mathrm{mix}}}{\sqrt{T}}\right)$ rate and a fast curvature-dependent $\widetilde{\mathcal{O}}\!\left(\frac{\tau_{\mathrm{mix}}^2}{\omega T}\right)$rate, with the bound taking the minimum of the two. The core technical ingredient is a Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes Markov noise into a martingale term plus a controlled remainder and enables a new self-bounding inductive argument for pathwise stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that for linear TD(0) under single-trajectory Markovian sampling, the stepsize schedule η_t ∝ 1/(τ_mix log(t) √t) (depending only on mixing time) ensures that the unprojected iterates remain uniformly bounded with high probability, without projections or any stability argument that uses the curvature parameter ω. Building on this, the Polyak-Ruppert averaged iterates simultaneously achieve a robust curvature-free high-probability rate Õ(τ_mix/√T) and a fast curvature-dependent rate Õ(τ_mix²/(ω T)), with the bound taken as the minimum of the two. The core device is a Poisson-equation decomposition that splits the Markov noise into a martingale plus a controlled remainder, enabling a new self-bounding inductive argument for pathwise stability.

Significance. If the derivations close, the result is significant: it removes the need for projections or knowledge of ω while delivering both robustness and fast rates from one schedule. The Poisson-equation toolkit and the self-bounding induction are explicit strengths that could be reusable. The work directly addresses a practical pain point in TD learning under Markovian data.

major comments (1)
  1. [self-bounding inductive argument for pathwise stability (core technical ingredient)] The central boundedness claim (abstract and the self-bounding argument) asserts that the induction closes for arbitrary ω>0 with a stepsize independent of ω. However, because the Poisson remainder scales with ||θ_t|| and the contraction weakens as ω→0, the choice of threshold M and the martingale tail bound must be shown to close uniformly in ω. Please provide the explicit dependence (or independence) of M on ω and the accumulated sum of η_t in the relevant lemma or proposition that establishes the high-probability event.
minor comments (2)
  1. [statement of main results] Clarify the precise definition of the Õ notation and whether the log(t) factor in η_t is absorbed into the Õ or appears explicitly in the final bounds.
  2. [assumptions] The geometric mixing assumption is stated as weakest; confirm that all constants in the rates are allowed to depend on the mixing parameters but not on ω.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the significance of the result and for the careful reading of the technical core. We address the single major comment below.

read point-by-point responses
  1. Referee: The central boundedness claim (abstract and the self-bounding argument) asserts that the induction closes for arbitrary ω>0 with a stepsize independent of ω. However, because the Poisson remainder scales with ||θ_t|| and the contraction weakens as ω→0, the choice of threshold M and the martingale tail bound must be shown to close uniformly in ω. Please provide the explicit dependence (or independence) of M on ω and the accumulated sum of η_t in the relevant lemma or proposition that establishes the high-probability event.

    Authors: We agree that an explicit statement of the ω-independence of M is useful for clarity. In the self-bounding argument (Proposition 4.3 and the supporting Lemma 4.2), M is chosen as M = 2||θ_0|| + C τ_mix (1 + log(1/δ)), where C is an absolute constant; this choice does not depend on ω. The Poisson remainder is controlled by the geometric mixing and appears as an additive term whose accumulated effect is bounded by O(τ_mix sum_{s=1}^t η_s), but the induction is closed by showing that the probability of a large deviation is controlled via a Freedman-type martingale inequality whose variance proxy is sum η_s^2 ≲ 1/(τ_mix^2 log^2 t) summed up to O(1/τ_mix^2). Because the step-size schedule is independent of ω and the contraction ω is never invoked in the boundedness proof (only the smallness of η_t prevents overshoot), the same M works uniformly for all ω > 0. We will add a short remark after Proposition 4.3 that records these explicit expressions for M and the accumulated sums, confirming uniformity in ω. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained with no reduction to inputs by construction

full rationale

The provided abstract and outline describe a Poisson-equation decomposition of Markov noise into martingale plus remainder, followed by a self-bounding inductive argument for pathwise stability of unprojected TD(0) iterates under a stepsize depending only on τ_mix. The resulting high-probability bounds are stated as the minimum of a curvature-free Õ(τ_mix/√T) rate and a curvature-dependent Õ(τ_mix²/(ω T)) rate; both quantities are treated as external properties of the chain and operator rather than quantities fitted or defined from the iterates themselves. No equations, self-citations, or ansatzes are shown that would reduce the claimed rates or stability claim to a tautology or to a parameter estimated from the target result. The paper explicitly asserts the argument avoids any ω-based stability step, and the given text supplies no counter-evidence of circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the standard domain assumption of geometric mixing for the Markov chain; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The data-generating Markov chain is geometrically mixing.
    Required for the Poisson-equation decomposition of Markov noise into martingale plus controlled remainder.

pith-pipeline@v0.9.1-grok · 5796 in / 1295 out tokens · 54157 ms · 2026-06-26T00:37:35.914428+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Conference on Learning Theory , pages=

    Stochastic linear optimization never overfits with quadratically-bounded losses on general data , author=. Conference on Learning Theory , pages=. 2022 , organization=

  2. [2]

    2023 , organization=

    Ivgi, Maor and Hinder, Oliver and Carmon, Yair , booktitle=. 2023 , organization=

  3. [3]

    Conference on Learning Theory , pages=

    A finite time analysis of temporal difference learning with linear function approximation , author=. Conference on Learning Theory , pages=. 2018 , organization=

  4. [4]

    International Conference on Machine Learning , pages=

    Temporal difference learning as gradient splitting , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  5. [5]

    A Simple Finite-Time Analysis of

    Mitra, Aritra , journal=. A Simple Finite-Time Analysis of. 2025 , volume=

  6. [6]

    arXiv preprint arXiv:2402.13903 , year=

    Dealing with unbounded gradients in stochastic saddle-point optimization , author=. arXiv preprint arXiv:2402.13903 , year=

  7. [7]

    International Conference on Machine Learning , pages=

    Unconstrained online learning with unbounded losses , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  8. [8]

    Stochastic gradient descent under

    Even, Mathieu , booktitle=. Stochastic gradient descent under. 2023 , organization=

  9. [9]

    Black-box reductions for parameter-free online learning in

    Cutkosky, Ashok and Orabona, Francesco , booktitle=. Black-box reductions for parameter-free online learning in. 2018 , organization=

  10. [10]

    2026 , publisher =

    Mannor, Shie and Mansour, Yishay and Tamar, Aviv , title =. 2026 , publisher =

  11. [11]

    Machine Learning , volume=

    Learning to predict by the methods of temporal differences , author=. Machine Learning , volume=. 1988 , publisher=

  12. [12]

    Logarithmic

    Diaconis, Persi and Saloff-Coste, Laurent , journal=. Logarithmic. 1996 , publisher=

  13. [13]

    Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies

    Approximate temporal difference learning is a gradient descent for reversible policies , author=. arXiv preprint arXiv:1805.00869 , year=

  14. [14]

    Advances in Neural Information Processing Systems , volume=

    Analysis of temporal-difference learning with function approximation , author=. Advances in Neural Information Processing Systems , volume=

  15. [15]

    Journal of Artificial Intelligence Research , volume=

    Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity , author=. Journal of Artificial Intelligence Research , volume=

  16. [16]

    Finite-time error bounds for linear stochastic approximation and

    Srikant, Rayadurgam and Ying, Lei , booktitle=. Finite-time error bounds for linear stochastic approximation and. 2019 , organization=

  17. [17]

    Mastering the game of

    Silver, David and Huang, Aja and Maddison, Chris J and Guez, Arthur and Sifre, Laurent and Van Den Driessche, George and Schrittwieser, Julian and Antonoglou, Ioannis and Panneershelvam, Veda and Lanctot, Marc and Dieleman, Sander and Grewe, Dominik and Nham, John and Kalchbrenner, Nal and Sutskever, Ilya and Lillicrap, Timothy and Leach, Madeleine and Ka...

  18. [18]

    2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=

    Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates , author=. 2017 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2017 , organization=

  19. [19]

    Proceedings of the IEEE International Conference on Computer Vision , pages=

    Deepdriving: Learning affordance for direct perception in autonomous driving , author=. Proceedings of the IEEE International Conference on Computer Vision , pages=

  20. [20]

    Wiley Interdisciplinary Reviews: Computational Statistics , volume=

    Stochastic approximation: a survey , author=. Wiley Interdisciplinary Reviews: Computational Statistics , volume=. 2010 , publisher=

  21. [21]

    Finite sample analyses for

    Dalal, Gal and Sz. Finite sample analyses for. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  22. [22]

    International Conference on Artificial Intelligence and Statistics , pages=

    Linear stochastic approximation: How far does constant step-size and iterate averaging go? , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

  23. [23]

    Korda, Nathaniel and La, Prashanth , booktitle=. On. 2015 , organization=

  24. [24]

    state-of-the-art

    Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results , author=. 2017 , institution=

  25. [25]

    arXiv preprint arXiv:2102.00236 , year=

    Parameter-free Stochastic Optimization of Variationally Coherent Functions , author =. arXiv preprint arXiv:2102.00236 , year=

  26. [26]

    2017 , publisher=

    Markov chains and mixing times , author=. 2017 , publisher=

  27. [27]

    Journal of Machine Learning Research , volume =

    Xiao, Lin , title =. Journal of Machine Learning Research , volume =. 2010 , pages =

  28. [28]

    arXiv preprint arXiv:2505.21391 , year=

    Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features , author=. arXiv preprint arXiv:2505.21391 , year=

  29. [29]

    A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with

    Khodadadian, Sajad and Zubeldia, Martin , journal=. A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with

  30. [30]

    The Thirty Seventh Annual Conference on Learning Theory , pages=

    Improved high-probability bounds for the temporal difference learning algorithm via exponential stability , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

  31. [31]

    SIAM Journal on Mathematics of Data Science , volume=

    Accelerated and instance-optimal policy evaluation with linear function approximation , author=. SIAM Journal on Mathematics of Data Science , volume=. 2023 , publisher=

  32. [32]

    International Conference on Artificial Intelligence and Statistics , pages=

    Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=

  33. [33]

    Automatica , volume=

    Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning , author=. Automatica , volume=. 2022 , publisher=

  34. [34]

    1998 , publisher=

    Reinforcement Learning: An Introduction , author=. 1998 , publisher=

  35. [35]

    SIAM Journal on Optimization , volume=

    Robust stochastic approximation approach to stochastic programming , author=. SIAM Journal on Optimization , volume=. 2009 , publisher=

  36. [36]

    Least squares regression with

    Nagaraj, Dheeraj and Wu, Xian and Bresler, Guy and Jain, Prateek and Netrapalli, Praneeth , journal=. Least squares regression with

  37. [37]

    A Robust

    Lee, Wei-Cheng and Orabona, Francesco , journal=. A Robust

  38. [38]

    Chernoff-type bound for finite

    Lezaud, Pascal , journal=. Chernoff-type bound for finite. 1998 , publisher=

  39. [39]

    2024 , url =

    Harin Lee , title =. 2024 , url =

  40. [40]

    Towards Weaker Variance Assumptions for Stochastic Optimization

    Towards weaker variance assumptions for stochastic optimization , author=. arXiv preprint arXiv:2504.09951 , year=

  41. [41]

    arXiv preprint arXiv:2511.19937 , year=

    Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization , author=. arXiv preprint arXiv:2511.19937 , year=

  42. [42]

    A Unified Stability Analysis of

    Chang, Wei-Kai and Khanna, Rajiv , journal=. A Unified Stability Analysis of

  43. [43]

    Optimum bounds for the distributions of martingales in

    Pinelis, Iosif , journal=. Optimum bounds for the distributions of martingales in. 1994 , publisher=

  44. [44]

    The Annals of Probability , pages=

    On tail probabilities for martingales , author=. The Annals of Probability , pages=. 1975 , publisher=

  45. [45]

    Freedman's inequality for matrix martingales , author=

  46. [46]

    Advances in Neural Information Processing Systems , volume=

    Statistical efficiency of distributional temporal difference learning , author=. Advances in Neural Information Processing Systems , volume=

  47. [47]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

    Adaptive temporal difference learning with linear function approximation , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2021 , publisher=

  48. [48]

    Advances in Neural Information Processing Systems , volume=

    Finite-time analysis of adaptive temporal difference learning with deep neural networks , author=. Advances in Neural Information Processing Systems , volume=

  49. [49]

    Journal of Machine Learning Research , volume=

    Adaptive subgradient methods for online learning and stochastic optimization , author=. Journal of Machine Learning Research , volume=

  50. [50]

    Adam: A Method for Stochastic Optimization

    Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

  51. [51]

    Adaptive Bound Optimization for Online Convex Optimization

    Adaptive bound optimization for online convex optimization , author=. arXiv preprint arXiv:1002.4908 , year=

  52. [52]

    International Conference on Machine Learning , pages=

    A convergence theory for deep learning via over-parameterization , author=. International Conference on Machine Learning , pages=. 2019 , organization=

  53. [53]

    Advances in Neural Information Processing Systems , volume=

    High-probability bounds for non-convex stochastic optimization with heavy tails , author=. Advances in Neural Information Processing Systems , volume=

  54. [54]

    2012 , publisher=

    Markov chains and stochastic stability , author=. 2012 , publisher=

  55. [55]

    arXiv preprint arXiv:2603.02577 , year=

    Towards Parameter-Free Temporal Difference Learning , author=. arXiv preprint arXiv:2603.02577 , year=

  56. [56]

    Mathematics of Operations Research , volume =

    Durmus, Alain and Moulines, Eric and Naumov, Alexey and Samsonov, Sergey , title =. Mathematics of Operations Research , volume =

  57. [57]

    High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation , year=

    Li, Gen and Wu, Weichen and Chi, Yuejie and Ma, Cong and Rinaldo, Alessandro and Wei, Yuting , journal=. High-Probability Sample Complexities for Policy Evaluation With Linear Function Approximation , year=

  58. [58]

    Machine Learning , volume =

    Concentration Bounds for Temporal Difference Learning with Linear Function Approximation: The Case of Batch Data and Uniform Sampling , author =. Machine Learning , volume =

  59. [59]

    On Linear Stochastic Approximation: Fine-grained

    Mou, Wenlong and Li, Chris Junchi and Wainwright, Martin J and Bartlett, Peter L and Jordan, Michael I , booktitle =. On Linear Stochastic Approximation: Fine-grained

  60. [60]

    Asymptotic and finite sample analysis of nonexpansive stochastic approximations with

    Blaser, Ethan and Zhang, Shangtong , journal=. Asymptotic and finite sample analysis of nonexpansive stochastic approximations with

  61. [61]

    , title =

    Chandak, Siddharth and Borkar, Vivek S. , title =. Stochastic Systems , doi =

  62. [62]

    , title =

    Konda, Vijaymohan R. , title =. 2002 , address =

  63. [63]

    Sample complexity of asynchronous

    Li, Gen and Wei, Yuting and Chi, Yuejie and Gu, Yuantao and Chen, Yuxin , journal=. Sample complexity of asynchronous

  64. [64]

    Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient

    Raj, Anant and Joulani, Pooria and Gyorgy, Andras and Szepesv. Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=