Recognition: 2 theorem links
· Lean TheoremBridging the Gap Between Average and Discounted TD Learning
Pith reviewed 2026-05-08 19:11 UTC · model grok-4.3
The pith
A new algorithm for average-reward TD learning converges to the projected Bellman solution using two Markovian trajectories and achieves quadratic sample complexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The algorithm guarantees convergence to the unique solution of the projected Bellman equation. The convergence analysis holds uniformly across linear function approximation and tabular settings without dimension-dependent terms in the bounds, and the sample complexity scales quadratically rather than quartically with the problem's condition number.
What carries the argument
Sampling from two independent Markovian trajectories, which supports the definition of a projected Bellman equation possessing a unique solution and enables the convergence proof.
If this is right
- Convergence holds for both linear function approximation and tabular settings.
- Convergence bounds contain no explicit dimension-dependent terms.
- Sample complexity scales quadratically with the condition number.
- Theoretical properties match those established for discounted TD learning.
Where Pith is reading between the lines
- The two-trajectory sampling technique may extend to other RL settings that rely on non-contractive operators.
- Quadratic complexity could support scaling the method to larger state spaces than previous average-reward approaches.
- The uniform analysis might simplify implementation in continuing-task environments where average reward is the natural objective.
Load-bearing premise
It is feasible to sample from two independent Markovian trajectories and the projected Bellman equation possesses a unique solution.
What would settle it
Apply the algorithm to an instance where the projected Bellman equation lacks a unique solution and check whether iterates fail to converge, or measure whether the observed sample complexity scales worse than quadratically with the condition number.
Figures
read the original abstract
The analysis of Temporal Difference (TD) learning in the average-reward setting faces notable theoretical difficulties because the Bellman operator is not contractive with respect to any norm. This complicates standard analyses of stochastic updates that are effective in discounted settings. Although a considerable body of literature addresses these challenges, existing theoretical approaches come with limitations. We introduce a novel algorithm designed explicitly for policy evaluation in the average-reward setting, utilizing sampling from two Markovian trajectories. Our proposed method overcomes previous limitations by guaranteeing convergence to the unique solution of a properly defined projected Bellman equation. Notably, and in contrast to earlier work, our convergence analysis is uniformly applicable to both linear function approximation and tabular settings and does not involve explicit dimension-dependent terms in its convergence bounds. These results align with what is known to hold in the discounted setting. Furthermore, our algorithm achieves improved dependence on the problem's condition number, reducing the sample complexity from quartic, as in prior literature, to quadratic scaling, and thus matching the efficiency seen in the discounted setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a novel TD learning algorithm for average-reward policy evaluation that samples from two independent Markovian trajectories. It claims convergence to the unique solution of a properly defined projected Bellman equation, with analysis uniformly applicable to both linear function approximation and tabular settings without explicit dimension-dependent terms in the bounds, and quadratic sample complexity dependence on the condition number (improving on prior quartic scaling to match the discounted case).
Significance. If the convergence guarantees and complexity results hold under the stated assumptions, the work would meaningfully close the theoretical gap between average-reward and discounted TD learning by delivering comparable guarantees and efficiency. The uniform treatment across linear FA and tabular cases without dimension dependence would be a notable strength.
major comments (2)
- Abstract: The quadratic sample-complexity claim and uniform applicability rest on the ability to sample two independent Markovian trajectories and on uniqueness of the projected Bellman equation. These assumptions are load-bearing; the manuscript must delineate the precise conditions (e.g., ergodicity requirements or feature-matrix properties) under which they hold, as violation under standard single-trajectory continuing-task settings would invalidate the stated improvements over prior quartic bounds.
- Abstract and analysis sections: The claimed absence of explicit dimension-dependent terms and the reduction from quartic to quadratic scaling are not verifiable from the provided abstract alone. The key steps of the convergence proof and the explicit dependence on the condition number should be summarized in the main text with reference to the relevant lemmas or theorems.
minor comments (1)
- The abstract would benefit from a concise statement of the exact form of the projected Bellman equation used to restore contractivity.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments and recommendation for major revision. We address each major comment below, providing clarifications and committing to revisions that will enhance the manuscript's clarity and rigor.
read point-by-point responses
-
Referee: Abstract: The quadratic sample-complexity claim and uniform applicability rest on the ability to sample two independent Markovian trajectories and on uniqueness of the projected Bellman equation. These assumptions are load-bearing; the manuscript must delineate the precise conditions (e.g., ergodicity requirements or feature-matrix properties) under which they hold, as violation under standard single-trajectory continuing-task settings would invalidate the stated improvements over prior quartic bounds.
Authors: We agree that clearly delineating these assumptions is essential. Our algorithm relies on the ability to generate two independent Markovian trajectories, which is a standard assumption in many average-reward settings with access to a simulator or multiple rollouts. We will revise the manuscript by adding a new subsection titled 'Assumptions and Setup' in the preliminaries, where we explicitly state: (1) the Markov chain is ergodic (irreducible and aperiodic) ensuring a unique stationary distribution π; (2) the feature matrix Φ has full column rank, guaranteeing uniqueness of the solution to the projected Bellman equation; and (3) the two trajectories are sampled independently from the stationary distribution. This will distinguish our setting from single-trajectory continuing tasks and justify the quadratic sample complexity. We will also update the abstract to briefly mention these conditions if space permits. revision: yes
-
Referee: Abstract and analysis sections: The claimed absence of explicit dimension-dependent terms and the reduction from quartic to quadratic scaling are not verifiable from the provided abstract alone. The key steps of the convergence proof and the explicit dependence on the condition number should be summarized in the main text with reference to the relevant lemmas or theorems.
Authors: Thank you for this feedback on presentation. Although the abstract is necessarily brief, we will expand the 'Overview of Results' or introduction section to provide a high-level summary of the proof. Specifically, we will outline: the use of two trajectories to construct an unbiased estimator for the average-reward Bellman operator; the establishment of a contraction property in the norm induced by the stationary distribution (with contraction factor depending on the mixing time but independent of dimension); and how this leads to quadratic dependence on the condition number κ of the covariance matrix (sample complexity scaling as O(κ² log(1/δ)/ε²) for accuracy ε with probability 1-δ). We will reference the key results, such as Theorem 1 for convergence and Lemma 3 for the contraction mapping, and note the absence of explicit d (dimension) terms due to the norm choice and analysis technique. This will allow readers to verify the claims from the main text. revision: yes
Circularity Check
No circularity: convergence and complexity results derived from new two-trajectory algorithm and standard assumptions
full rationale
The paper introduces a novel TD algorithm that samples from two independent Markovian trajectories to address non-contractivity in average-reward settings. It establishes convergence to the unique solution of a projected Bellman equation with dimension-independent bounds and quadratic sample complexity. These claims rest on the algorithm definition, ergodicity assumptions, and the posited uniqueness of the projected operator rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. The derivation is self-contained against external benchmarks and does not reduce the target results to quantities defined via the paper's own fitted inputs or prior author work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The properly defined projected Bellman equation possesses a unique solution.
- domain assumption Independent samples can be drawn from two Markovian trajectories of the underlying chain.
Reference graph
Works this paper leans on
-
[1]
Ergodic control of continuous-time Markov chains with pathwise constraints.SIAM journal on control and optimization, 47(4):1888–1908, 2008
Tomás Prieto-Rumeau and Onésimo Hernández-Lerma. Ergodic control of continuous-time Markov chains with pathwise constraints.SIAM journal on control and optimization, 47(4):1888–1908, 2008
1908
-
[2]
Cambridge University Press, 2016
Vikram Krishnamurthy.Partially observed Markov decision processes. Cambridge University Press, 2016
2016
-
[3]
Athena Scientific, 2021
Dimitri Bertsekas and Robert Gallager.Data networks. Athena Scientific, 2021
2021
-
[4]
Routledge, 2021
Eitan Altman.Constrained Markov decision processes. Routledge, 2021
2021
-
[5]
Springer Science & Business Media, 2012
Eugene A Feinberg and Adam Shwartz.Handbook of Markov decision processes: methods and applications, volume 40. Springer Science & Business Media, 2012
2012
-
[6]
A finite time analysis of temporal difference learning with linear function approximation
Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR, 2018
2018
-
[7]
Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022
Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022
2022
-
[8]
Haoxing Tian, Ioannis Ch Paschalidis, and Alex Olshevsky. On the performance of temporal difference learning with neural networks.arXiv preprint arXiv:2312.05397, 2023
-
[9]
Finite-time error bounds for linear stochastic approximation and TD learning
Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. InConference on Learning Theory, pages 2803–2830. PMLR, 2019
2019
-
[10]
Average cost temporal-difference learning.Automatica, 35(11):1799– 1808, 1999
John N Tsitsiklis and Benjamin Van Roy. Average cost temporal-difference learning.Automatica, 35(11):1799– 1808, 1999
1999
-
[11]
Finite sample analysis of average-reward TD learning and Q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021
Sheng Zhang, Zhe Zhang, and Siva Theja Maguluri. Finite sample analysis of average-reward TD learning and Q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021
2021
-
[12]
Almost sure convergence of average reward temporal difference learning
Ethan Blaser and Shangtong Zhang. Almost sure convergence of average reward temporal difference learning. arXiv preprint arXiv:2409.19546, 2024
-
[13]
Convergence results for some temporal difference methods based on least squares.IEEE Transactions on Automatic Control, 54(7):1515–1531, 2009
Huizhen Yu and Dimitri P Bertsekas. Convergence results for some temporal difference methods based on least squares.IEEE Transactions on Automatic Control, 54(7):1515–1531, 2009
2009
-
[14]
Average-reward off-policy policy evaluation with function approximation
Shangtong Zhang, Yi Wan, Richard S Sutton, and Shimon Whiteson. Average-reward off-policy policy evaluation with function approximation. InInternational Conference on Machine Learning, pages 12578–12588. PMLR, 2021
2021
-
[15]
Stochastic first-order methods for average-reward Markov decision processes.Mathematics of Operations Research, 2024
Tianjiao Li, Feiyang Wu, and Guanghui Lan. Stochastic first-order methods for average-reward Markov decision processes.Mathematics of Operations Research, 2024
2024
-
[16]
Stochastic fixed-point iterations for nonexpansive maps: Convergence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219, 2024
Mario Bravo and Roberto Cominetti. Stochastic fixed-point iterations for nonexpansive maps: Convergence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219, 2024
2024
-
[17]
Shaan Ul Haque and Siva Theja Maguluri. Stochastic approximation with unbounded Markovian noise: A general-purpose theorem.arXiv preprint arXiv:2410.21704, 2024
-
[18]
Zaiwei Chen, Sheng Zhang, Zhe Zhang, Shaan Ul Haque, and Siva Theja Maguluri. A non-asymptotic the- ory of seminorm lyapunov stability: From deterministic to stochastic iterative algorithms.arXiv preprint arXiv:2502.14208, 2025
-
[19]
Temporal difference learning as gradient splitting
Rui Liu and Alex Olshevsky. Temporal difference learning as gradient splitting. InInternational Conference on Machine Learning, pages 6905–6913. PMLR, 2021. 10 APREPRINT- MAY5, 2026
2021
-
[20]
A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation.Advances in neural information processing systems, 21, 2008
Richard S Sutton, Hamid Maei, and Csaba Szepesvári. A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation.Advances in neural information processing systems, 21, 2008
2008
-
[21]
Hwanwoo Kim, Dongkyu Derek Cho, and Eric Laber. Implicit updates for average-reward temporal difference learning.arXiv preprint arXiv:2510.06149, 2025
-
[22]
American Mathematical Soc., 2017
David A Levin and Yuval Peres.Markov chains and mixing times, volume 107. American Mathematical Soc., 2017
2017
-
[23]
Yann Ollivier. Approximate temporal difference learning is a gradient descent for reversible policies.arXiv preprint arXiv:1805.00869, 2018
-
[24]
Analysis of temporal-diffference learning with function approximation
John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-diffference learning with function approximation. Advances in neural information processing systems, 9, 1996
1996
-
[25]
Discrete stochastic processes.Journal of the Operational Research Society, 48(1):103–103, 1997
Robert G Gallager. Discrete stochastic processes.Journal of the Operational Research Society, 48(1):103–103, 1997
1997
-
[26]
Springer Science & Business Media, 2012
Sean P Meyn and Richard L Tweedie.Markov chains and stochastic stability. Springer Science & Business Media, 2012
2012
-
[27]
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17032, 2024
work page internal anchor Pith review arXiv 2024
-
[28]
Alegre, Ann Nowé, Ana L
Florian Felten, Lucas N. Alegre, Ann Nowé, Ana L. C. Bazzan, El Ghazali Talbi, Grégoire Danoy, and Bruno C. da Silva. A toolkit for reliable benchmarking and research in multi-objective reinforcement learning. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023. 11 APREPRINT- MAY5, 2026 A Bellman Operator and...
2023
-
[29]
Ifξ∈(0,1)andc 0 ≥ 2ξ aη 1 1−ξ , then E ∥δT ∥2 ≤β 1 exp − ηa 1−ξ (T+c 0)1−ξ −(τ mix +c 0)1−ξ + β(T) η(T+c 0)ξ
-
[30]
Throughout, we use the same notations as in Appendix C
Ifξ= 1andc 0 ≥ηa, then E ∥δT ∥2 ≤β 1 τmix +c 0 T+c 0 ηa + Γ(T) (T+c 0)q , whereq= min{1, ηa}, ˜β(T) := 2L 1β2 log(T+c 0)−loga+ 1 and Γ(T) := 4a2 1−ηa ˜β(T),0< a <1/η, 4a2 log(T+c 0) ˜β(T), a= 1/η, 4ea2 ηa−1 ˜β(T), a >1/η. Throughout, we use the same notations as in Appendix C. D.1 Proof of Theorem 4.3 Proof of Theorem 4.3. To control Marko...
2026
-
[31]
whereρ min = min{c1,λ,2ρ 0}
Ifξ∈(0,1)and c0 ≥max ( ξ ac1,λ 1 1−ξ , ξ ρ0a 1 1−ξ ) , then E ∥δθ T ∥2 ≤4R 2 θ ·exp − ac1,λ 1−ξ ∆T + a G1(T) +ρ 0c2,λG2(T) c1,λ(T+c 0)ξ + 4ac2,λ 1−ξ exp ac1,λ cξ 0 ! ∆T exp − a 1−ξ ρmin∆T . whereρ min = min{c1,λ,2ρ 0}
-
[32]
Ifξ= 1,aη <1, andλ 2 ≥ 2η−4ρ0 rmax+2Rθ , then E ∥δθ T ∥2 ≤4R 2 θ · τmix +c 0 T+c 0 ac1,λ + Γ1 + Γ0G1(T) (T+c 0)ac1,λ , whereΓ 0 andΓ 1 are defined in Appendix F. Proof of Theorem F .1.As before, the dynamic ofθ t satisfies E∥δθ t+1∥2 ≤E∥δ θ t ∥2 + 2αtE h δθ t ⊤ g(st, s′ t, wt, θt) i +α 2 t E ∥g(st, s′ t, wt, θt)∥2 =E∥δ θ t ∥2 + 2αtE h δθ t ⊤ g(st, s′ t, w...
2026
-
[33]
(41) Next we controlE∥δ w j ∥2
Using1−x≤e −x, I ′ 1 ≤exp −c1,λ T−1X i=τmix αi ! = exp −ac1,λ T−1X i=τmix 1 (i+c 0)ξ ! ≤exp −ac1,λ Z T τmix dx (x+c 0)ξ ! ,(40) hence I ′ 1 ≤ τmix+c0 T+c 0 ac1,λ , ξ= 1, exp − ac1,λ 1−ξ (T+c 0)1−ξ −(τ mix +c 0)1−ξ , ξ∈(0,1). (41) Next we controlE∥δ w j ∥2. 39 APREPRINT- MAY5, 2026 Lemma F.3.Let G2(T) := 64L 1 log(T+c 0)−log(ρ 0a) + 1 + 12.(42) The...
2026
-
[34]
Ifξ∈(0,1)andc 0 ≥ ξ ρ0a 1 1−ξ , then E∥δw t ∥2 ≤4 exp − 2ρ0a 1−ξ (t+c 0)1−ξ −(τ mix +c 0)1−ξ +G 2(T)· ρ0a (t+c 0)ξ
-
[35]
(b) Ifρ 0a= 1/2, then E∥δw t ∥2 ≤4 τmix +c 0 t+c 0 2ρ0a +G 2(T)· 4ρ2 0a2 log(t+c 0) t+c 0
Ifξ= 1, then: (a) Ifρ 0a∈(0,1/2), then E∥δw t ∥2 ≤4 τmix +c 0 t+c 0 2ρ0a +G 2(T)· 4ρ2 0a2 (1−2ρ 0a)(t+c 0)2ρ0a . (b) Ifρ 0a= 1/2, then E∥δw t ∥2 ≤4 τmix +c 0 t+c 0 2ρ0a +G 2(T)· 4ρ2 0a2 log(t+c 0) t+c 0 . (c) Ifρ 0a∈(1/2,∞), then E∥δw t ∥2 ≤4 τmix +c 0 t+c 0 2ρ0a +G 2(T)· 4eρ2 0a2 (2ρ0a−1)(t+c 0) . The proof is deferred to Section F.2. We now split into t...
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.