arxiv: 2605.02103 · v1 · submitted 2026-05-03 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Bridging the Gap Between Average and Discounted TD Learning

Haoxing Tian , Zaiwei Chen , Ioannis Ch. Paschalidis , Alex Olshevsky

Authors on Pith no claims yet

Pith reviewed 2026-05-08 19:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords temporal difference learningaverage rewardpolicy evaluationprojected Bellman equationconvergence analysissample complexityreinforcement learning

0 comments

The pith

A new algorithm for average-reward TD learning converges to the projected Bellman solution using two Markovian trajectories and achieves quadratic sample complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a TD learning algorithm for policy evaluation in the average-reward setting that draws samples from two independent Markovian trajectories. It establishes convergence to the unique solution of a properly defined projected Bellman equation, with analysis that applies uniformly to both linear function approximation and tabular representations without explicit dimension terms in the bounds. The method reduces sample complexity dependence on the condition number from quartic to quadratic scaling, aligning the theoretical guarantees with those already available for discounted TD learning.

Core claim

The algorithm guarantees convergence to the unique solution of the projected Bellman equation. The convergence analysis holds uniformly across linear function approximation and tabular settings without dimension-dependent terms in the bounds, and the sample complexity scales quadratically rather than quartically with the problem's condition number.

What carries the argument

Sampling from two independent Markovian trajectories, which supports the definition of a projected Bellman equation possessing a unique solution and enables the convergence proof.

If this is right

Convergence holds for both linear function approximation and tabular settings.
Convergence bounds contain no explicit dimension-dependent terms.
Sample complexity scales quadratically with the condition number.
Theoretical properties match those established for discounted TD learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The two-trajectory sampling technique may extend to other RL settings that rely on non-contractive operators.
Quadratic complexity could support scaling the method to larger state spaces than previous average-reward approaches.
The uniform analysis might simplify implementation in continuing-task environments where average reward is the natural objective.

Load-bearing premise

It is feasible to sample from two independent Markovian trajectories and the projected Bellman equation possesses a unique solution.

What would settle it

Apply the algorithm to an instance where the projected Bellman equation lacks a unique solution and check whether iterates fail to converge, or measure whether the observed sample complexity scales worse than quadratically with the condition number.

Figures

Figures reproduced from arXiv: 2605.02103 by Alex Olshevsky, Haoxing Tian, Ioannis Ch. Paschalidis, Zaiwei Chen.

**Figure 1.** Figure 1: Illustration of key property of the gradient splitting. The splitting vector h(θ) has the same inner product (up to a factor of 2) as the true gradient with the vector θ − θ ∗ . Thus, this quantity could easily be estimated separately without affecting the results of this section. A detailed discussion is in Appendix B. 4.3 Proof Idea Our main observation is that the methods we propose can be analyzed in m… view at source ↗

**Figure 2.** Figure 2: Simulation results 48 view at source ↗

read the original abstract

The analysis of Temporal Difference (TD) learning in the average-reward setting faces notable theoretical difficulties because the Bellman operator is not contractive with respect to any norm. This complicates standard analyses of stochastic updates that are effective in discounted settings. Although a considerable body of literature addresses these challenges, existing theoretical approaches come with limitations. We introduce a novel algorithm designed explicitly for policy evaluation in the average-reward setting, utilizing sampling from two Markovian trajectories. Our proposed method overcomes previous limitations by guaranteeing convergence to the unique solution of a properly defined projected Bellman equation. Notably, and in contrast to earlier work, our convergence analysis is uniformly applicable to both linear function approximation and tabular settings and does not involve explicit dimension-dependent terms in its convergence bounds. These results align with what is known to hold in the discounted setting. Furthermore, our algorithm achieves improved dependence on the problem's condition number, reducing the sample complexity from quartic, as in prior literature, to quadratic scaling, and thus matching the efficiency seen in the discounted setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's two-trajectory sampling scheme delivers quadratic sample complexity and dimension-free bounds for average-reward TD, but only under the assumption that independent chains are available.

read the letter

The main thing to know is that this work gives a concrete way to close part of the gap between average-reward and discounted TD learning. By sampling from two independent Markovian trajectories, the authors obtain quadratic sample complexity and convergence bounds that carry no explicit dimension dependence and apply uniformly to both tabular and linear function approximation settings. That is the clear advance over the quartic results in the prior literature they cite. They restore uniqueness and contractivity by working with a suitably projected Bellman equation, which lets the analysis mirror the discounted case more closely than single-trajectory approaches have managed so far. The improvement in dependence on the condition number is the practical payoff they emphasize. The analysis looks internally consistent on the claims that are stated, and the two-trajectory device is a straightforward way to break the correlations that usually inflate the rates. The soft spot is the sampling assumption itself. Many continuing-task problems supply only a single trajectory, so the requirement of two independent chains limits how directly the bounds translate to practice. The uniqueness of the projected equation also needs the features to satisfy conditions that are not automatic; if those fail, the uniform applicability and quadratic scaling do not hold. The paper does not appear to weaken those requirements or provide fallback single-trajectory rates. This is aimed at theorists who care about sample-complexity gaps in average-reward RL. A reader working on continuing tasks in operations research or robotics will see value in the improved scaling and the uniform treatment. The claims are specific enough and the approach coherent enough that the paper deserves a serious referee to check the proofs and the tightness of the assumptions.

Referee Report

2 major / 1 minor

Summary. The paper introduces a novel TD learning algorithm for average-reward policy evaluation that samples from two independent Markovian trajectories. It claims convergence to the unique solution of a properly defined projected Bellman equation, with analysis uniformly applicable to both linear function approximation and tabular settings without explicit dimension-dependent terms in the bounds, and quadratic sample complexity dependence on the condition number (improving on prior quartic scaling to match the discounted case).

Significance. If the convergence guarantees and complexity results hold under the stated assumptions, the work would meaningfully close the theoretical gap between average-reward and discounted TD learning by delivering comparable guarantees and efficiency. The uniform treatment across linear FA and tabular cases without dimension dependence would be a notable strength.

major comments (2)

Abstract: The quadratic sample-complexity claim and uniform applicability rest on the ability to sample two independent Markovian trajectories and on uniqueness of the projected Bellman equation. These assumptions are load-bearing; the manuscript must delineate the precise conditions (e.g., ergodicity requirements or feature-matrix properties) under which they hold, as violation under standard single-trajectory continuing-task settings would invalidate the stated improvements over prior quartic bounds.
Abstract and analysis sections: The claimed absence of explicit dimension-dependent terms and the reduction from quartic to quadratic scaling are not verifiable from the provided abstract alone. The key steps of the convergence proof and the explicit dependence on the condition number should be summarized in the main text with reference to the relevant lemmas or theorems.

minor comments (1)

The abstract would benefit from a concise statement of the exact form of the projected Bellman equation used to restore contractivity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and recommendation for major revision. We address each major comment below, providing clarifications and committing to revisions that will enhance the manuscript's clarity and rigor.

read point-by-point responses

Referee: Abstract: The quadratic sample-complexity claim and uniform applicability rest on the ability to sample two independent Markovian trajectories and on uniqueness of the projected Bellman equation. These assumptions are load-bearing; the manuscript must delineate the precise conditions (e.g., ergodicity requirements or feature-matrix properties) under which they hold, as violation under standard single-trajectory continuing-task settings would invalidate the stated improvements over prior quartic bounds.

Authors: We agree that clearly delineating these assumptions is essential. Our algorithm relies on the ability to generate two independent Markovian trajectories, which is a standard assumption in many average-reward settings with access to a simulator or multiple rollouts. We will revise the manuscript by adding a new subsection titled 'Assumptions and Setup' in the preliminaries, where we explicitly state: (1) the Markov chain is ergodic (irreducible and aperiodic) ensuring a unique stationary distribution π; (2) the feature matrix Φ has full column rank, guaranteeing uniqueness of the solution to the projected Bellman equation; and (3) the two trajectories are sampled independently from the stationary distribution. This will distinguish our setting from single-trajectory continuing tasks and justify the quadratic sample complexity. We will also update the abstract to briefly mention these conditions if space permits. revision: yes
Referee: Abstract and analysis sections: The claimed absence of explicit dimension-dependent terms and the reduction from quartic to quadratic scaling are not verifiable from the provided abstract alone. The key steps of the convergence proof and the explicit dependence on the condition number should be summarized in the main text with reference to the relevant lemmas or theorems.

Authors: Thank you for this feedback on presentation. Although the abstract is necessarily brief, we will expand the 'Overview of Results' or introduction section to provide a high-level summary of the proof. Specifically, we will outline: the use of two trajectories to construct an unbiased estimator for the average-reward Bellman operator; the establishment of a contraction property in the norm induced by the stationary distribution (with contraction factor depending on the mixing time but independent of dimension); and how this leads to quadratic dependence on the condition number κ of the covariance matrix (sample complexity scaling as O(κ² log(1/δ)/ε²) for accuracy ε with probability 1-δ). We will reference the key results, such as Theorem 1 for convergence and Lemma 3 for the contraction mapping, and note the absence of explicit d (dimension) terms due to the norm choice and analysis technique. This will allow readers to verify the claims from the main text. revision: yes

Circularity Check

0 steps flagged

No circularity: convergence and complexity results derived from new two-trajectory algorithm and standard assumptions

full rationale

The paper introduces a novel TD algorithm that samples from two independent Markovian trajectories to address non-contractivity in average-reward settings. It establishes convergence to the unique solution of a projected Bellman equation with dimension-independent bounds and quadratic sample complexity. These claims rest on the algorithm definition, ergodicity assumptions, and the posited uniqueness of the projected operator rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. The derivation is self-contained against external benchmarks and does not reduce the target results to quantities defined via the paper's own fitted inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the existence of a unique solution to the projected Bellman equation and the feasibility of drawing independent samples from two Markov trajectories; these are domain assumptions rather than derived quantities.

axioms (2)

domain assumption The properly defined projected Bellman equation possesses a unique solution.
Invoked to guarantee convergence of the iterates.
domain assumption Independent samples can be drawn from two Markovian trajectories of the underlying chain.
Core mechanism of the proposed algorithm.

pith-pipeline@v0.9.0 · 5485 in / 1335 out tokens · 28566 ms · 2026-05-08T19:11:58.662110+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 7 canonical work pages · 1 internal anchor

[1]

Ergodic control of continuous-time Markov chains with pathwise constraints.SIAM journal on control and optimization, 47(4):1888–1908, 2008

Tomás Prieto-Rumeau and Onésimo Hernández-Lerma. Ergodic control of continuous-time Markov chains with pathwise constraints.SIAM journal on control and optimization, 47(4):1888–1908, 2008

1908
[2]

Cambridge University Press, 2016

Vikram Krishnamurthy.Partially observed Markov decision processes. Cambridge University Press, 2016

2016
[3]

Athena Scientific, 2021

Dimitri Bertsekas and Robert Gallager.Data networks. Athena Scientific, 2021

2021
[4]

Routledge, 2021

Eitan Altman.Constrained Markov decision processes. Routledge, 2021

2021
[5]

Springer Science & Business Media, 2012

Eugene A Feinberg and Adam Shwartz.Handbook of Markov decision processes: methods and applications, volume 40. Springer Science & Business Media, 2012

2012
[6]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InConference on learning theory, pages 1691–1692. PMLR, 2018

2018
[7]

Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

Zaiwei Chen, Sheng Zhang, Thinh T Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning.Automatica, 146:110623, 2022

2022
[8]

On the performance of temporal difference learning with neural networks.arXiv preprint arXiv:2312.05397, 2023

Haoxing Tian, Ioannis Ch Paschalidis, and Alex Olshevsky. On the performance of temporal difference learning with neural networks.arXiv preprint arXiv:2312.05397, 2023

work page arXiv 2023
[9]

Finite-time error bounds for linear stochastic approximation and TD learning

Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. InConference on Learning Theory, pages 2803–2830. PMLR, 2019

2019
[10]

Average cost temporal-difference learning.Automatica, 35(11):1799– 1808, 1999

John N Tsitsiklis and Benjamin Van Roy. Average cost temporal-difference learning.Automatica, 35(11):1799– 1808, 1999

1999
[11]

Finite sample analysis of average-reward TD learning and Q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021

Sheng Zhang, Zhe Zhang, and Siva Theja Maguluri. Finite sample analysis of average-reward TD learning and Q-learning.Advances in Neural Information Processing Systems, 34:1230–1242, 2021

2021
[12]

Almost sure convergence of average reward temporal difference learning

Ethan Blaser and Shangtong Zhang. Almost sure convergence of average reward temporal difference learning. arXiv preprint arXiv:2409.19546, 2024

work page arXiv 2024
[13]

Convergence results for some temporal difference methods based on least squares.IEEE Transactions on Automatic Control, 54(7):1515–1531, 2009

Huizhen Yu and Dimitri P Bertsekas. Convergence results for some temporal difference methods based on least squares.IEEE Transactions on Automatic Control, 54(7):1515–1531, 2009

2009
[14]

Average-reward off-policy policy evaluation with function approximation

Shangtong Zhang, Yi Wan, Richard S Sutton, and Shimon Whiteson. Average-reward off-policy policy evaluation with function approximation. InInternational Conference on Machine Learning, pages 12578–12588. PMLR, 2021

2021
[15]

Stochastic first-order methods for average-reward Markov decision processes.Mathematics of Operations Research, 2024

Tianjiao Li, Feiyang Wu, and Guanghui Lan. Stochastic first-order methods for average-reward Markov decision processes.Mathematics of Operations Research, 2024

2024
[16]

Stochastic fixed-point iterations for nonexpansive maps: Convergence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219, 2024

Mario Bravo and Roberto Cominetti. Stochastic fixed-point iterations for nonexpansive maps: Convergence and error bounds.SIAM Journal on Control and Optimization, 62(1):191–219, 2024

2024
[17]

Stochastic approximation with unbounded Markovian noise: A general-purpose theorem.arXiv preprint arXiv:2410.21704, 2024

Shaan Ul Haque and Siva Theja Maguluri. Stochastic approximation with unbounded Markovian noise: A general-purpose theorem.arXiv preprint arXiv:2410.21704, 2024

work page arXiv 2024
[18]

A non-asymptotic theory of seminorm lyapunov stability: From deterministic to stoc hastic iterative algorithms,

Zaiwei Chen, Sheng Zhang, Zhe Zhang, Shaan Ul Haque, and Siva Theja Maguluri. A non-asymptotic the- ory of seminorm lyapunov stability: From deterministic to stochastic iterative algorithms.arXiv preprint arXiv:2502.14208, 2025

work page arXiv 2025
[19]

Temporal difference learning as gradient splitting

Rui Liu and Alex Olshevsky. Temporal difference learning as gradient splitting. InInternational Conference on Machine Learning, pages 6905–6913. PMLR, 2021. 10 APREPRINT- MAY5, 2026

2021
[20]

A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation.Advances in neural information processing systems, 21, 2008

Richard S Sutton, Hamid Maei, and Csaba Szepesvári. A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation.Advances in neural information processing systems, 21, 2008

2008
[21]

Implicit updates for average-reward temporal difference learning.arXiv preprint arXiv:2510.06149, 2025

Hwanwoo Kim, Dongkyu Derek Cho, and Eric Laber. Implicit updates for average-reward temporal difference learning.arXiv preprint arXiv:2510.06149, 2025

work page arXiv 2025
[22]

American Mathematical Soc., 2017

David A Levin and Yuval Peres.Markov chains and mixing times, volume 107. American Mathematical Soc., 2017

2017
[23]

Approximate temporal difference learning is a gradient descent for reversible policies.arXiv preprint arXiv:1805.00869, 2018

Yann Ollivier. Approximate temporal difference learning is a gradient descent for reversible policies.arXiv preprint arXiv:1805.00869, 2018

work page arXiv 2018
[24]

Analysis of temporal-diffference learning with function approximation

John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-diffference learning with function approximation. Advances in neural information processing systems, 9, 1996

1996
[25]

Discrete stochastic processes.Journal of the Operational Research Society, 48(1):103–103, 1997

Robert G Gallager. Discrete stochastic processes.Journal of the Operational Research Society, 48(1):103–103, 1997

1997
[26]

Springer Science & Business Media, 2012

Sean P Meyn and Richard L Tweedie.Markov chains and stochastic stability. Springer Science & Business Media, 2012

2012
[27]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17032, 2024

work page internal anchor Pith review arXiv 2024
[28]

Alegre, Ann Nowé, Ana L

Florian Felten, Lucas N. Alegre, Ann Nowé, Ana L. C. Bazzan, El Ghazali Talbi, Grégoire Danoy, and Bruno C. da Silva. A toolkit for reliable benchmarking and research in multi-objective reinforcement learning. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023. 11 APREPRINT- MAY5, 2026 A Bellman Operator and...

2023
[29]

Ifξ∈(0,1)andc 0 ≥ 2ξ aη 1 1−ξ , then E ∥δT ∥2 ≤β 1 exp − ηa 1−ξ (T+c 0)1−ξ −(τ mix +c 0)1−ξ + β(T) η(T+c 0)ξ
[30]

Throughout, we use the same notations as in Appendix C

Ifξ= 1andc 0 ≥ηa, then E ∥δT ∥2 ≤β 1 τmix +c 0 T+c 0 ηa + Γ(T) (T+c 0)q , whereq= min{1, ηa}, ˜β(T) := 2L 1β2 log(T+c 0)−loga+ 1 and Γ(T) :=    4a2 1−ηa ˜β(T),0< a <1/η, 4a2 log(T+c 0) ˜β(T), a= 1/η, 4ea2 ηa−1 ˜β(T), a >1/η. Throughout, we use the same notations as in Appendix C. D.1 Proof of Theorem 4.3 Proof of Theorem 4.3. To control Marko...

2026
[31]

whereρ min = min{c1,λ,2ρ 0}

Ifξ∈(0,1)and c0 ≥max ( ξ ac1,λ 1 1−ξ , ξ ρ0a 1 1−ξ ) , then E ∥δθ T ∥2 ≤4R 2 θ ·exp − ac1,λ 1−ξ ∆T + a G1(T) +ρ 0c2,λG2(T) c1,λ(T+c 0)ξ + 4ac2,λ 1−ξ exp ac1,λ cξ 0 ! ∆T exp − a 1−ξ ρmin∆T . whereρ min = min{c1,λ,2ρ 0}
[32]

Ifξ= 1,aη <1, andλ 2 ≥ 2η−4ρ0 rmax+2Rθ , then E ∥δθ T ∥2 ≤4R 2 θ · τmix +c 0 T+c 0 ac1,λ + Γ1 + Γ0G1(T) (T+c 0)ac1,λ , whereΓ 0 andΓ 1 are defined in Appendix F. Proof of Theorem F .1.As before, the dynamic ofθ t satisfies E∥δθ t+1∥2 ≤E∥δ θ t ∥2 + 2αtE h δθ t ⊤ g(st, s′ t, wt, θt) i +α 2 t E ∥g(st, s′ t, wt, θt)∥2 =E∥δ θ t ∥2 + 2αtE h δθ t ⊤ g(st, s′ t, w...

2026
[33]

(41) Next we controlE∥δ w j ∥2

Using1−x≤e −x, I ′ 1 ≤exp −c1,λ T−1X i=τmix αi ! = exp −ac1,λ T−1X i=τmix 1 (i+c 0)ξ ! ≤exp −ac1,λ Z T τmix dx (x+c 0)ξ ! ,(40) hence I ′ 1 ≤    τmix+c0 T+c 0 ac1,λ , ξ= 1, exp − ac1,λ 1−ξ (T+c 0)1−ξ −(τ mix +c 0)1−ξ , ξ∈(0,1). (41) Next we controlE∥δ w j ∥2. 39 APREPRINT- MAY5, 2026 Lemma F.3.Let G2(T) := 64L 1 log(T+c 0)−log(ρ 0a) + 1 + 12.(42) The...

2026
[34]

Ifξ∈(0,1)andc 0 ≥ ξ ρ0a 1 1−ξ , then E∥δw t ∥2 ≤4 exp − 2ρ0a 1−ξ (t+c 0)1−ξ −(τ mix +c 0)1−ξ +G 2(T)· ρ0a (t+c 0)ξ
[35]

(b) Ifρ 0a= 1/2, then E∥δw t ∥2 ≤4 τmix +c 0 t+c 0 2ρ0a +G 2(T)· 4ρ2 0a2 log(t+c 0) t+c 0

Ifξ= 1, then: (a) Ifρ 0a∈(0,1/2), then E∥δw t ∥2 ≤4 τmix +c 0 t+c 0 2ρ0a +G 2(T)· 4ρ2 0a2 (1−2ρ 0a)(t+c 0)2ρ0a . (b) Ifρ 0a= 1/2, then E∥δw t ∥2 ≤4 τmix +c 0 t+c 0 2ρ0a +G 2(T)· 4ρ2 0a2 log(t+c 0) t+c 0 . (c) Ifρ 0a∈(1/2,∞), then E∥δw t ∥2 ≤4 τmix +c 0 t+c 0 2ρ0a +G 2(T)· 4eρ2 0a2 (2ρ0a−1)(t+c 0) . The proof is deferred to Section F.2. We now split into t...

2026