pith. sign in

arxiv: 2509.08660 · v3 · submitted 2025-09-10 · 💻 cs.LG

Replicable Reinforcement Learning with Linear Function Approximation

Pith reviewed 2026-05-18 17:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords replicable reinforcement learninglinear function approximationlinear MDPsrandom design regressionuncentered covariance estimationgenerative modelepisodic setting
0
0 comments X

The pith

Replicable reinforcement learning with linear function approximation is now provably efficient.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops replicable algorithms for reinforcement learning that use linear function approximation and run efficiently. It begins by creating two new algorithms for replicable random design regression and uncentered covariance estimation. These are then used to build replicable methods for linear Markov decision processes, working in both the generative model where arbitrary transitions can be sampled and the episodic setting where the agent follows trajectories. If correct, this means researchers can obtain identical RL results across different runs or datasets, addressing a key source of instability in practical reinforcement learning.

Core claim

By developing efficient replicable algorithms for random design regression and uncentered covariance estimation, the authors obtain the first provably efficient replicable RL algorithms for linear Markov decision processes, applicable in both the generative model and episodic settings.

What carries the argument

Replicable random design regression and replicable uncentered covariance estimation, which produce the same output on any two independent samples from the same distribution.

If this is right

  • Replicable RL algorithms exist with polynomial sample complexity for linear MDPs.
  • The methods apply to both generative and episodic access models.
  • Experiments indicate potential for more consistent policies even with neural networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar replicability techniques could be developed for other function approximation methods in RL.
  • These algorithms might serve as a baseline for studying reproducibility in RL research.
  • Adopting replicable methods could change how RL results are validated in the field.

Load-bearing premise

The underlying MDP has transition and reward functions that are linear in a known feature map of dimension d.

What would settle it

A direct test where the algorithm is executed on two separate sample sets drawn from the same linear MDP distribution, checking if the returned policies or value estimates are exactly the same.

Figures

Figures reproduced from arXiv: 2509.08660 by Aaron Roth, Eric Eaton, Jessica Sorrell, Marcel Hussing, Michael Kearns, Sikata Bela Sengupta.

Figure 1
Figure 1. Figure 1: Mean return and percentage of most common identical weight vector. "Not replicable" indicates a baseline with￾out regularization and rounding. Using only a fraction of the data is sufficient to achieve replicability. To show that our algorithms do not require impractically large amounts of data, we implement a version of fitted Q￾iteration [EGW05] with replicable rounding akin to our generative model algor… view at source ↗
Figure 2
Figure 2. Figure 2: Mean return and agreement on MsPacman (left) [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
read the original abstract

Replication of experimental results has been a challenge faced by many scientific disciplines, including the field of machine learning. Recent work on the theory of machine learning has formalized replicability as the demand that an algorithm produce identical outcomes when executed twice on different samples from the same distribution. Provably replicable algorithms are especially interesting for reinforcement learning (RL), where algorithms are known to be unstable in practice. While replicable algorithms exist for tabular RL settings, extending these guarantees to more practical function approximation settings has remained an open problem. In this work, we make progress by developing replicable methods for linear function approximation in RL. We first introduce two efficient algorithms for replicable random design regression and uncentered covariance estimation, each of independent interest. We then leverage these tools to provide the first provably efficient replicable RL algorithms for linear Markov decision processes in both the generative model and episodic settings. Finally, we evaluate our algorithms experimentally and show how they can inspire more consistent neural policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces two new algorithms for replicable random-design regression and uncentered covariance estimation. These tools are then composed into the first provably efficient replicable RL algorithms for linear MDPs, covering both the generative-model setting and the episodic setting. The work includes experimental evaluations on neural policies to illustrate improved consistency.

Significance. If the theoretical claims hold, the contribution is significant: it closes an open gap by extending replicability guarantees from tabular RL to the practically relevant linear function-approximation regime while preserving polynomial sample complexity. The standalone regression and covariance estimators are of independent interest and may be reusable in other statistical learning settings. The experimental section provides concrete evidence that the approach can inspire more stable neural policies.

major comments (1)
  1. [§4.2] §4.2, the episodic algorithm: the replicability parameter for the overall procedure is obtained by composing the regression and covariance estimators inside least-squares value iteration; an explicit calculation showing that the total failure probability and sample complexity remain polynomial (without an extra factor exponential in the horizon or number of iterations) is required to support the central efficiency claim.
minor comments (2)
  1. [Abstract] The abstract states that the algorithms are evaluated experimentally but gives no information on the environments, baselines, or metrics; a one-sentence summary would improve readability.
  2. [§2] Notation for the feature dimension d and the replicability parameter ε is introduced in §2 but occasionally reused with different subscripts later; a short notation table would eliminate ambiguity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and the constructive comment on the episodic algorithm. We address the point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4.2] §4.2, the episodic algorithm: the replicability parameter for the overall procedure is obtained by composing the regression and covariance estimators inside least-squares value iteration; an explicit calculation showing that the total failure probability and sample complexity remain polynomial (without an extra factor exponential in the horizon or number of iterations) is required to support the central efficiency claim.

    Authors: We agree that an explicit calculation strengthens the presentation. In the manuscript the episodic algorithm invokes the replicable regression and covariance estimators inside each of the H iterations of least-squares value iteration. The failure probability and replicability parameter supplied to each estimator call are set to scale as O(1/(H · poly(d, 1/ε))), so that a union bound over the horizon yields an overall failure probability that remains inverse-polynomial while the total sample complexity stays polynomial in H, d and 1/ε with no exponential factor in H. To make this composition fully transparent, the revised version will include a dedicated lemma (placed in the appendix) that spells out the precise parameter choices and the resulting bounds on the aggregate failure probability and sample complexity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper introduces independent replicable estimators for random-design regression and uncentered covariance, then composes them via standard least-squares value/policy iteration into algorithms for linear MDPs. No quoted equations or steps reduce a claimed result to a fitted parameter or self-citation by construction; replicability is enforced directly by the new algorithmic design under the stated linear feature assumptions. The central claims therefore rest on external construction rather than internal redefinition or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests primarily on the standard linear MDP assumption rather than new free parameters or invented entities; replicability definitions are taken from prior literature.

axioms (1)
  • domain assumption The MDP is linear, i.e., transition probabilities and rewards are linear functions of a known feature map
    Invoked to enable linear function approximation throughout the generative and episodic analyses.

pith-pipeline@v0.9.0 · 5703 in / 1147 out tokens · 37662 ms · 2026-05-18T17:13:45.064957+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Replicable Composition

    cs.LG 2026-04 unverdicted novelty 8.0

    Replicable algorithms for heterogeneous problems can be composed with O(sum n_i) samples at constant replicability via conversion to perfectly generalizing algorithms, privacy-style composition, and correlated sampling.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    OntheStructureofReplicableHypothesisTesters

    [AACN+25] Anders Aamand, Maryam Aliakbarpour, Justin Y. Chen, Shyam Narayanan, and Sandeep Silwal.“OntheStructureofReplicableHypothesisTesters”.In:arXiv preprint arXiv:2507.02842 (2025). [ABB24] Saba Ahmadi, Siddharth Bhandari, and Avrim Blum. “Replicable Online Learning”. In:arXiv preprint arXiv:2411.13730(2024). [AJJK+22] Kwangjun Ahn, Prateek Jain, Ziw...

  2. [2]

    Local Rademacher complexities

    Elsevier, 1995, pp. 30–37. [BBM05] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. “Local Rademacher complexities”. In:The Annals of Statistics33.4 (2005), pp. 1497–1537. [BC17] Heinz H. Bauschke and Patrick L. Combettes.Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2nd. Springer Publishing Company, Incorporated, 2017.isbn: 33...

  3. [3]

    The Arcade Learning Environment: An Evaluation Platform for General Agents

    [BNVB13] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. “The Arcade Learning Environment: An Evaluation Platform for General Agents”. In:Journal of Artificial Intelligence Research47 (June 2013), pp. 253–279. [BSA90] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. “Neuronlike adaptive elements that can solve difficult learning control...

  4. [4]

    Let's Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments

    [CTFJ19] Kaleigh Clary, Emma Tosch, John Foley, and David D. Jensen. “Let’s Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments”. In:ArXivabs/1904.06312 (2019). 12 [CYJW20] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. “Provably Efficient Exploration in Policy Optimization”. In:Proceedings of the 37th International Confe...

  5. [5]

    Tree-based batch mode reinforcement learning

    [EGW05] Damien Ernst, Pierre Geurts, and Louis Wehenkel. “Tree-based batch mode reinforcement learning”. In:Journal of Machine Learning Research6 (2005). [EHKS23] Eric Eaton, Marcel Hussing, Michael Kearns, and Jessica Sorrell. “Replicable Reinforcement Learning”. In:Advances in Neural Information Processing Systems. Vol

  6. [6]

    Deep Reinforcement Learning and the Deadly Triad

    [GFF15] Javier García, Fern, and o Fernández. “A Comprehensive Survey on Safe Reinforcement Learning”. In:Journal of Machine Learning Research16.42 (2015), pp. 1437–1480. [HDSH+18] Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. “Deep Reinforcement Learning and the Deadly Triad”. In:arXiv preprint arXiv:1...

  7. [7]

    Replicability in High Dimensional Statistics

    [HIKL+24] Max Hopkins, Russell Impagliazzo, Daniel Kane, Sihan Liu, and Christopher Ye. “Replicability in High Dimensional Statistics”. In:arXiv preprint arXiv:2406.02628(2024). [HK70] Arthur E Hoerl and Robert W Kennard. “Ridge regression: Biased estimation for nonorthogonal problems”. In:Technometrics12.1 (1970), pp. 55–67. 13 [HLYY25] Max Hopkins, Siha...

  8. [8]

    Model-Based Reinforcement Learning with Value-Targeted Regression

    [JYSW20] Zeyu Jia, Lin Yang, Csaba Szepesvari, and Mengdi Wang. “Model-Based Reinforcement Learning with Value-Targeted Regression”. In:Proceedings of the 2nd Conference on Learning for Dynamics and Control. 2020, pp. 666–686. [JYWJ20] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. “Provably efficient reinforcement learning with linear functio...

  9. [9]

    Replicability is Asymptoti- cally Free in Multi-armed Bandits

    [KIYK24] Junpei Komiyama, Shinji Ito, Yuichi Yoshida, and Souta Koshino. “Replicability is Asymptoti- cally Free in Multi-armed Bandits”. In:arXiv preprint arXiv:2402.07391(2024). [KKLV+24] Alkis Kalavasis, Amin Karbasi, Kasper Green Larsen, Grigoris Velegkas, and Felix Zhou. “Replicable Learning of Large-Margin Halfspaces”. In:Proceedings of the 41st Int...

  10. [10]

    Algorithmic Stability and Sanity-Check Bounds for Leave- One-Out Cross-Validation

    [KR99] Michael Kearns and Dana Ron. “Algorithmic Stability and Sanity-Check Bounds for Leave- One-Out Cross-Validation”. In:Neural Computation11.6 (1999), pp. 1427–1453. [KS98] Michael Kearns and Satinder Singh. “Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms”. In:Advances in Neural Information Processing Systems11 (1998). [KVYZ23]...

  11. [11]

    A Vector-Contraction Inequality for Rademacher Complexities

    [Mau16] Andreas Maurer. “A Vector-Contraction Inequality for Rademacher Complexities”. In:Algo- rithmic Learning Theory. 2016, pp. 3–17. [MCKJ+24] Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. “Model- Free Representation Learning and Exploration in Low-Rank MDPs”. In:Journal of Machine Learning Research25.6 (2024), pp. 1–7...

  12. [12]

    Improving Reproducibility in Machine Learning Research(A Report from the NeurIPS 2019 Reproducibility Program)

    [Put94] Martin L. Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st. USA: John Wiley & Sons, Inc., 1994.isbn: 0471619779. [PVSL+21] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Lariviere, Alina Beygelzimer, Florence d’Alche-Buc, Emily Fox, and Hugo Larochelle. “Improving Reproducibility in Machine Learning...

  13. [13]

    D3rlpy: An Offline Deep Reinforcement Learning Library

    [SI22] Takuma Seno and Michita Imai. “D3rlpy: An Offline Deep Reinforcement Learning Library”. In: 23.1 (2022). [TS93] Sebastian Thrun and Anton Schwartz. “Issues in Using Function Approximation for Reinforce- ment Learning”. In:Proceedings of the 1993 Connectionist Models Summer School

  14. [14]

    Markov Decision Processes with Imprecise Transition Probabilities

    [WE94] Chelsea C. White and Hany K. Eldeib. “Markov Decision Processes with Imprecise Transition Probabilities”. In:Operations Research42.4 (1994), pp. 739–749. [WWDK21] Yining Wang, Ruosong Wang, Simon Shaolei Du, and Akshay Krishnamurthy. “Optimism in Reinforcement Learning with Generalized Linear Function Approximation”. In:International Conference on ...

  15. [15]

    Siyuan Zhang and Nan Jiang

    [YPBD+24] Omar G. Younis, Rodrigo Perez-Vicente, John U. Balis, Will Dudley, Alex Davey, and Jordan K Terry.Minari. Version 0.5.0. Sept. 2024.doi: 10.5281/zenodo.13767625 .url: https://doi.org/10.5281/zenodo.13767625. [YW19] Lin Yang and Mengdi Wang. “Sample-Optimal Parametric Q-Learning Using Linearly Additive Features”. In:Proceedings of the 36th Intern...

  16. [16]

    Making Linear MDPs Practical via Contrastive Representation Learning

    [ZRYG+22] Tianjun Zhang, Tongzheng Ren, Mengjiao Yang, Joseph Gonzalez, Dale Schuurmans, and Bo Dai. “Making Linear MDPs Practical via Contrastive Representation Learning”. In:Proceedings of the 39th International Conference on Machine Learning. 2022, pp. 26447–26466. [ZZG23] Junkai Zhang, Weitong Zhang, and Quanquan Gu. “Optimal Horizon-Free Reward-Free ...