Replicable Reinforcement Learning with Linear Function Approximation
Pith reviewed 2026-05-18 17:13 UTC · model grok-4.3
The pith
Replicable reinforcement learning with linear function approximation is now provably efficient.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By developing efficient replicable algorithms for random design regression and uncentered covariance estimation, the authors obtain the first provably efficient replicable RL algorithms for linear Markov decision processes, applicable in both the generative model and episodic settings.
What carries the argument
Replicable random design regression and replicable uncentered covariance estimation, which produce the same output on any two independent samples from the same distribution.
If this is right
- Replicable RL algorithms exist with polynomial sample complexity for linear MDPs.
- The methods apply to both generative and episodic access models.
- Experiments indicate potential for more consistent policies even with neural networks.
Where Pith is reading between the lines
- Similar replicability techniques could be developed for other function approximation methods in RL.
- These algorithms might serve as a baseline for studying reproducibility in RL research.
- Adopting replicable methods could change how RL results are validated in the field.
Load-bearing premise
The underlying MDP has transition and reward functions that are linear in a known feature map of dimension d.
What would settle it
A direct test where the algorithm is executed on two separate sample sets drawn from the same linear MDP distribution, checking if the returned policies or value estimates are exactly the same.
Figures
read the original abstract
Replication of experimental results has been a challenge faced by many scientific disciplines, including the field of machine learning. Recent work on the theory of machine learning has formalized replicability as the demand that an algorithm produce identical outcomes when executed twice on different samples from the same distribution. Provably replicable algorithms are especially interesting for reinforcement learning (RL), where algorithms are known to be unstable in practice. While replicable algorithms exist for tabular RL settings, extending these guarantees to more practical function approximation settings has remained an open problem. In this work, we make progress by developing replicable methods for linear function approximation in RL. We first introduce two efficient algorithms for replicable random design regression and uncentered covariance estimation, each of independent interest. We then leverage these tools to provide the first provably efficient replicable RL algorithms for linear Markov decision processes in both the generative model and episodic settings. Finally, we evaluate our algorithms experimentally and show how they can inspire more consistent neural policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces two new algorithms for replicable random-design regression and uncentered covariance estimation. These tools are then composed into the first provably efficient replicable RL algorithms for linear MDPs, covering both the generative-model setting and the episodic setting. The work includes experimental evaluations on neural policies to illustrate improved consistency.
Significance. If the theoretical claims hold, the contribution is significant: it closes an open gap by extending replicability guarantees from tabular RL to the practically relevant linear function-approximation regime while preserving polynomial sample complexity. The standalone regression and covariance estimators are of independent interest and may be reusable in other statistical learning settings. The experimental section provides concrete evidence that the approach can inspire more stable neural policies.
major comments (1)
- [§4.2] §4.2, the episodic algorithm: the replicability parameter for the overall procedure is obtained by composing the regression and covariance estimators inside least-squares value iteration; an explicit calculation showing that the total failure probability and sample complexity remain polynomial (without an extra factor exponential in the horizon or number of iterations) is required to support the central efficiency claim.
minor comments (2)
- [Abstract] The abstract states that the algorithms are evaluated experimentally but gives no information on the environments, baselines, or metrics; a one-sentence summary would improve readability.
- [§2] Notation for the feature dimension d and the replicability parameter ε is introduced in §2 but occasionally reused with different subscripts later; a short notation table would eliminate ambiguity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the constructive comment on the episodic algorithm. We address the point below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [§4.2] §4.2, the episodic algorithm: the replicability parameter for the overall procedure is obtained by composing the regression and covariance estimators inside least-squares value iteration; an explicit calculation showing that the total failure probability and sample complexity remain polynomial (without an extra factor exponential in the horizon or number of iterations) is required to support the central efficiency claim.
Authors: We agree that an explicit calculation strengthens the presentation. In the manuscript the episodic algorithm invokes the replicable regression and covariance estimators inside each of the H iterations of least-squares value iteration. The failure probability and replicability parameter supplied to each estimator call are set to scale as O(1/(H · poly(d, 1/ε))), so that a union bound over the horizon yields an overall failure probability that remains inverse-polynomial while the total sample complexity stays polynomial in H, d and 1/ε with no exponential factor in H. To make this composition fully transparent, the revised version will include a dedicated lemma (placed in the appendix) that spells out the precise parameter choices and the resulting bounds on the aggregate failure probability and sample complexity. revision: yes
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper introduces independent replicable estimators for random-design regression and uncentered covariance, then composes them via standard least-squares value/policy iteration into algorithms for linear MDPs. No quoted equations or steps reduce a claimed result to a fitted parameter or self-citation by construction; replicability is enforced directly by the new algorithmic design under the stated linear feature assumptions. The central claims therefore rest on external construction rather than internal redefinition or load-bearing self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The MDP is linear, i.e., transition probabilities and rewards are linear functions of a known feature map
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We first introduce two efficient algorithms for replicable random design regression and uncentered covariance estimation... We then leverage these tools to provide the first provably efficient replicable RL algorithms for linear Markov decision processes
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat.equivNat unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Algorithm 2 R-Ridge-Regression... Compute bw = (∑ x_i x_i^T + λI)^{-1} ∑ x_j y_j then w = R-Hypergrid-Rounding(bw, r, ...)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Replicable Composition
Replicable algorithms for heterogeneous problems can be composed with O(sum n_i) samples at constant replicability via conversion to perfectly generalizing algorithms, privacy-style composition, and correlated sampling.
Reference graph
Works this paper leans on
-
[1]
OntheStructureofReplicableHypothesisTesters
[AACN+25] Anders Aamand, Maryam Aliakbarpour, Justin Y. Chen, Shyam Narayanan, and Sandeep Silwal.“OntheStructureofReplicableHypothesisTesters”.In:arXiv preprint arXiv:2507.02842 (2025). [ABB24] Saba Ahmadi, Siddharth Bhandari, and Avrim Blum. “Replicable Online Learning”. In:arXiv preprint arXiv:2411.13730(2024). [AJJK+22] Kwangjun Ahn, Prateek Jain, Ziw...
-
[2]
Elsevier, 1995, pp. 30–37. [BBM05] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. “Local Rademacher complexities”. In:The Annals of Statistics33.4 (2005), pp. 1497–1537. [BC17] Heinz H. Bauschke and Patrick L. Combettes.Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2nd. Springer Publishing Company, Incorporated, 2017.isbn: 33...
work page 1995
-
[3]
The Arcade Learning Environment: An Evaluation Platform for General Agents
[BNVB13] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. “The Arcade Learning Environment: An Evaluation Platform for General Agents”. In:Journal of Artificial Intelligence Research47 (June 2013), pp. 253–279. [BSA90] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. “Neuronlike adaptive elements that can solve difficult learning control...
work page 2013
-
[4]
Let's Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments
[CTFJ19] Kaleigh Clary, Emma Tosch, John Foley, and David D. Jensen. “Let’s Play Again: Variability of Deep Reinforcement Learning Agents in Atari Environments”. In:ArXivabs/1904.06312 (2019). 12 [CYJW20] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. “Provably Efficient Exploration in Policy Optimization”. In:Proceedings of the 37th International Confe...
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[5]
Tree-based batch mode reinforcement learning
[EGW05] Damien Ernst, Pierre Geurts, and Louis Wehenkel. “Tree-based batch mode reinforcement learning”. In:Journal of Machine Learning Research6 (2005). [EHKS23] Eric Eaton, Marcel Hussing, Michael Kearns, and Jessica Sorrell. “Replicable Reinforcement Learning”. In:Advances in Neural Information Processing Systems. Vol
work page 2005
-
[6]
Deep Reinforcement Learning and the Deadly Triad
[GFF15] Javier García, Fern, and o Fernández. “A Comprehensive Survey on Safe Reinforcement Learning”. In:Journal of Machine Learning Research16.42 (2015), pp. 1437–1480. [HDSH+18] Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. “Deep Reinforcement Learning and the Deadly Triad”. In:arXiv preprint arXiv:1...
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[7]
Replicability in High Dimensional Statistics
[HIKL+24] Max Hopkins, Russell Impagliazzo, Daniel Kane, Sihan Liu, and Christopher Ye. “Replicability in High Dimensional Statistics”. In:arXiv preprint arXiv:2406.02628(2024). [HK70] Arthur E Hoerl and Robert W Kennard. “Ridge regression: Biased estimation for nonorthogonal problems”. In:Technometrics12.1 (1970), pp. 55–67. 13 [HLYY25] Max Hopkins, Siha...
-
[8]
Model-Based Reinforcement Learning with Value-Targeted Regression
[JYSW20] Zeyu Jia, Lin Yang, Csaba Szepesvari, and Mengdi Wang. “Model-Based Reinforcement Learning with Value-Targeted Regression”. In:Proceedings of the 2nd Conference on Learning for Dynamics and Control. 2020, pp. 666–686. [JYWJ20] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. “Provably efficient reinforcement learning with linear functio...
work page 2020
-
[9]
Replicability is Asymptoti- cally Free in Multi-armed Bandits
[KIYK24] Junpei Komiyama, Shinji Ito, Yuichi Yoshida, and Souta Koshino. “Replicability is Asymptoti- cally Free in Multi-armed Bandits”. In:arXiv preprint arXiv:2402.07391(2024). [KKLV+24] Alkis Kalavasis, Amin Karbasi, Kasper Green Larsen, Grigoris Velegkas, and Felix Zhou. “Replicable Learning of Large-Margin Halfspaces”. In:Proceedings of the 41st Int...
-
[10]
Algorithmic Stability and Sanity-Check Bounds for Leave- One-Out Cross-Validation
[KR99] Michael Kearns and Dana Ron. “Algorithmic Stability and Sanity-Check Bounds for Leave- One-Out Cross-Validation”. In:Neural Computation11.6 (1999), pp. 1427–1453. [KS98] Michael Kearns and Satinder Singh. “Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms”. In:Advances in Neural Information Processing Systems11 (1998). [KVYZ23]...
work page 1999
-
[11]
A Vector-Contraction Inequality for Rademacher Complexities
[Mau16] Andreas Maurer. “A Vector-Contraction Inequality for Rademacher Complexities”. In:Algo- rithmic Learning Theory. 2016, pp. 3–17. [MCKJ+24] Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. “Model- Free Representation Learning and Exploration in Low-Rank MDPs”. In:Journal of Machine Learning Research25.6 (2024), pp. 1–7...
work page 2016
-
[12]
[Put94] Martin L. Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st. USA: John Wiley & Sons, Inc., 1994.isbn: 0471619779. [PVSL+21] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Lariviere, Alina Beygelzimer, Florence d’Alche-Buc, Emily Fox, and Hugo Larochelle. “Improving Reproducibility in Machine Learning...
work page 1994
-
[13]
D3rlpy: An Offline Deep Reinforcement Learning Library
[SI22] Takuma Seno and Michita Imai. “D3rlpy: An Offline Deep Reinforcement Learning Library”. In: 23.1 (2022). [TS93] Sebastian Thrun and Anton Schwartz. “Issues in Using Function Approximation for Reinforce- ment Learning”. In:Proceedings of the 1993 Connectionist Models Summer School
work page 2022
-
[14]
Markov Decision Processes with Imprecise Transition Probabilities
[WE94] Chelsea C. White and Hany K. Eldeib. “Markov Decision Processes with Imprecise Transition Probabilities”. In:Operations Research42.4 (1994), pp. 739–749. [WWDK21] Yining Wang, Ruosong Wang, Simon Shaolei Du, and Akshay Krishnamurthy. “Optimism in Reinforcement Learning with Generalized Linear Function Approximation”. In:International Conference on ...
work page 1994
-
[15]
[YPBD+24] Omar G. Younis, Rodrigo Perez-Vicente, John U. Balis, Will Dudley, Alex Davey, and Jordan K Terry.Minari. Version 0.5.0. Sept. 2024.doi: 10.5281/zenodo.13767625 .url: https://doi.org/10.5281/zenodo.13767625. [YW19] Lin Yang and Mengdi Wang. “Sample-Optimal Parametric Q-Learning Using Linearly Additive Features”. In:Proceedings of the 36th Intern...
-
[16]
Making Linear MDPs Practical via Contrastive Representation Learning
[ZRYG+22] Tianjun Zhang, Tongzheng Ren, Mengjiao Yang, Joseph Gonzalez, Dale Schuurmans, and Bo Dai. “Making Linear MDPs Practical via Contrastive Representation Learning”. In:Proceedings of the 39th International Conference on Machine Learning. 2022, pp. 26447–26466. [ZZG23] Junkai Zhang, Weitong Zhang, and Quanquan Gu. “Optimal Horizon-Free Reward-Free ...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.