arxiv: 2605.01242 · v1 · submitted 2026-05-02 · 💻 cs.LG

Recognition: unknown

Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

Ruiquan Huang , Donghao Li , Yingbin Liang , Jing Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-09 15:09 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningactor-criticlow-rank MDPspolicy evaluationsample complexityfunction approximationoptimistic algorithms

0 comments

The pith

An optimistic actor-critic algorithm achieves better sample efficiency in low-rank MDPs by using only the policy evaluation oracle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that policy evaluation is the cheapest oracle among common RL assumptions when supervised learning is tractable. It introduces an optimistic actor-critic procedure that performs both evaluation and improvement steps through this oracle alone. The resulting method improves existing sample-complexity bounds for low-rank MDPs and removes the need for planning or optimization oracles that prior algorithms require. The theory extends to approximately low-rank MDPs, which the authors argue cover many practical environments, and is checked with experiments on standard Gym tasks.

Core claim

We propose a novel optimistic actor-critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm outperforms the existing sample complexity guarantees for low-rank MDPs while avoiding computationally expensive planning or optimization oracles commonly assumed in prior works. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments.

What carries the argument

The optimistic actor-critic algorithm that performs policy improvement and evaluation using only calls to a policy evaluation oracle, with optimism to encourage exploration.

Load-bearing premise

The MDP is low-rank or approximately low-rank and that supervised learning can efficiently implement the policy evaluation oracle.

What would settle it

An experiment or calculation showing that the algorithm's sample complexity is no better than prior methods that use planning oracles, or that policy evaluation is not computationally cheaper than those oracles in practice.

Figures

Figures reproduced from arXiv: 2605.01242 by Donghao Li, Jing Yang, Ruiquan Huang, Yingbin Liang.

**Figure 1.** Figure 1: Evaluations over training. Steps before the dash line is uniform random exploration. y-axis is the value of one episode. Results are averaged across 4 different random seeds. 29 view at source ↗

read the original abstract

Reinforcement learning (RL) is a fundamental framework for sequential decision-making, in which an agent learns an optimal policy through interactions with an unknown environment. In settings with function approximation, many existing RL algorithms achieve favorable sample complexity, but often rely on computationally intractable oracles. In this paper, we use supervised learning as a computational proxy to establish a clear hierarchy of commonly adopted RL oracles under low-rank Markov Decision Processes (MDPs). This hierarchy shows that policy evaluation is the most computationally efficient oracle, provided that supervised learning can be efficiently solved. Motivated by this observation, we propose a novel optimistic actor-critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm outperforms the existing sample complexity guarantees for low-rank MDPs while avoiding computationally expensive planning or optimization oracles commonly assumed in prior works. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments. Finally, we validate our theoretical results with experiments on several standard Gym environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives an optimistic actor-critic for low-rank MDPs that uses only a policy evaluation oracle and claims better sample complexity than methods needing planning oracles.

read the letter

The main takeaway is that the authors build an actor-critic around the lightest oracle in their hierarchy and still improve the sample bounds for low-rank MDPs while dropping the heavier planning and optimization oracles used before. They treat supervised learning as the cost proxy, rank the oracles accordingly, design the algorithm to call only policy evaluation, prove the rates, extend the argument to approximate low-rank MDPs, and run quick checks on Gym tasks. That combination is the concrete advance over prior low-rank MDP work. The hierarchy itself is a clean way to make the computational claim explicit rather than implicit. The approximate low-rank extension is a reasonable step toward broader applicability, and the experiments at least confirm the method runs on standard environments without obvious breakage. The proofs appear internally consistent with the stated assumptions, and the oracle reduction is a direct consequence of how they set up the algorithm. The soft spots are in the scale of the improvement and the experiments. The sample-complexity gain will depend on the precise dependence on rank, horizon, and function-class parameters; if those constants are not markedly better than the baselines once written out, the practical edge narrows. The Gym tests are on small, low-dimensional tasks, so they do not probe whether the method remains stable when the low-rank structure is only approximate in higher-dimensional or more realistic settings. This is for readers already working on sample-efficient RL with function approximation who care about matching statistical rates to computational cost. Someone who follows low-rank MDP papers will see the difference immediately. It deserves a serious referee because the claims are specific, the algorithm is well-defined, and there is both a proof sketch and empirical support to evaluate. I would send it out for review.

Referee Report

0 major / 3 minor

Summary. The paper establishes a hierarchy of RL oracles under low-rank MDPs using supervised learning as a computational proxy, demonstrating that policy evaluation is the most efficient oracle. It introduces an optimistic actor-critic algorithm that relies solely on the policy evaluation oracle, proves improved sample complexity bounds compared to prior methods that use planning or optimization oracles, extends the analysis to approximately low-rank MDPs, and validates the results empirically on standard Gym environments.

Significance. If the derivations hold, the work is significant for advancing computationally tractable provably efficient RL in low-rank settings. The oracle hierarchy provides a clear computational ranking, the actor-critic design avoids expensive oracles while improving sample complexity, the approximate low-rank extension broadens relevance to real-world problems, and the empirical validation on Gym tasks offers practical support.

minor comments (3)

[Abstract] The abstract states that the algorithm 'outperforms the existing sample complexity guarantees' but does not specify the quantitative improvement or cite the exact prior bounds being improved upon; adding a brief comparison would strengthen the claim.
[Theoretical Results] The extension to approximately low-rank MDPs is presented as capturing broad real-world environments, but the manuscript should clarify the approximation error tolerance and how it affects the sample complexity bound in the main theorem.
[Experiments] In the experimental section, details on how the low-rank structure is enforced or approximated in the Gym environments (e.g., via feature construction or rank estimation) would improve reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We are grateful to the referee for the detailed summary of our manuscript and for recommending minor revision. The referee correctly captures the main contributions: the oracle hierarchy under low-rank MDPs, the optimistic actor-critic algorithm using only the policy evaluation oracle with improved sample complexity, the extension to approximately low-rank MDPs, and the empirical validation. As no specific major comments are listed, we do not have point-by-point responses. We will address any minor issues in the revised version.

Circularity Check

0 steps flagged

No significant circularity in oracle hierarchy or sample-complexity derivation

full rationale

The paper first defines a computational hierarchy among RL oracles by treating supervised learning as an independent proxy under the low-rank MDP assumption; this step does not presuppose the final algorithm or its bounds. The optimistic actor-critic is then explicitly constructed to invoke only the policy-evaluation oracle, and the sample-complexity proof proceeds from standard concentration and optimism arguments that do not reduce to a fitted parameter or self-referential definition. The approximate-low-rank extension is stated as a separate, explicit relaxation. No load-bearing self-citation, ansatz smuggling, or renaming of known results is required for the central claim. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the low-rank MDP assumption and the premise that supervised learning is computationally efficient; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption The MDP has (approximately) low-rank structure
Invoked to obtain sample-complexity guarantees and to justify the oracle hierarchy.
domain assumption Supervised learning can be solved efficiently
Used as the computational proxy that makes policy evaluation the cheapest oracle.

pith-pipeline@v0.9.0 · 5485 in / 1314 out tokens · 30686 ms · 2026-05-09T15:09:27.503467+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

82 extracted references · 25 canonical work pages · 1 internal anchor

[1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[3]

M. J. Kearns , title =
[4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[6]

Suppressed for Anonymity , author=
[7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[9]

Benchmarking Model-Based Reinforcement Learning

Benchmarking model-based reinforcement learning , author=. arXiv preprint arXiv:1907.02057 , year=

work page Pith review arXiv 1907
[10]

arXiv preprint arXiv:2502.08632 , year=

Necessary and Sufficient Oracles: Toward a Computational Taxonomy For Reinforcement Learning , author=. arXiv preprint arXiv:2502.08632 , year=

work page arXiv
[11]

Advances in neural information processing systems , volume=

Random features for large-scale kernel machines , author=. Advances in neural information processing systems , volume=
[12]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

Gymnasium: A standard interface for reinforcement learning environments , author=. arXiv preprint arXiv:2407.17032 , year=

work page internal anchor Pith review arXiv
[13]

Journal of Optimization Theory and Applications , volume=

On a computationally ill-behaved bilevel problem with a continuous and nonconvex lower level , author=. Journal of Optimization Theory and Applications , volume=. 2023 , publisher=

2023
[14]

Advances in Neural Information Processing Systems , volume=

Enhanced bilevel optimization via bregman distance , author=. Advances in Neural Information Processing Systems , volume=
[15]

Advances in Neural Information Processing Systems , volume=

Provably efficient q-learning with low switching cost , author=. Advances in Neural Information Processing Systems , volume=
[16]

Advances in neural information processing systems , volume=

Near-optimal regret bounds for reinforcement learning , author=. Advances in neural information processing systems , volume=
[17]

International conference on machine learning , pages=

Minimax regret bounds for reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=

2017
[18]

Advances in neural information processing systems , volume=

Is Q-learning provably efficient? , author=. Advances in neural information processing systems , volume=
[19]

International Conference on Artificial Intelligence and Statistics , pages=

On the model-misspecification in reinforcement learning , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2024 , organization=

2024
[20]

Uncertainty in Artificial Intelligence , pages=

A free lunch from the noise: Provable and practical exploration for representation learning , author=. Uncertainty in Artificial Intelligence , pages=. 2022 , organization=

2022
[21]

International Conference on Machine Learning , pages=

Making linear mdps practical via contrastive representation learning , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022
[22]

International Conference on Machine Learning , pages=

Ucb momentum q-learning: Correcting the bias without forgetting , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[23]

Advances in neural information processing systems , volume=

Optimistic posterior sampling for reinforcement learning: worst-case regret bounds , author=. Advances in neural information processing systems , volume=
[24]

NeurIPS 2025 Workshop: Second Workshop on Aligning Reinforcement Learning Experimentalists and Theorists , year=

Optimistic Actor-Critic with Parametric Policies: Unifying Sample Efficiency and Practicality , author=. NeurIPS 2025 Workshop: Second Workshop on Aligning Reinforcement Learning Experimentalists and Theorists , year=

2025
[25]

arXiv preprint arXiv:2505.03710 , year=

Actor-Critics Can Achieve Optimal Sample Efficiency , author=. arXiv preprint arXiv:2505.03710 , year=

work page arXiv
[26]

Advances in Neural Information Processing Systems , volume=

Efficient model-free exploration in low-rank mdps , author=. Advances in Neural Information Processing Systems , volume=
[27]

arXiv preprint arXiv:2303.10859 , year=

Improved sample complexity for reward-free reinforcement learning under low-rank mdps , author=. arXiv preprint arXiv:2303.10859 , year=

work page arXiv
[28]

2021 , eprint=

Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms , author=. 2021 , eprint=

2021
[29]

CoRR , volume =

Weitong Zhang and Jiafan He and Dongruo Zhou and Amy Zhang and Quanquan Gu , title =. CoRR , volume =
[30]

2021 , eprint=

A Simple Reward-free Approach to Constrained Reinforcement Learning , author=. 2021 , eprint=

2021
[31]

Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited , author=
[32]

Eitan Altman and Inmanysituationsintheoptimizationofdynamicsystems Asingleutility , title =
[33]

Proceedings of the 37th International Conference on Machine Learning , pages =

Reward-Free Exploration for Reinforcement Learning , author =. Proceedings of the 37th International Conference on Machine Learning , pages =
[34]

Proceedings of the 38th International Conference on Machine Learning , pages =

Near Optimal Reward-Free Reinforcement Learning , author =. Proceedings of the 38th International Conference on Machine Learning , pages =
[35]

FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs , volume =

Agarwal, Alekh and Kakade, Sham and Krishnamurthy, Akshay and Sun, Wen , booktitle =. FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs , volume =
[36]

arXiv preprint arXiv:2110.04652 , year=

Representation Learning for Online and Offline RL in Low-rank MDPs , author=. arXiv preprint arXiv:2110.04652 , year=

work page arXiv
[37]

International Conference on Machine Learning , pages=

Contextual decision processes with low Bellman rank are PAC-learnable , author=. International Conference on Machine Learning , pages=. 2017 , organization=

2017
[38]

arXiv preprint arXiv:2006.11274 , year=

On reward-free reinforcement learning with linear function approximation , author=. arXiv preprint arXiv:2006.11274 , year=

work page arXiv 2006
[39]

Conference on Learning Theory , pages=

Provably efficient reinforcement learning with linear function approximation , author=. Conference on Learning Theory , pages=. 2020 , organization=

2020
[40]

arXiv preprint arXiv:2102.07035 , year=

Model-free representation learning and exploration in low-rank MDPs , author=. arXiv preprint arXiv:2102.07035 , year=

work page arXiv
[41]

International Conference on Machine Learning , pages=

Learning near optimal policies with low inherent bellman error , author=. International Conference on Machine Learning , pages=. 2020 , organization=

2020
[42]

arXiv preprint arXiv:1703.07710 , year=

Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning , author=. arXiv preprint arXiv:1703.07710 , year=

work page arXiv
[43]

Advances in Neural Information Processing Systems , volume=

Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation , author=. Advances in Neural Information Processing Systems , volume=
[44]

Conference on learning theory , pages=

Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches , author=. Conference on learning theory , pages=. 2019 , organization=

2019
[45]

Nearly minimax optimal reward-free reinforcement learning,

Nearly Minimax Optimal Reward-free Reinforcement Learning , author=. arXiv preprint arXiv:2010.05901 , year=

work page arXiv 2010
[46]

International Conference on Machine Learning , pages=

Fast active learning for pure exploration in reinforcement learning , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[47]

Algorithmic Learning Theory , pages=

Adaptive reward-free exploration , author=. Algorithmic Learning Theory , pages=. 2021 , organization=

2021
[48]

arXiv preprint arXiv:2008.07737 , year=

Provably efficient reward-agnostic navigation with linear value iteration , author=. arXiv preprint arXiv:2008.07737 , year=

work page arXiv 2008
[49]

International Conference on Machine Learning , pages=

Model-based reinforcement learning with value-targeted regression , author=. International Conference on Machine Learning , pages=. 2020 , organization=

2020
[50]

International Conference on Machine Learning , pages=

Provably efficient maximum entropy exploration , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019
[51]

International Conference on Machine Learning , pages=

Provably efficient RL with rich observations via latent state decoding , author=. International Conference on Machine Learning , pages=
[52]

International conference on machine learning , pages=

Kinematic state abstraction and provably efficient rich-observation reinforcement learning , author=. International conference on machine learning , pages=
[53]

IEEE transactions on evolutionary computation , volume=

Intrinsic motivation systems for autonomous mental development , author=. IEEE transactions on evolutionary computation , volume=. 2007 , publisher=

2007
[54]

Advances in neural information processing systems , volume=

Unifying count-based exploration and intrinsic motivation , author=. Advances in neural information processing systems , volume=
[55]

arXiv preprint arXiv:2103.10897 , year=

Bilinear classes: A structural framework for provable generalization in rl , author=. arXiv preprint arXiv:2103.10897 , year=

work page arXiv
[56]

arXiv preprint arXiv:1903.03698 , year=

Skew-fit: State-covering self-supervised reinforcement learning , author=. arXiv preprint arXiv:1903.03698 , year=

work page arXiv 1903
[57]

Eysenbach, A

Diversity is all you need: Learning skills without a reward function , author=. arXiv preprint arXiv:1802.06070 , year=

work page arXiv
[58]

Visual Reinforcement Learning with Imagined Goals

Visual reinforcement learning with imagined goals , author=. arXiv preprint arXiv:1807.04742 , year=

work page Pith review arXiv
[59]

International Conference on Machine Learning , pages=

Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings , author=. International Conference on Machine Learning , pages=. 2018 , organization=

2018
[60]

Exploration by random network distillation.arXiv preprint arXiv:1810.12894, 2018

Exploration by random network distillation , author=. arXiv preprint arXiv:1810.12894 , year=

work page arXiv
[61]

2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) , pages=

Pseudo-MDPs and factored linear action models , author=. 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) , pages=. 2014 , organization=

2014
[62]

Proceedings of the 19th international conference on World wide web , pages=

Factorizing personalized markov chains for next-basket recommendation , author=. Proceedings of the 19th international conference on World wide web , pages=
[63]

arXiv preprint arXiv:2202.00063 , year=

Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach , author=. arXiv preprint arXiv:2202.00063 , year=

work page arXiv
[64]

Kakade and Jason D

Simon Shaolei Du and Wei Hu and Sham M. Kakade and Jason D. Lee and Qi Lei , title =. 9th International Conference on Learning Representations,
[65]

Advances in Neural Information Processing Systems , volume=

Accommodating picky customers: Regret bound and exploration complexity for multi-objective reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=
[66]

CoRR , volume =

Yuan Cheng and Songtao Feng and Jing Yang and Hong Zhang and Yingbin Liang , title =. CoRR , volume =
[67]

CoRR , volume =

Jinglin Chen and Aditya Modi and Akshay Krishnamurthy and Nan Jiang and Alekh Agarwal , title =. CoRR , volume =
[68]

CoRR , volume =

Lingxiao Wang and Qi Cai and Zhuoran Yang and Zhaoran Wang , title =. CoRR , volume =
[69]

arXiv preprint arXiv:2206.14057 , year=

Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL , author=. arXiv preprint arXiv:2206.14057 , year=

work page arXiv
[70]

The Tenth International Conference on Learning Representations,

Masatoshi Uehara and Xuezhou Zhang and Wen Sun , title =. The Tenth International Conference on Learning Representations,
[71]

arXiv preprint arXiv:2206.12020 , year=

Provably efficient reinforcement learning in partially observable dynamical systems , author=. arXiv preprint arXiv:2206.12020 , year=

work page arXiv
[72]

arXiv preprint arXiv:2208.09515 , year=

Spectral Decomposition Representation for Reinforcement Learning , author=. arXiv preprint arXiv:2208.09515 , year=

work page arXiv
[73]

arXiv preprint arXiv:2207.05738 , year=

PAC Reinforcement Learning for Predictive State Representations , author=. arXiv preprint arXiv:2207.05738 , year=

work page arXiv
[74]

arXiv preprint arXiv:2205.14571 , year=

Provable Benefits of Representational Transfer in Reinforcement Learning , author=. arXiv preprint arXiv:2205.14571 , year=

work page arXiv
[75]

arXiv preprint arXiv:2106.08053 , year=

On the power of multitask representation learning in linear mdp , author=. arXiv preprint arXiv:2106.08053 , year=

work page arXiv
[76]

Lee and Simon Shaolei Du , title =

Jiaqi Yang and Wei Hu and Jason D. Lee and Simon Shaolei Du , title =. 9th International Conference on Learning Representations,
[77]

Conference on Learning Theory , pages=

Cautiously optimistic policy optimization and exploration with linear function approximation , author=. Conference on Learning Theory , pages=. 2021 , organization=

2021
[78]

arXiv preprint arXiv:2209.14997 , year=

Optimistic MLE--A Generic Model-based Algorithm for Partially Observable Sequential Decision Making , author=. arXiv preprint arXiv:2209.14997 , year=

work page arXiv
[79]

arXiv preprint arXiv:2209.14990 , year=

Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms , author=. arXiv preprint arXiv:2209.14990 , year=

work page arXiv
[80]

Rusu and Neil C

Andrei A. Rusu and Neil C. Rabinowitz and Guillaume Desjardins and Hubert Soyer and James Kirkpatrick and Koray Kavukcuoglu and Razvan Pascanu and Raia Hadsell , title =. CoRR , volume =

Showing first 80 references.