Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under τ-Mixing
Pith reviewed 2026-05-08 04:57 UTC · model grok-4.3
The pith
Deep Q-learning maintains finite-sample guarantees when replay data is treated as τ-mixing rather than independent, though with a slower rate due to an added dimensionality penalty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We extend finite-sample analyses of DQN with ReLU networks to τ-mixing data by modeling minibatches as τ-mixing sequences. This leads to risk bounds where temporal dependence induces an additional dimensionality penalty in the statistical rate, reflecting the reduced effective sample size, and we derive the corresponding sample complexity for the algorithm.
What carries the argument
τ-mixing property of the replay minibatches, which quantifies temporal dependence and allows derivation of risk bounds by adjusting independent-data analyses for the mixing coefficients.
Load-bearing premise
Minibatches sampled from replay buffers of τ-mixing trajectories continue to be τ-mixing when the trajectories and sampling mechanism satisfy the paper's conditions.
What would settle it
Measuring how the approximation error or risk of DQN scales with sample size n on a controlled Markov decision process, and checking whether the exponent matches the predicted penalty for a given mixing time τ.
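A minimal sketch of how such a check could be run, assuming risk measurements are available at several sample sizes: fit the slope on a log-log scale and compare it to the predicted exponent. The function name and the synthetic data below are illustrative, not taken from the paper.

```python
import numpy as np

def fit_rate_exponent(sample_sizes, risks):
    """Fit risk ~ C * n^(-alpha) by least squares on the log-log scale.

    Returns the estimated exponent alpha (positive means the risk decays)."""
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_r = np.log(np.asarray(risks, dtype=float))
    slope, _intercept = np.polyfit(log_n, log_r, 1)
    return -slope

# Synthetic example: a risk curve decaying exactly as n^{-0.4}.
n = np.array([1e3, 1e4, 1e5, 1e6])
risk = 2.0 * n ** -0.4
print(fit_rate_exponent(n, risk))  # ≈ 0.4
```

Comparing the fitted exponent across environments with different mixing times would then test whether the degradation tracks the predicted dimensionality penalty.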
Figures
Original abstract
Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $\tau$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $\tau$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $\tau$-mixing data. Moreover, we derive the sample complexity of DQN under $\tau$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to provide finite-sample risk bounds for Deep Q-Learning by modeling DQN minibatches drawn from replay buffers as τ-mixing sequences rather than i.i.d. It extends existing nonparametric regression analyses for fully connected ReLU networks to this dependent-data setting, derives that temporal dependence induces an extra dimensionality penalty in the convergence rate (reflecting reduced effective sample size), obtains the corresponding sample complexity, and supplies empirical evidence from Gymnasium environments that replay samples exhibit exponentially decaying correlations.
Significance. If the central modeling step is rigorously justified, the work would be significant for relaxing the unrealistic independence assumption that pervades finite-sample RL theory. It adapts standard nonparametric mixing regression results to DQN, produces explicit rate degradation, and includes empirical support for the mixing model. These elements address a practically relevant gap and yield falsifiable predictions about statistical rates under dependence.
major comments (2)
- [Abstract] Abstract (paragraph on modeling minibatches as τ-mixing): The assertion that 'this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches' is load-bearing for the entire rate-degradation claim. The manuscript must explicitly verify that uniform sampling from a replay buffer containing overlapping trajectory segments preserves a quantifiable τ-mixing rate; if overlapping segments induce non-decaying long-range dependence, the effective-sample-size adjustment and the additional dimensionality penalty in the exponent no longer follow from existing mixing regression bounds.
- [Risk bounds derivation] Section deriving the risk bounds (nonparametric regression formulation): The extension of independent-data ReLU regression bounds to τ-mixing observations must state the precise dependence of the rate exponent on the mixing coefficients; without an explicit effective-sample-size formula that accounts for the replay-buffer sampling process, it is unclear whether the claimed dimensionality penalty is the only degradation or whether slower mixing produces a qualitatively worse exponent.
minor comments (2)
- [Experiments] The empirical section should report the exact procedure used to estimate correlations (e.g., lag selection, number of trajectories) and include confidence bands on the decay plots to allow readers to assess how closely the observed decay matches the theoretical τ-mixing assumption.
- [Notation] Notation for the mixing coefficient τ and the effective sample size should be introduced once and used consistently when transitioning from the regression bounds to the DQN sample-complexity statement.
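The correlation-estimation procedure the referee asks for can be sketched in a few lines: generate a geometrically ergodic trajectory, estimate its empirical autocorrelation over a range of lags, and check that the decay is approximately exponential. The AR(1) stand-in, the estimator, and the lag range below are illustrative assumptions, not the paper's actual protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_trajectory(n, rho):
    """AR(1) chain x_{t+1} = rho * x_t + noise, a simple stand-in for a
    geometrically ergodic state trajectory (its true autocorrelation at
    lag k is rho**k, i.e. exponentially decaying)."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.standard_normal()
    return x

def autocorr(x, max_lag):
    """Empirical autocorrelation at lags 1..max_lag (biased estimator)."""
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:-k], x[k:]) / (len(x) * var)
                     for k in range(1, max_lag + 1)])

traj = ar1_trajectory(200_000, rho=0.9)
acf = autocorr(traj, max_lag=20)
# acf[k-1] should track 0.9**k; a straight line on a semilog plot.
```

Confidence bands, as requested, could be added by repeating the estimate over independent trajectories and reporting quantiles of the resulting autocorrelation curves.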
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive major comments. The feedback correctly identifies that the central modeling step and the explicit form of the rate degradation require additional clarification to fully support the claims. We address each point below and will incorporate the requested details in the revision.
Point-by-point responses
-
Referee: [Abstract] Abstract (paragraph on modeling minibatches as τ-mixing): The assertion that 'this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches' is load-bearing for the entire rate-degradation claim. The manuscript must explicitly verify that uniform sampling from a replay buffer containing overlapping trajectory segments preserves a quantifiable τ-mixing rate; if overlapping segments induce non-decaying long-range dependence, the effective-sample-size adjustment and the additional dimensionality penalty in the exponent no longer follow from existing mixing regression bounds.
Authors: We agree that the preservation of a quantifiable τ-mixing rate under replay-buffer sampling is load-bearing and that the manuscript's statement of the assumption under 'certain dependence conditions' would benefit from an explicit verification. The current text relies on the modeling choice together with the empirical observation of exponentially decaying correlations in standard Gymnasium environments. In the revision we will add a lemma (placed in the appendix) that derives the mixing coefficients for uniform sampling from a finite replay buffer. Under the maintained assumption that the underlying state-action trajectory is geometrically ergodic (hence τ-mixing with exponential decay), the lemma shows that the minibatch sequence remains τ-mixing with a rate that is a continuous function of the original decay rate and the buffer size; overlapping segments do not produce non-decaying long-range dependence. The lemma also identifies the (mild) additional conditions on buffer size and sampling frequency that guarantee the effective-sample-size adjustment continues to hold, thereby justifying the dimensionality penalty in the rate exponent. revision: yes
-
Referee: [Risk bounds derivation] Section deriving the risk bounds (nonparametric regression formulation): The extension of independent-data ReLU regression bounds to τ-mixing observations must state the precise dependence of the rate exponent on the mixing coefficients; without an explicit effective-sample-size formula that accounts for the replay-buffer sampling process, it is unclear whether the claimed dimensionality penalty is the only degradation or whether slower mixing produces a qualitatively worse exponent.
Authors: The referee correctly notes that the dependence of the rate exponent on the mixing coefficients must be stated explicitly. The derivation applies standard nonparametric regression bounds for τ-mixing sequences, in which the excess-risk rate takes the form O(n_eff^{-2β/(2β+d)}) (up to logarithmic factors), where n_eff = n / (1 + 2 ∑_{k=1}^∞ τ(k)) and τ(k) are the mixing coefficients of the minibatch sequence. The additional dimensionality penalty therefore arises solely because the reduced effective sample size enters the exponent. We acknowledge that the manuscript does not supply the explicit effective-sample-size expression specialized to the replay-buffer sampling mechanism. In the revision we will insert this formula immediately after the statement of the risk bound, together with the relation between τ(k) and the replay-buffer parameters obtained from the new lemma. The revised text will also note that if the mixing coefficients decay more slowly than the geometric rate assumed in the lemma, the exponent deteriorates further in a quantifiable way; under the stated conditions, however, the dimensionality penalty is the only degradation beyond the i.i.d. case. revision: yes
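The effective-sample-size formula quoted in the response can be made concrete under a geometric-decay assumption on the mixing coefficients. The choice τ(k) = c·γ^k below is an illustrative specialization (consistent with the geometric ergodicity assumed in the rebuttal), not a formula taken from the manuscript.

```python
def effective_sample_size(n, c, gamma):
    """n_eff = n / (1 + 2 * sum_{k>=1} tau(k)) with geometric coefficients
    tau(k) = c * gamma**k; the geometric series sums in closed form to
    c * gamma / (1 - gamma)."""
    tail = c * gamma / (1.0 - gamma)
    return n / (1.0 + 2.0 * tail)

def risk_rate(n, beta, d, c, gamma):
    """Rate n_eff^{-2*beta/(2*beta+d)} from the rebuttal's bound,
    logarithmic factors omitted."""
    n_eff = effective_sample_size(n, c, gamma)
    return n_eff ** (-2.0 * beta / (2.0 * beta + d))

# Slower mixing (gamma closer to 1) shrinks n_eff and inflates the bound:
print(effective_sample_size(1_000_000, c=1.0, gamma=0.5))  # 1e6/3  ≈ 333333.3
print(effective_sample_size(1_000_000, c=1.0, gamma=0.9))  # 1e6/19 ≈ 52631.6
```

This makes the claimed behavior explicit: for any fixed γ < 1 the series is summable, so the degradation enters only through a constant shrinkage of n, whereas coefficients decaying slower than geometrically would make the tail sum grow and worsen the exponent itself.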
Circularity Check
No circularity: bounds extend external nonparametric results for τ-mixing sequences
full rationale
The derivation models DQN minibatches as τ-mixing under explicit conditions on trajectories and sampling, then applies standard finite-sample risk bounds from nonparametric regression theory for dependent data. These bounds are obtained by adjusting effective sample size in existing mixing-process results rather than by re-deriving or fitting quantities internal to the paper. No step reduces a claimed prediction to a fitted parameter by construction, invokes a self-citation as the sole justification for a uniqueness or ansatz claim, or renames an empirical pattern as a new derivation. The central rate-degradation statement follows directly from the external mixing bounds once the modeling assumption is granted.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Replay minibatches remain τ-mixing
- standard math: ReLU networks admit standard nonparametric rates