
arxiv: 2606.20356 · v1 · pith:TUUW3H6Mnew · submitted 2026-06-18 · 🧮 math.OC · cs.AI · cs.LG · math.PR · stat.ML

Robust Q-learning for mean-field control under Wasserstein uncertainty in common noise

Pith reviewed 2026-06-26 15:56 UTC · model grok-4.3

classification 🧮 math.OC · cs.AI · cs.LG · math.PR · stat.ML
keywords robust Q-learning · mean-field control · Wasserstein uncertainty · common noise · quantization-projection · convergence bounds · systemic risk · epidemic models

The pith

A robust Q-learning algorithm converges for discrete-time mean-field control problems when the common noise law lies in a known Wasserstein ball.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a robust Q-learning method for mean-field control where many agents interact and share an uncertain common noise whose distribution belongs to a Wasserstein ball around a nominal law. It combines a quantization-and-projection scheme that discretizes the state-action space with a dual reformulation that turns the worst-case expectation over the uncertainty set into a tractable optimization. The authors prove that both the synchronous and asynchronous versions of the algorithm converge to the optimal robust Q-function and supply explicit finite-time iteration bounds. This matters for applications such as systemic risk and epidemic control because it lets a learner find policies that remain effective even when the true common noise differs from the assumed law within the ball. Numerical tests illustrate the robustness-performance tradeoff and the observed convergence speed of the asynchronous scheme.

Core claim

The central claim is that a robust Q-learning algorithm solves discrete-time mean-field control problems under Wasserstein uncertainty in the common noise law. The algorithm uses a quantization-and-projection scheme on the state-action space together with a Wasserstein dual reformulation on the common-noise space. Convergence and finite-time bounds are proved for both synchronous and asynchronous updates.
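
For orientation, the dual reformulation can be written in the standard Wasserstein duality form. This is a sketch under assumptions not stated on this page: the robust objective is taken to be a worst-case (infimum) expected reward, the ball has order p and radius ρ around a nominal law on a common-noise space with metric d, and f stands for any bounded continuation value. The symbols \hat{\mathbb{P}}, E^{0}, d, p, f, and λ are illustrative and may differ from the paper's notation; the identity holds on finite spaces and, more generally, under suitable regularity conditions.

```latex
\inf_{\mathbb{Q}\,:\,W_p(\mathbb{Q},\hat{\mathbb{P}})\le\rho}
  \mathbb{E}_{\mathbb{Q}}\big[f(\varepsilon)\big]
\;=\;
\sup_{\lambda\ge 0}\Big\{-\lambda\rho^{p}
  + \mathbb{E}_{\hat{\mathbb{P}}}\Big[\inf_{z\in E^{0}}
      \big(f(z)+\lambda\,d(\varepsilon,z)^{p}\big)\Big]\Big\}
```

The minimization over laws in the ball is replaced by a one-dimensional maximization over the multiplier λ, which is what makes the worst-case expectation tractable inside a Bellman update.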

What carries the argument

The argument rests on a robust Q-learning algorithm that combines a quantization-and-projection scheme with a Wasserstein dual reformulation on the common-noise space, so that the robust Bellman operator remains a contraction.
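
To make the dual step concrete, the following minimal sketch evaluates a worst-case expectation over a Wasserstein-1 ball on a finite noise space. It is not the paper's implementation: the function name `worst_case_mean`, the choice of order 1 with distance |x - y|, and the ternary search over the multiplier are all assumptions made for illustration.

```python
def worst_case_mean(values, probs, points, rho, iters=200):
    """Approximate inf of E_Q[values] over laws Q with W_1(Q, probs) <= rho.

    Uses the dual sup_{lam >= 0} { -lam*rho + sum_i probs[i] *
    min_j (values[j] + lam * |points[i] - points[j]|) } on a finite space.
    The points are assumed to be distinct real numbers.
    """
    n = len(points)

    def dual(lam):
        total = -lam * rho
        for i in range(n):
            total += probs[i] * min(
                values[j] + lam * abs(points[i] - points[j]) for j in range(n)
            )
        return total

    # Above the Lipschitz constant of `values` the inner min is attained at
    # j = i, so the concave dual objective is non-increasing from there on.
    lip = max(
        (abs(values[i] - values[j]) / abs(points[i] - points[j])
         for i in range(n) for j in range(n) if i != j),
        default=0.0,
    )
    lo, hi = 0.0, lip
    for _ in range(iters):  # ternary search on a concave function of lam
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    # Every dual value is a lower bound on the primal, so taking the max is safe.
    return max(dual(0.0), dual(lip), dual(0.5 * (lo + hi)))
```

With radius 0 the function returns the nominal mean, and once the radius exceeds the diameter of the noise space it returns the smallest value, which is the mechanism behind the robustness-performance tradeoff: a larger radius can only lower the guaranteed value.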

If this is right

  • Both synchronous and asynchronous schemes converge to the optimal robust Q-function.
  • Explicit finite-time iteration bounds hold for the learning process.
  • The method applies to systemic risk and epidemic models under common-noise misspecification.
  • A robustness-performance tradeoff appears when the radius of the Wasserstein ball is varied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quantization-plus-dual approach could be tested on mean-field games rather than control if the fixed-point structure is preserved.
  • If the Wasserstein radius must be estimated from data, the finite-time bounds would need an extra error term for radius estimation.
  • The discretization error from quantization could be traded against sample complexity in high-dimensional state spaces.

Load-bearing premise

The common-noise distribution belongs to a Wasserstein ball of known radius around a nominal law, and the quantization-projection step produces a sufficiently accurate finite approximation so that the robust Bellman operator remains a contraction.
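
As an illustration of what a quantization-and-projection step can look like, the sketch below maps a probability vector on a finite state space to a grid of the simplex with mesh 1/n, using largest-remainder rounding. Whether the paper uses this particular projection is not stated here; the function name `project_to_grid` and the rounding rule are assumptions.

```python
import math


def project_to_grid(mu, n):
    """Project a probability vector onto {k/n : k integer, entries sum to 1}.

    Largest-remainder rounding: floor every scaled entry, then hand the
    leftover units to the entries with the largest fractional parts.
    """
    scaled = [m * n for m in mu]
    counts = [math.floor(s) for s in scaled]
    leftover = n - sum(counts)
    by_fraction = sorted(
        range(len(mu)), key=lambda i: scaled[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return [c / n for c in counts]
```

Each coordinate moves by less than 1/n under this rule, so the mesh size directly controls the projection error that the contraction argument has to absorb.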

What would settle it

Run the asynchronous Q-learning scheme on a low-dimensional mean-field control test problem with a known optimal robust value and check whether the iterates reach the claimed finite-time error bound before the predicted number of steps.
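
Such a check needs a reference fixed point to compare against. The sketch below computes one by idealized robust Bellman iteration on a toy finite problem. Everything in it is hypothetical and chosen only for illustration: the abstract states, actions, noise points, nominal weights, reward, transition rule, and discount 0.9 are not taken from the paper, and the iteration is the exact (model-based) operator rather than the paper's asynchronous Algorithm 1.

```python
GAMMA = 0.9                     # hypothetical discount factor
STATES = [0, 1, 2]              # hypothetical abstract grid states
ACTIONS = [-1, 0, 1]
NOISE = [-1, 0, 1]              # hypothetical common-noise points
NOMINAL = [0.25, 0.5, 0.25]     # hypothetical nominal law on NOISE


def worst_case_mean(values, probs, points, rho, iters=50):
    """Dual evaluation of inf E_Q[values] over the W_1 ball of radius rho."""
    n = len(points)

    def dual(lam):
        return -lam * rho + sum(
            probs[i] * min(values[j] + lam * abs(points[i] - points[j])
                           for j in range(n))
            for i in range(n)
        )

    lip = max(abs(values[i] - values[j]) / abs(points[i] - points[j])
              for i in range(n) for j in range(n) if i != j)
    lo, hi = 0.0, lip
    for _ in range(iters):  # ternary search on the concave dual objective
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    return max(dual(0.0), dual(lip), dual(0.5 * (lo + hi)))


def step(s, a, e):
    """Toy transition: move by action plus noise, clipped to the grid."""
    return min(2, max(0, s + a + e))


def reward(s, a):
    """Toy reward: higher states pay more, acting has a cost."""
    return float(s) - 0.5 * abs(a)


def robust_bellman(q, rho):
    """One application of the robust Bellman operator to a Q-table."""
    new_q = {}
    for s in STATES:
        for a in ACTIONS:
            cont = [max(q[(step(s, a, e), b)] for b in ACTIONS) for e in NOISE]
            new_q[(s, a)] = reward(s, a) + GAMMA * worst_case_mean(
                cont, NOMINAL, NOISE, rho)
    return new_q


def solve(rho, n_iter=100):
    """Iterate from zero; return the Q-table and successive sup-norm gaps."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    gaps = []
    for _ in range(n_iter):
        new_q = robust_bellman(q, rho)
        gaps.append(max(abs(new_q[k] - q[k]) for k in q))
        q = new_q
    return q, gaps
```

Calling `solve(0.5)` returns a Q-table together with the successive gaps, which should shrink by at least the factor `GAMMA` per iteration if the operator is a contraction. An asynchronous learner run on the same toy problem could then be scored against this table.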

Figures

Figures reproduced from arXiv: 2606.20356 by Ariel Neufeld, Kyunghyun Park, Mathieu Laurière.

Figure 1: Asynchronous Q-function convergence. Error ET (m) against the idealized finite-grid fixed point for selected robustness radii.
    3.2. Systemic Risk. This example is a stylized finite-state model of a population of financial institutions whose capital levels are affected by individual controls and aggregate shocks. The individual state, action, and common-noise spaces are Ssys = {0, 1, 2}, A = {−1, 0, 1}, E0 …
Figure 2: Systemic Risk robustness profile. Solid curves show reward means from the asynchronous implementation of Algorithm 1; dashed curves show same-grid idealized references. Moderate robustness substantially improves performance under adverse common shocks.
Figure 3: SIS robustness profile. Solid curves show reward means from the asynchronous implementation of Algorithm 1; dashed curves show idealized references. The algorithm closely tracks the finite-grid idealized benchmark.
    3.3.2. SEIR. The individual state space is SSEIR = {S,E,I, R}, where the two new states are interpreted as Exposed and Recovered. In the Exposed state, the agent is not yet infectious. The commo…
Figure 4: SEIR robustness profile. Solid curves show reward means from the asynchronous implementation of Algorithm 1; dashed curves show idealized references. Even at the high discount β = 0.9, the robust Q-learning algorithm matches the finite-grid idealized benchmark.

Original abstract

In this article, we present a robust $Q$-learning algorithm for discrete-time mean-field control problems under Wasserstein uncertainty in the common noise law. The algorithm combines a quantization-and-projection scheme with a Wasserstein dual reformulation on the common-noise space. We establish its convergence together with finite-time iteration bounds for both synchronous and asynchronous learning schemes. Numerical experiments on systemic risk and epidemic models compare the asynchronous implementation with an idealized Bellman iteration, illustrate the robustness-performance tradeoff under common-noise misspecification, and report the observed convergence behavior of the asynchronous $Q$-learning algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a robust Q-learning algorithm for discrete-time mean-field control under Wasserstein uncertainty in the common noise law. The method combines a quantization-and-projection scheme with a Wasserstein dual reformulation on the common-noise space, and claims to establish convergence together with finite-time iteration bounds for both synchronous and asynchronous schemes. Numerical experiments on systemic risk and epidemic models are used to illustrate robustness-performance tradeoffs and observed convergence behavior.

Significance. If the claimed convergence and finite-time bounds hold after accounting for approximation errors, the work would supply a practical, theoretically supported approach to robust mean-field control under distributional uncertainty in common noise, with direct relevance to applications such as systemic risk and epidemic modeling.

major comments (2)
  1. [§3.2 and Theorem 4.1] §3.2 (Quantization-and-projection step) and Theorem 4.1: The central claim that the approximated robust Bellman operator remains a contraction (with modulus strictly less than 1) after quantization and projection is load-bearing for both convergence and the finite-time bounds. No explicit quantitative relation is given between quantization mesh size, discount factor, and Wasserstein radius that guarantees the contraction constant stays below 1 uniformly over the uncertainty ball.
  2. [Theorem 4.2] Asynchronous scheme (Theorem 4.2): The error recursion used to derive the finite-time bound does not visibly incorporate the additional projection error term arising from the quantization step under the Wasserstein uncertainty; without this control, the stated iteration complexity may fail to hold for positive radius values.
minor comments (2)
  1. [Preliminaries] The description of the Wasserstein dual reformulation in the preliminaries would benefit from an explicit statement of the dual variables and their dependence on the common-noise space.
  2. [Numerical experiments] Figure captions for the numerical experiments should specify the number of independent runs and any variability measures used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below and indicate the revisions that will be made to the manuscript.

Point-by-point responses
  1. Referee: [§3.2 and Theorem 4.1] §3.2 (Quantization-and-projection step) and Theorem 4.1: The central claim that the approximated robust Bellman operator remains a contraction (with modulus strictly less than 1) after quantization and projection is load-bearing for both convergence and the finite-time bounds. No explicit quantitative relation is given between quantization mesh size, discount factor, and Wasserstein radius that guarantees the contraction constant stays below 1 uniformly over the uncertainty ball.

    Authors: The referee is correct that the manuscript does not supply an explicit quantitative threshold relating mesh size δ, discount factor γ, and radius ρ that keeps the contraction modulus strictly below 1 uniformly over the ball. The proof of Theorem 4.1 establishes the contraction for sufficiently fine quantization via continuity of the Wasserstein dual, but leaves the dependence implicit. In the revised version we will add a remark after Theorem 4.1 that derives the explicit condition δ < (1-γ-ε(ρ))/L (with L the Lipschitz constant of the running cost) guaranteeing the modulus ≤ γ+ε(ρ) < 1.
    Revision: yes

  2. Referee: [Theorem 4.2] Asynchronous scheme (Theorem 4.2): The error recursion used to derive the finite-time bound does not visibly incorporate the additional projection error term arising from the quantization step under the Wasserstein uncertainty; without this control, the stated iteration complexity may fail to hold for positive radius values.

    Authors: We agree that the error recursion in the proof of Theorem 4.2 does not explicitly include the additional projection error that arises from the quantization step when the uncertainty radius is positive. This term is bounded by a multiple of the mesh size but must be carried through the recursion. In the revised manuscript we will augment the recursion with this O(δ) term and derive the corresponding adjusted iteration complexity that remains valid for ρ > 0.
    Revision: yes
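
A simplified, deterministic version of such an augmented recursion shows how an O(δ) term propagates. This is an illustrative sketch, not the authors' revised proof: e_k denotes a hypothetical sup-norm error after k iterations, γ the contraction modulus, and C a hypothetical constant multiplying the mesh size δ. The stochastic terms of the asynchronous analysis are omitted.

```latex
e_{k+1} \le \gamma\, e_k + C\delta
\quad\Longrightarrow\quad
e_k \le \gamma^{k} e_0 + C\delta \sum_{j=0}^{k-1}\gamma^{j}
    = \gamma^{k} e_0 + C\delta\,\frac{1-\gamma^{k}}{1-\gamma}
    \le \gamma^{k} e_0 + \frac{C\delta}{1-\gamma}
```

Under these assumptions the projection error does not vanish with more iterations. It leaves a floor of order δ/(1-γ), so an iteration-complexity statement for a target accuracy η would also need δ small enough that Cδ/(1-γ) < η.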

Circularity Check

0 steps flagged

No circularity: derivation rests on standard contraction and duality arguments

Full rationale

The paper's central claims concern convergence and finite-time bounds for a robust Q-learning algorithm that combines quantization-projection with a Wasserstein dual reformulation of the common-noise uncertainty. These rest on the robust Bellman operator being a contraction (under the stated Wasserstein-ball assumption and sufficiently accurate quantization) together with standard fixed-point and stochastic-approximation arguments. No quoted equation or step reduces a claimed prediction or uniqueness result to a fitted parameter, a self-citation chain, or a redefinition of the target quantity; the assumptions and duality are treated as external mathematical facts rather than constructed from the algorithm's outputs. This is the normal non-circular case for a convergence proof in approximate dynamic programming.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central construction rests on the modeling choice that uncertainty is captured by a Wasserstein ball and that the quantization step preserves the contraction property of the robust Bellman operator.

axioms (2)
  • Domain assumption: The common noise law lies inside a Wasserstein ball of finite radius around a known nominal distribution.
    This defines the robust objective and is invoked to justify the dual reformulation.
  • Domain assumption: The quantization-and-projection operator yields a sufficiently accurate finite-state approximation for the robust value iteration to remain contractive.
    Required for the finite-time bounds to hold in the discrete setting.

pith-pipeline@v0.9.1-grok · 5639 in / 1358 out tokens · 24743 ms · 2026-06-26T15:56:57.955417+00:00 · methodology

