Robust $Q$-learning for mean-field control under Wasserstein uncertainty in common noise

Ariel Neufeld; Kyunghyun Park; Mathieu Laurière

arXiv: 2606.20356 · v1 · pith:TUUW3H6Mnew · submitted 2026-06-18 · 🧮 math.OC · cs.AI · cs.LG · math.PR · stat.ML

Pith reviewed 2026-06-26 15:56 UTC · model grok-4.3

keywords: robust Q-learning, mean-field control, Wasserstein uncertainty, common noise, quantization-projection, convergence bounds, systemic risk, epidemic models

The pith

A robust Q-learning algorithm converges for discrete-time mean-field control problems when the common noise law lies in a known Wasserstein ball.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a robust Q-learning method for mean-field control where many agents interact and share an uncertain common noise whose distribution belongs to a Wasserstein ball around a nominal law. It combines a quantization-and-projection scheme that discretizes the state-action space with a dual reformulation that turns the worst-case expectation over the uncertainty set into a tractable optimization. The authors prove that both the synchronous and asynchronous versions of the algorithm converge to the optimal robust Q-function and supply explicit finite-time iteration bounds. This matters for applications such as systemic risk and epidemic control because it lets a learner find policies that remain effective even when the true common noise differs from the assumed law within the ball. Numerical tests illustrate the robustness-performance tradeoff and the observed convergence speed of the asynchronous scheme.

Core claim

The central claim is that a robust Q-learning algorithm solves discrete-time mean-field control problems under Wasserstein uncertainty in the common noise law. The algorithm uses a quantization-and-projection scheme on the state-action space together with a Wasserstein dual reformulation on the common-noise space. Convergence and finite-time bounds are proved for both synchronous and asynchronous updates.
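
To make the dual reformulation named above concrete, here is a minimal Python sketch on a finite common-noise space. It assumes an order-1 Wasserstein ball and a ground-distance matrix `dist` that is zero on the diagonal. The function name `worst_case_expectation`, the three-atom noise space, and all numbers are hypothetical illustrations, not taken from the paper, whose exact formulation may differ.

```python
import numpy as np

def worst_case_expectation(f, p0, dist, rho):
    """Smallest expectation of f over the order-1 Wasserstein ball of radius rho
    around the nominal law p0 on a finite space (dist[i, i] must be 0).

    Evaluates the dual
        sup_{lam >= 0} { -lam * rho + sum_i p0[i] * min_j (f[j] + lam * dist[i, j]) }.
    """
    f, p0, dist = (np.asarray(x, dtype=float) for x in (f, p0, dist))

    def dual(lam):
        # inner minimisation over target atoms j, one value per source atom i
        return -lam * rho + p0 @ np.min(f[None, :] + lam * dist, axis=1)

    # The dual objective is concave and piecewise linear in lam, so its maximum
    # over lam >= 0 is attained at lam = 0 or at a kink (pairwise line crossing).
    candidates = [0.0]
    n = len(f)
    for i in range(n):
        for j in range(n):
            for k in range(j + 1, n):
                slope_gap = dist[i, j] - dist[i, k]
                if slope_gap != 0.0:
                    lam = (f[k] - f[j]) / slope_gap
                    if lam > 0.0:
                        candidates.append(lam)
    return max(dual(lam) for lam in candidates)


# Hypothetical toy data: three noise atoms on a line, payoff f, nominal law p0.
atoms = np.array([0.0, 1.0, 2.0])
dist = np.abs(atoms[:, None] - atoms[None, :])
f = np.array([1.0, 0.0, 2.0])
p0 = np.array([0.2, 0.5, 0.3])
values = {rho: worst_case_expectation(f, p0, dist, rho) for rho in (0.0, 0.3, 10.0)}
# radius 0.0 gives the nominal mean 0.8, radius 0.3 gives 0.2, radius 10.0 gives min(f) = 0.0
```

On this toy input the worst case equals the nominal mean at radius zero and drops to the minimum of `f` once the radius is large enough, which is the mechanism behind the robustness-performance tradeoff discussed in this review.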

What carries the argument

The argument rests on a robust Q-learning algorithm that pairs a quantization-and-projection scheme on the state-action space with a Wasserstein dual reformulation on the common-noise space, so that the robust Bellman operator stays a contraction.

If this is right

Both synchronous and asynchronous schemes converge to the optimal robust Q-function.
Explicit finite-time iteration bounds hold for the learning process.
The method applies to systemic risk and epidemic models under common-noise misspecification.
A robustness-performance tradeoff appears when the radius of the Wasserstein ball is varied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same quantization-plus-dual approach could be tested on mean-field games rather than control if the fixed-point structure is preserved.
If the Wasserstein radius must be estimated from data, the finite-time bounds would need an extra error term for radius estimation.
The discretization error from quantization could be traded against sample complexity in high-dimensional state spaces.

Load-bearing premise

The common-noise distribution belongs to a Wasserstein ball of known radius around a nominal law, and the quantization-projection step produces a sufficiently accurate finite approximation so that the robust Bellman operator remains a contraction.
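
As a rough illustration of what a quantization step on the space of population distributions can look like, the sketch below rounds a probability vector onto the simplex grid with mesh `1/K` using largest-remainder rounding. This is an assumed stand-in chosen for simplicity; the paper's actual quantization-and-projection operator is not specified in this review, and the name `round_to_simplex_grid` is hypothetical.

```python
import numpy as np

def round_to_simplex_grid(mu, K):
    """Map a probability vector mu to a point of the grid
    { k / K : k a nonnegative integer vector summing to K }
    by largest-remainder rounding (assumed scheme, for illustration only)."""
    mu = np.asarray(mu, dtype=float)
    scaled = mu * K
    counts = np.floor(scaled).astype(int)
    shortfall = K - counts.sum()              # grid units still to hand out
    # give the remaining units to the coordinates with the largest fractional parts
    order = np.argsort(-(scaled - counts), kind="stable")
    counts[order[:shortfall]] += 1
    return counts / K


mu = np.array([0.37, 0.41, 0.22])             # hypothetical population distribution
mu_hat = round_to_simplex_grid(mu, K=10)      # array([0.4, 0.4, 0.2])
```

Each coordinate moves by less than `1/K`, so the mesh size directly controls the approximation error that the load-bearing premise asks to be small.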

What would settle it

Run the asynchronous Q-learning scheme on a low-dimensional mean-field control test problem with a known optimal robust value, and check whether the iterates fall within the claimed finite-time error bound by the predicted number of steps.
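
A heavily simplified version of such a check can be sketched in Python. Everything here is an assumption made for illustration: the two-action toy dynamics, the reward, the grid size `K`, the radius `rho`, and the step size `alpha` are invented, the nominal noise law is treated as known so the robust targets are exact (no sampling noise), and entries are visited in a fixed cyclic order. The per-sweep rate `1 - alpha * (1 - beta)` is the elementary bound for this noiseless toy, not the paper's finite-time bound.

```python
import itertools
import numpy as np

K, beta, rho, alpha = 10, 0.9, 0.5, 0.5       # hypothetical toy settings
grid = np.arange(K + 1) / K                   # quantized mean-field states m
actions = (0, 1)
noise = np.array([-1.0, 0.0, 1.0])            # finite common-noise space
p0 = np.array([0.25, 0.5, 0.25])              # nominal common-noise law
dist = np.abs(noise[:, None] - noise[None, :])

def worst_case(f):
    """Smallest expectation of f over the W1-ball of radius rho around p0 (dual form)."""
    n = len(f)
    lams = [0.0]
    for i, j, k in itertools.product(range(n), repeat=3):
        if dist[i, j] != dist[i, k]:          # kink of the piecewise-linear dual
            lams.append((f[k] - f[j]) / (dist[i, j] - dist[i, k]))
    return max(-lam * rho + p0 @ np.min(f + lam * dist, axis=1)
               for lam in lams if lam >= 0.0)

def step(s, a, e):
    """Toy dynamics, then rounding back onto the grid (returns next state index)."""
    m_next = np.clip(grid[s] + 0.1 * e - 0.2 * a * grid[s] + 0.02, 0.0, 1.0)
    return int(round(m_next * K))

def target(Q, s, a):
    """Robust Bellman target: reward plus discounted worst-case continuation."""
    cont = np.array([Q[step(s, a, e)].max() for e in noise])
    return -grid[s] - 0.1 * a + beta * worst_case(cont)

pairs = [(s, a) for s in range(K + 1) for a in actions]

# Idealized Bellman iteration: synchronous sweeps until numerically stationary.
Q_star = np.zeros((K + 1, len(actions)))
while True:
    Q_new = np.array([[target(Q_star, s, a) for a in actions] for s in range(K + 1)])
    done = np.max(np.abs(Q_new - Q_star)) < 1e-12
    Q_star = Q_new
    if done:
        break

# Asynchronous relaxation: one (state, action) entry is updated at a time, in place.
Q = np.zeros_like(Q_star)
errors = [np.max(np.abs(Q - Q_star))]
n_sweeps = 200
for _ in range(n_sweeps):
    for s, a in pairs:
        Q[s, a] += alpha * (target(Q, s, a) - Q[s, a])
    errors.append(np.max(np.abs(Q - Q_star)))

rate = 1.0 - alpha * (1.0 - beta)             # per-sweep bound for this noiseless toy
```

Comparing `errors[-1]` with `rate ** n_sweeps * errors[0]` is the toy analogue of checking a finite-time bound against the same-grid fixed point. A faithful test of the paper would replace the exact targets with sampled common-noise transitions and use the bound stated in its theorems.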

Figures

Figures reproduced from arXiv: 2606.20356 by Ariel Neufeld, Kyunghyun Park, Mathieu Laurière.

**Figure 1.** Asynchronous Q-function convergence. Error $E_T(m)$ against the idealized finite-grid fixed point for selected robustness radii.

From Section 3.2 (Systemic Risk): This example is a stylized finite-state model of a population of financial institutions whose capital levels are affected by individual controls and aggregate shocks. The individual state, action, and common-noise spaces are $S_{\mathrm{sys}} = \{0, 1, 2\}$, $A = \{-1, 0, 1\}$, E0 …

**Figure 2.** Systemic Risk robustness profile. Solid curves show reward means from the asynchronous implementation of Algorithm 1; dashed curves show same-grid idealized references. Moderate robustness substantially improves performance under adverse common shocks.

**Figure 3.** SIS robustness profile. Solid curves show reward means from the asynchronous implementation of Algorithm 1; dashed curves show idealized references. The algorithm closely tracks the finite-grid idealized benchmark.

From Section 3.3.2 (SEIR): The individual state space is $S_{\mathrm{SEIR}} = \{S, E, I, R\}$, where the two new states are interpreted as Exposed and Recovered. In the Exposed state, the agent is not yet infectious. The commo…

**Figure 4.** SEIR robustness profile. Solid curves show reward means from the asynchronous implementation of Algorithm 1; dashed curves show idealized references. Even at the high discount β = 0.9, the robust Q-learning algorithm matches the finite-grid idealized benchmark.

Original abstract

In this article, we present a robust $Q$-learning algorithm for discrete-time mean-field control problems under Wasserstein uncertainty in the common noise law. The algorithm combines a quantization-and-projection scheme with a Wasserstein dual reformulation on the common-noise space. We establish its convergence together with finite-time iteration bounds for both synchronous and asynchronous learning schemes. Numerical experiments on systemic risk and epidemic models compare the asynchronous implementation with an idealized Bellman iteration, illustrate the robustness-performance tradeoff under common-noise misspecification, and report the observed convergence behavior of the asynchronous $Q$-learning algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete robust Q-learning algorithm for mean-field control with Wasserstein uncertainty on common noise by pairing quantization-projection with a dual reformulation, but the key open question is whether the approximation step keeps the robust operator contractive.

The core of this paper is a robust Q-learning algorithm for discrete-time mean-field control problems where the common noise law sits inside a Wasserstein ball of known radius around a nominal distribution. The authors use a quantization-and-projection scheme on the state-action space together with a Wasserstein dual reformulation of the worst-case expectation on the common-noise space, and they prove convergence with finite-time iteration bounds for both the synchronous and asynchronous cases.

The new element is the specific combination inside the Q-learning loop for problems with common noise uncertainty. Prior work on robust mean-field control or on Q-learning for mean-field games exists, but this pairing of quantization with the dual reformulation for the uncertainty set appears fresh. The numerical experiments on systemic risk and epidemic models are useful; they compare the asynchronous learner against an idealized Bellman iteration and show how performance changes under misspecification of the common noise.

The soft spot is the handling of the approximation error. The convergence argument requires that the quantized robust Bellman operator remain a contraction. This presumes that the quantization-projection step produces a sufficiently accurate finite approximation, but the abstract gives no quantitative control on how the mesh size interacts with the discount factor and the radius of the Wasserstein ball. If that error pushes the contraction modulus to one or above, both the convergence claim and the finite-time bounds would fail. This is the link that needs the tightest checking in the full proofs.

Overall the paper targets people working on robust control in large-population systems with shared noise, such as in finance or epidemiology. The algorithmic idea is concrete enough that a reader in that area could try to implement it. I think it deserves a serious referee because the problem setting is practical and the method is spelled out, even though the error analysis will probably require some work.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a robust Q-learning algorithm for discrete-time mean-field control under Wasserstein uncertainty in the common noise law. The method combines a quantization-and-projection scheme with a Wasserstein dual reformulation on the common-noise space, and claims to establish convergence together with finite-time iteration bounds for both synchronous and asynchronous schemes. Numerical experiments on systemic risk and epidemic models are used to illustrate robustness-performance tradeoffs and observed convergence behavior.

Significance. If the claimed convergence and finite-time bounds hold after accounting for approximation errors, the work would supply a practical, theoretically supported approach to robust mean-field control under distributional uncertainty in common noise, with direct relevance to applications such as systemic risk and epidemic modeling.

major comments (2)

[§3.2 and Theorem 4.1] §3.2 (Quantization-and-projection step) and Theorem 4.1: The central claim that the approximated robust Bellman operator remains a contraction (with modulus strictly less than 1) after quantization and projection is load-bearing for both convergence and the finite-time bounds. No explicit quantitative relation is given between quantization mesh size, discount factor, and Wasserstein radius that guarantees the contraction constant stays below 1 uniformly over the uncertainty ball.
[Theorem 4.2] Asynchronous scheme (Theorem 4.2): The error recursion used to derive the finite-time bound does not visibly incorporate the additional projection error term arising from the quantization step under the Wasserstein uncertainty; without this control, the stated iteration complexity may fail to hold for positive radius values.

minor comments (2)

[Preliminaries] The description of the Wasserstein dual reformulation in the preliminaries would benefit from an explicit statement of the dual variables and their dependence on the common-noise space.
[Numerical experiments] Figure captions for the numerical experiments should specify the number of independent runs and any variability measures used.
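
For orientation, the standard Wasserstein duality from the distributionally robust optimization literature has the following shape. This is a sketch under assumed notation, not the paper's statement: $\mathbb{P}_0$ is the nominal common-noise law, $\mathcal{E}$ the common-noise space with metric $d$, $\rho > 0$ the radius, $p \ge 1$ the Wasserstein order, and $f$ a bounded function on a compact (for example finite) $\mathcal{E}$.

```latex
\inf_{\mathbb{P} \,:\, W_p(\mathbb{P}, \mathbb{P}_0) \le \rho}
    \mathbb{E}_{\mathbb{P}}\big[ f(\varepsilon) \big]
\;=\;
\sup_{\lambda \ge 0} \Big\{ -\lambda \rho^{p}
    + \mathbb{E}_{\mathbb{P}_0}\Big[ \inf_{y \in \mathcal{E}}
        \big( f(y) + \lambda \, d(\varepsilon, y)^{p} \big) \Big] \Big\}
```

The single dual variable is the scalar $\lambda$, the price of transporting noise mass. The dependence on the common-noise space enters only through the inner infimum over $y \in \mathcal{E}$, which is a finite minimum when $\mathcal{E}$ is finite.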

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below and indicate the revisions that will be made to the manuscript.

Referee: [§3.2 and Theorem 4.1] §3.2 (Quantization-and-projection step) and Theorem 4.1: The central claim that the approximated robust Bellman operator remains a contraction (with modulus strictly less than 1) after quantization and projection is load-bearing for both convergence and the finite-time bounds. No explicit quantitative relation is given between quantization mesh size, discount factor, and Wasserstein radius that guarantees the contraction constant stays below 1 uniformly over the uncertainty ball.

Authors: The referee is correct that the manuscript does not supply an explicit quantitative threshold relating mesh size δ, discount factor γ, and radius ρ that keeps the contraction modulus strictly below 1 uniformly over the ball. The proof of Theorem 4.1 establishes the contraction for sufficiently fine quantization via continuity of the Wasserstein dual, but leaves the dependence implicit. In the revised version we will add a remark after Theorem 4.1 that derives the explicit condition δ < (1 − γ − ε(ρ))/L (with L the Lipschitz constant of the running cost) guaranteeing the modulus ≤ γ + ε(ρ) < 1.
revision: yes

Referee: [Theorem 4.2] Asynchronous scheme (Theorem 4.2): The error recursion used to derive the finite-time bound does not visibly incorporate the additional projection error term arising from the quantization step under the Wasserstein uncertainty; without this control, the stated iteration complexity may fail to hold for positive radius values.

Authors: We agree that the error recursion in the proof of Theorem 4.2 does not explicitly include the additional projection error that arises from the quantization step when the uncertainty radius is positive. This term is bounded by a multiple of the mesh size but must be carried through the recursion. In the revised manuscript we will augment the recursion with this O(δ) term and derive the corresponding adjusted iteration complexity that remains valid for ρ > 0.
revision: yes
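
The generic mechanics behind both responses can be written out for the simplest, synchronous and noise-free case. This is a textbook-style sketch under assumptions that are not confirmed by the paper: $T$ is the exact robust operator of the form reward plus $\gamma$ times a worst-case expectation of $\max_{a'} Q$ over the ball $\mathcal{B}_\rho$, $Q^*$ is its fixed point, $\widehat{T}$ is the quantized-projected operator with $\|\widehat{T} Q - T Q\|_\infty \le c\,\delta$ for a hypothetical constant $c$, and the iterates are $Q_{k+1} = \widehat{T} Q_k$.

```latex
\begin{align*}
\Big| \inf_{\mathbb{P} \in \mathcal{B}_\rho} \mathbb{E}_{\mathbb{P}}[g_1]
    - \inf_{\mathbb{P} \in \mathcal{B}_\rho} \mathbb{E}_{\mathbb{P}}[g_2] \Big|
  &\le \|g_1 - g_2\|_\infty
  && \text{(the worst-case expectation is nonexpansive)} \\
\|T Q_1 - T Q_2\|_\infty
  &\le \gamma \, \|Q_1 - Q_2\|_\infty
  && \text{(so $T$ is a $\gamma$-contraction)} \\
\|Q_{k+1} - Q^*\|_\infty
  &\le \|\widehat{T} Q_k - T Q_k\|_\infty + \|T Q_k - T Q^*\|_\infty
   \le c\,\delta + \gamma \, \|Q_k - Q^*\|_\infty \\
\|Q_k - Q^*\|_\infty
  &\le \gamma^{k} \, \|Q_0 - Q^*\|_\infty + \frac{c\,\delta}{1 - \gamma}
\end{align*}
```

In this reading the mesh size contributes an additive floor $c\,\delta / (1 - \gamma)$ rather than changing the contraction modulus. Whether the paper's proofs route the quantization error this way, or through a modulus of the form γ + ε(ρ) as in the first response, cannot be determined from the abstract alone.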

Circularity Check

0 steps flagged

No circularity: derivation rests on standard contraction and duality arguments

The paper's central claims concern convergence and finite-time bounds for a robust Q-learning algorithm that combines quantization-projection with a Wasserstein dual reformulation of the common-noise uncertainty. These rest on the robust Bellman operator being a contraction (under the stated Wasserstein-ball assumption and sufficiently accurate quantization) together with standard fixed-point and stochastic-approximation arguments. No quoted equation or step reduces a claimed prediction or uniqueness result to a fitted parameter, a self-citation chain, or a redefinition of the target quantity; the assumptions and duality are treated as external mathematical facts rather than constructed from the algorithm's outputs. This is the normal non-circular case for a convergence proof in approximate dynamic programming.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central construction rests on the modeling choice that uncertainty is captured by a Wasserstein ball and that the quantization step preserves the contraction property of the robust Bellman operator.

axioms (2)

domain assumption: The common noise law lies inside a Wasserstein ball of finite radius around a known nominal distribution.
This defines the robust objective and is invoked to justify the dual reformulation.
domain assumption: The quantization-and-projection operator yields a sufficiently accurate finite-state approximation for the robust value iteration to remain contractive.
Required for the finite-time bounds to hold in the discrete setting.

pith-pipeline@v0.9.1-grok · 5639 in / 1358 out tokens · 24743 ms · 2026-06-26T15:56:57.955417+00:00 · methodology

[1] [1]

Anahtarci, C

B. Anahtarci, C. D. Kariksiz, and N. Saldi.Q-learning in regularized mean-field games.Dyn. Games Appl., 13(1):89–117, 2023

2023

[2] [2]

Angiuli, J.-P

A. Angiuli, J.-P. Fouque, M. Laurière, and M. Zhang. Analysis of multiscale reinforcementQ-learning algo- rithms for mean field control games.Appl. Math. Optim., 93(2):27, 2026

2026

[3] [3]

K. Azuma. Weighted sums of certain dependent random variables.Tohoku Math. J., Second Series, 19(3):357– 367, 1967

1967

[4] [4]

Bartl, S

D. Bartl, S. Drapeau, and L. Tangpi. Computational aspects of robust optimized certainty equivalents and option pricing.Math. Finance, 30(1):287–309, 2020

2020

[5] [5]

Bäuerle and A

N. Bäuerle and A. Glauner. Distributionally robust Markov decision processes and their connection to risk measures.Math. Oper. Res., 47(3):1757–1780, 2022

2022

[6] [6]

Bäuerle and U

N. Bäuerle and U. Rieder.Markov decision processes with applications to finance. Springer Science & Business Media, 2011

2011

[7] [7]

Bauso, H

D. Bauso, H. Tembine, and T. Basar. Opinion dynamics in social networks through mean-field games.SIAM J. Control Optim., 54(6):3225–3257, 2016

2016

[8] [8]

Bauso, H

D. Bauso, H. Tembine, and T. Başar. Robust Mean Field Games.Dynam. Games Appl., 6(3):277–303, 2016

2016

[9] [9]

C. L. Beck and R. Srikant. Error bounds for constant step-sizeQ-learning.Syst. Control Lett., 61(12):1203– 1208, 2012

2012

[10] [10]

Bensoussan, J

A. Bensoussan, J. Frehse, and P. Yam.Mean field games and mean field type control theory, volume 101. New York: Springer-Verlag, 2013

2013

[11] [11]

D. P. Bertsekas. Neuro-dynamic programming. InEncyclopedia of optimization, pages 1–6. Springer, 2025

2025

[12] [12]

Blanchet, M

J. Blanchet, M. Lu, T. Zhang, and H. Zhong. Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage.Adv. Neural Inf. Process. Syst., 36:66845–66859, 2023

2023

[13] [13]

Blanchet and K

J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport.Math. Oper. Res., 44(2):565–600, 2019

2019

[14] [14]

Carmona and F

R. Carmona and F. Delarue.Probabilistic theory of mean field games with applications I-II. Springer, 2018

2018

[15] [15]

Carmona, K

R. Carmona, K. Hamidouche, M. Laurière, and Z. Tan. Policy optimization for linear-quadratic zero-sum Mean-Field Type Games. In2020 59th IEEE Conference on Decision and Control (CDC), pages 1038–1043. IEEE, 2020

2020

[16] [16]

Carmona, K

R. Carmona, K. Hamidouche, M. Laurière, and Z. Tan. Linear-quadratic zero-sum Mean-Field Type Games: Optimality conditions and policy optimization.J. Dyn. Games, 8(4), 2021

[17]

R. Carmona, M. Laurière, and Z. Tan. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. Ann. Appl. Probab., 33(6B):5334–5381, 2023

[18]

K. Cui and H. Koeppl. Approximately solving Mean Field Games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909–1917. PMLR, 2021

[19]

M. F. Djete. Extended mean field control problem: A propagation of chaos result. Electron. J. Probab., 27:1–53, 2022

[20]

M. F. Djete, D. Possamaï, and X. Tan. McKean–Vlasov optimal control: limit theory and equivalence between different formulations. Math. Oper. Res., 47(4):2891–2930, 2022

[21]

M. F. Djete, D. Possamaï, and X. Tan. McKean–Vlasov optimal control: The dynamic programming principle. Ann. Probab., 50(2):791–833, 2022

[22]

A. Dvoretsky. On stochastic approximation. Mathematics Division, Office of Scientific Research, US Air Force, 1955

[23]

K. Elamvazhuthi and S. Berman. Mean-field models in swarm robotics: A survey. Bioinsp. Biomim., 15(1):015001, 2019

[24]

R. Elie, J. Pérolat, M. Laurière, M. Geist, and O. Pietquin. On the convergence of model free learning in Mean Field Games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7143–7150, 2020

[25]

E. Even-Dar and Y. Mansour. Learning rates for Q-learning. J. Mach. Learn. Res., 5(Dec):1–25, 2003

[26]

D. Firoozi and S. Jaimungal. Exploratory LQG mean field games with entropy regularization. Automatica, 139:110177, 2022

[27]

J.-P. Fouque, R. Carmona, and L. Sun. Mean field games and systemic risk. Commun. Math. Sci., 13(4):911–933, 2015

[28]

N. Frikha, M. Germain, M. Laurière, H. Pham, and X. Song. Actor-critic learning for mean-field control in continuous time. J. Mach. Learn. Res., 26(127):1–42, 2025

[29]

S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. In ICML, pages 2052–2062. PMLR, 2019

[30]

R. Gao and A. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. Math. Oper. Res., 48(2):603–655, 2023

[31]

N. Gast and B. Gaujal. A mean field approach for optimization in discrete time. Discrete Event Dyn. Syst., 21(1):63–101, 2011

[32]

N. Gast, B. Gaujal, and J.-Y. Le Boudec. Mean field for Markov decision processes: from discrete to continuous optimization. IEEE Trans. Autom. Control, 57(9):2266–2280, 2012

[33]

H. Gu, X. Guo, X. Wei, and R. Xu. Mean-field controls with Q-learning for cooperative MARL: convergence and complexity analysis. SIAM J. Math. Data Sci., 3(4):1168–1196, 2021

[34]

H. Gu, X. Guo, X. Wei, and R. Xu. Dynamic programming principles for mean-field controls with learning. Oper. Res., 71(4):1040–1054, 2023

[35]

X. Guo, A. Hu, R. Xu, and J. Zhang. Learning Mean-Field Games. In Adv. Neural Inf. Process. Syst., volume 32, 2019

[36]

R. Hu and M. Laurière. Recent developments in machine learning methods for stochastic control and games. arXiv preprint arXiv:2303.10257, 2023

[37]

J. Huang and M. Huang. Robust mean field linear-quadratic-gaussian games with unknown $L^2$-disturbance. SIAM J. Control Optim., 55(5):2811–2840, 2017

[38]

J. Huang, B.-C. Wang, and J. Yong. Social optima in mean field linear-quadratic-Gaussian control with volatility uncertainty. SIAM J. Control Optim., 59(2):825–856, 2021

[39]

M. Huang, R. P. Malhamé, and P. E. Caines. Large population stochastic dynamic games: Closed loop McKean–Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst., 6(3):221–252, 2006

[40]

T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Comput., 6(6):1185–1201, 1994

[41]

B. Jeloka, Y. Guan, and P. Tsiotras. Learning large-scale competitive team behaviors with Mean-Field interactions. In The Seventeenth Workshop on Adaptive and Learning Agents, 2025

[42]

M. Kearns and S. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Adv. Neural Inf. Process. Syst., 11, 1998

[43]

A. Klenke. Probability Theory: A Comprehensive Course. Universitext. Springer, London, 2nd ed., 2014

[44]

S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement learning: State-of-the-art, pages 45–73. Springer, 2012

[45]

J. Langner, A. Neufeld, and K. Park. Markov-Nash equilibria in mean-field games under model uncertainty. preprint, arXiv:2410.11652, 2024

[46]

J.-M. Lasry and P.-L. Lions. Mean field games. Japan. J. Math., 2(1):229–260, 2007

[47]

M. Laurière. Numerical methods for Mean Field Games and Mean Field Type Control. In Proceedings of Symposia in Applied Mathematics, volume 78, pages 221–282. American Mathematical Society, 2021

[48]

M. Laurière, A. Neufeld, and K. Park. Robust mean-field control under common noise uncertainty. arXiv preprint arXiv:2511.04515, 2025

[49]

M. Laurière, S. Perrin, S. Girgin, P. Muller, A. Jain, T. Cabannes, G. Piliouras, J. Pérolat, R. Elie, O. Pietquin, et al. Scalable deep reinforcement learning algorithms for Mean Field Games. In ICML, pages 12078–12095. PMLR, 2022

[50]

M. Laurière, S. Perrin, J. Pérolat, S. Girgin, P. Muller, R. Elie, M. Geist, and O. Pietquin. Learning in Mean Field Games: A survey. arXiv preprint arXiv:2205.12944, 2022

[51]

M. Laurière and O. Pironneau. Dynamic programming for mean-field type control. J. Optim. Theory Appl., 169(3):902–924, 2016

[52]

S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020

[53]

G. Li, L. Shi, Y. Chen, Y. Chi, and Y. Wei. Settling the sample complexity of model-based offline reinforcement learning. Ann. Stat., 52(1):233–260, 2024

[54]

G. Li, Y. Wei, Y. Chi, and Y. Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Oper. Res., 72(1):203–221, 2024

[55]

G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Adv. Neural Inf. Process. Syst., 33:7031–7043, 2020

[56]

M. Li, D. Kuhn, and T. Sutter. Policy gradient algorithms for robust MDP with nonrectangular uncertainty sets. SIAM J. Optim., 36(1):120–151, 2026

[57]

Y. Liang, B.-C. Wang, and H. Zhang. Robust mean field linear quadratic social control: Open-loop and closed-loop strategies. SIAM J. Control Optim., 60(4):2184–2213, 2022

[58]

Z. Liang, Z. Zhou, Y. Zhuang, and B. Zou. Mean-field games under model uncertainty. arXiv preprint arXiv:2601.12226, 2026

[59]

M. L. Littman and C. Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In ICML, volume 96, pages 310–318, 1996

[60]

Z. Liu, Q. Bai, J. Blanchet, P. Dong, W. Xu, Z. Zhou, and Z. Zhou. Distributionally robust Q-learning. In ICML, pages 13623–13643. PMLR, 2022

[61]

C. I. Lu, J. Sester, and A. Zhang. Distributionally robust deep Q-learning. preprint, arXiv:2505.19058, 2025

[62]

P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Program., 171(1):115–166, 2018

[63]

J. Moon and T. Başar. Robust Mean Field Games for coupled Markov jump linear systems. Internat. J. Control, 89(7):1367–1381, 2016

[64]

M. Motte and H. Pham. Mean-field Markov decision processes with common noise and open-loop controls. Ann. Appl. Probab., 32(2):1421–1458, 2022

[65]

M. Motte and H. Pham. Quantitative propagation of chaos for mean field Markov decision process with common noise. Electron. J. Probab., 28:1–24, 2023

[66]

A. Neufeld and J. Sester. Robust Q-learning algorithm for Markov decision processes under Wasserstein uncertainty. Automatica, 168:111825, 2024

[67]

K. Panaganti, Z. Xu, D. Kalathil, and M. Ghavamzadeh. Robust reinforcement learning using offline data. Adv. Neural Inf. Process. Syst., 35:32211–32224, 2022

[68]

H. Pham and X. Wei. Discrete time McKean–Vlasov control problem: A dynamic programming approach. Appl. Math. Optim., 74(3):487–506, 2016

[69]

H. Pham and X. Wei. Dynamic programming for optimal control of stochastic McKean–Vlasov dynamics. SIAM J. Control Optim., 55(2):1069–1101, 2017

[70]

G. Qu and A. Wierman. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In COLT, pages 3185–3205. PMLR, 2020

[71]

P. Rashidinejad, B. Zhu, C. Ma, J. Jiao, and S. Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. IEEE Trans. Inf. Theory, 68(12):8156–8196, 2022

[72]

Z. Ren, X. Wei, X. Yu, and X. Y. Zhou. Continuous-time q-learning for mean-field control with common noise, Part I: Theoretical foundations. arXiv preprint arXiv:2604.27372, 2026

[73]

Z. Ren, X. Wei, X. Yu, and X. Y. Zhou. Continuous-time q-learning for mean-field control with common noise, Part II: q-learning algorithms. arXiv preprint arXiv:2604.27378, 2026

[74]

H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., pages 400–407, 1951

[75]

A. Roy, H. Xu, and S. Pokutta. Reinforcement learning under model mismatch. Adv. Neural Inf. Process. Syst., 30, 2017

[76]

S. Sanjari and S. Yüksel. Optimal solutions to infinite-player stochastic teams and mean-field teams. IEEE Trans. Autom. Control, 66(3):1071–1086, 2020

[77]

J. Sester and C. Decker. Q-learning under finite model uncertainty. arXiv e-prints, pages arXiv–2407, 2024

[78]

K. Shao, J. Shen, and M. Laurière. Reinforcement learning for finite space Mean-Field Type Game. In Reinforcement Learning Conference, 2025

[79]

J. Subramanian and A. Mahajan. Reinforcement learning in stationary Mean-Field Games. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, pages 251–259, 2019

[80]

C. Szepesvári. The asymptotic convergence-rate of Q-learning. Adv. Neural Inf. Process. Syst., 10, 1997
