An Online Learning Approach for Two-Player Zero-Sum Linear Quadratic Games

Andreas A. Malikopoulos; Shanting Wang; Weihao Sun

arxiv: 2604.02619 · v1 · submitted 2026-04-03 · 📡 eess.SY · cs.SY

An Online Learning Approach for Two-Player Zero-Sum Linear Quadratic Games

Shanting Wang , Weihao Sun , Andreas A. Malikopoulos This is my paper

Pith reviewed 2026-05-13 20:44 UTC · model grok-4.3

classification 📡 eess.SY cs.SY

keywords online learningzero-sum gameslinear quadraticregret analysismodel estimationconfidence setsgeneralized algebraic Riccati equationsurrogate model

0 comments

The pith

An online algorithm learns stabilizing policies for two-player zero-sum linear quadratic games by selecting surrogate models inside high-probability confidence sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an online learning method for two-player zero-sum linear quadratic games with unknown dynamics. It combines regularized least squares model estimation with high-probability confidence sets to bound uncertainty around the estimated system. A shrinkage step at each episode then selects a surrogate model inside that set for which the generalized algebraic Riccati equation admits a stabilizing saddle-point solution. This surrogate supports policy updates while preserving stability and enables a regret analysis that bounds the cumulative performance loss. A numerical example illustrates convergence to equilibrium under the derived bounds.

Core claim

By applying a shrinkage step at each episode to identify a surrogate model in the region where the generalized algebraic Riccati equation admits a stabilizing saddle point solution, the framework maintains a regular model for policy updates, establishes regret bounds on algorithm convergence, and guarantees stability for two-player zero-sum linear quadratic games with unknown dynamics.

What carries the argument

The shrinkage step that selects a surrogate model inside the high-probability confidence set around the estimated dynamics such that the generalized algebraic Riccati equation yields a stabilizing saddle-point solution.

If this is right

Policy updates remain stabilizing at every episode.
The algorithm achieves bounded regret relative to the optimal equilibrium.
The learned policies converge to the Nash equilibrium of the game.
The approach extends model-based learning to adversarial zero-sum settings while preserving closed-loop stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on physical platforms such as competitive robot navigation where dynamics are learned on the fly.
Relaxing the surrogate existence assumption might allow broader classes of uncertain dynamics.
The confidence-set plus shrinkage pattern may transfer to other robust control problems with saddle-point objectives.

Load-bearing premise

A surrogate model always exists inside the high-probability confidence set for which the generalized algebraic Riccati equation admits a stabilizing saddle-point solution, and the shrinkage step can reliably locate it.

What would settle it

A simulation or experiment in which no model inside the constructed confidence set admits a stabilizing saddle-point solution for the generalized algebraic Riccati equation, causing the shrinkage step to fail and the closed-loop system to become unstable.

Figures

Figures reproduced from arXiv: 2604.02619 by Andreas A. Malikopoulos, Shanting Wang, Weihao Sun.

**Figure 2.** Figure 2: Comparison of θe and θb (left), and regret convergence (right). confidence failure probability δ = 0.2. The initial state is chosen as x0 = [1.2, −0.90, 0.70]⊤. The initial candidate model is selected as a perturbed version of the true parameter matrix and is accepted only if it satisfies the regularity test. A benchmark value in the regret analysis is computed as J⋆ = tr(P⋆Σw) where P⋆ is the Riccati solu… view at source ↗

**Figure 1.** Figure 1: Theta estimation error (left) and gain error for both players (right). [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

read the original abstract

In this paper, we present an online learning approach for two-player zero-sum linear quadratic games with unknown dynamics. We develop a framework combining regularized least squares model estimation, high probability confidence sets, and surrogate model selection to maintain a regular model for policy updates. We apply a shrinkage step at each episode to identify a surrogate model in the region where the generalized algebraic Riccati equation admits a stabilizing saddle point solution. We then establish regret analysis on algorithm convergence, followed by a numerical example to illustrate the convergence performance and verify the regret analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete online algorithm for zero-sum LQ games using regularized estimation and shrinkage to a surrogate model, but the guarantee that such a surrogate always exists inside the confidence set is assumed rather than shown.

read the letter

The paper develops an online learning method for two-player zero-sum linear quadratic games with unknown system dynamics. It combines regularized least squares for model estimation, high-probability confidence sets around the estimate, and a shrinkage procedure to select a surrogate model inside the set that admits a stabilizing saddle-point solution to the generalized algebraic Riccati equation. From there it updates policies and derives a regret bound, with a numerical simulation to check performance. The new part is extending the online LQ control ideas to the zero-sum game setting while keeping the model regular enough for the Riccati solution to exist at each step. The shrinkage step is a concrete way to enforce that. The regret analysis and the example are the main deliverables. The work is done carefully on the estimation side and the numerical check looks reasonable for illustration. The regret guarantee is stated explicitly. The main concern is whether the surrogate model is guaranteed to exist inside the confidence set at every episode. The construction assumes the shrinkage always finds one where the GARE has a stabilizing saddle point, but there is no argument showing the intersection is nonempty or that the operator stays inside the high-probability event. If for some noise realizations the set has no such model, the algorithm stops or the bound fails. That needs to be addressed for the result to be complete. Overall this is for control theorists and online learning researchers who care about competitive settings with partial model knowledge. It gives a usable algorithm with theoretical backing, so a reader working on adaptive multi-agent systems would find it worth reading. The paper is coherent enough to go to peer review, though the existence question will likely come up in review. I would recommend sending it out for refereeing.

Referee Report

2 major / 2 minor

Summary. The paper develops an online learning algorithm for two-player zero-sum linear-quadratic games with unknown dynamics. It combines regularized least-squares estimation, high-probability confidence ellipsoids, and a shrinkage step that selects a surrogate model inside the ellipsoid for which the generalized algebraic Riccati equation admits a stabilizing saddle-point solution. Regret bounds are stated for the resulting policy sequence, and a numerical example is provided to illustrate convergence.

Significance. If the central guarantee that a stabilizing surrogate always exists inside the confidence set holds and the regret analysis is free of circular dependence on post-selection quantities, the work would extend existing single-player adaptive LQR results to the zero-sum game setting and supply a concrete template for maintaining well-posed Riccati solutions under uncertainty. The numerical illustration is consistent with the claimed convergence rate but does not yet constitute independent verification of the high-probability claims.

major comments (2)

[Algorithm description and main theorem] The manuscript assumes without proof that the intersection of the high-probability confidence ellipsoid with the set of models admitting a stabilizing saddle-point solution for the generalized algebraic Riccati equation is nonempty at every episode (see the description of the shrinkage step following the model-estimation phase). No explicit argument or lemma establishes that the shrinkage operator is guaranteed to land inside this intersection on the high-probability event; if the intersection is empty for some realizations, the policy-update step is undefined and the subsequent regret bound does not apply.
[Regret analysis] The regret analysis appears to condition on the event that a valid surrogate model is always selected. It is unclear whether the high-probability bound accounts for the probability that the shrinkage step fails to produce such a model, or whether an additional union bound or restart mechanism is introduced. This conditioning must be made explicit before the regret statement can be considered unconditional.

minor comments (2)

[Preliminaries] Notation for the generalized algebraic Riccati equation and the associated saddle-point solution should be introduced with a self-contained definition before the shrinkage step is described.
[Numerical example] The numerical example would benefit from reporting the empirical frequency with which the shrinkage step succeeds across Monte-Carlo runs, together with the observed regret slope on a log-log scale to allow direct visual comparison with the stated bound.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. We address the two major comments point by point below. Both points identify gaps in the current exposition that we will close in revision.

read point-by-point responses

Referee: The manuscript assumes without proof that the intersection of the high-probability confidence ellipsoid with the set of models admitting a stabilizing saddle-point solution for the generalized algebraic Riccati equation is nonempty at every episode (see the description of the shrinkage step following the model-estimation phase). No explicit argument or lemma establishes that the shrinkage operator is guaranteed to land inside this intersection on the high-probability event; if the intersection is empty for some realizations, the policy-update step is undefined and the subsequent regret bound does not apply.

Authors: We agree that the current manuscript does not contain an explicit lemma establishing non-emptiness of the intersection on the high-probability event. In the revision we will insert a new lemma (placed immediately before the algorithm description) that proves the following: on the event that the true dynamics lie inside the confidence ellipsoid (which holds with probability at least 1-δ), the ellipsoid necessarily intersects the set of models for which the GARE admits a stabilizing saddle-point solution. The argument relies on (i) the true model being stabilizing by assumption, (ii) continuity of the stabilizing solution of the GARE with respect to the system matrices (standard in the LQ literature), and (iii) the radius of the ellipsoid shrinking to zero. The shrinkage operator will then be formally defined as any selection rule that returns a point inside this intersection (e.g., a convex combination between the least-squares estimate and the boundary of the ellipsoid). revision: yes
Referee: The regret analysis appears to condition on the event that a valid surrogate model is always selected. It is unclear whether the high-probability bound accounts for the probability that the shrinkage step fails to produce such a model, or whether an additional union bound or restart mechanism is introduced. This conditioning must be made explicit before the regret statement can be considered unconditional.

Authors: The regret theorem is currently stated under the event that a valid surrogate is selected at every episode. In the revision we will (a) make this conditioning explicit in the theorem statement, (b) invoke the new lemma above to show that the probability of this event is at least 1-δ, and (c) absorb the complementary probability into the final high-probability regret bound via a simple union bound over the finite number of episodes. No restart mechanism is introduced; the failure probability is controlled directly by the choice of δ and the sample-size lower bound already present in the analysis. The revised theorem will therefore read as an unconditional high-probability statement. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's framework uses regularized least-squares estimation to build high-probability confidence sets, followed by an explicit shrinkage step to select a surrogate model inside the set where the generalized algebraic Riccati equation has a stabilizing saddle-point solution. The regret bound is then derived from this construction and standard online-learning arguments. No quoted step reduces a claimed prediction or result to a fitted input by definition, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the paper. The central modeling assumption is stated outright rather than smuggled in via renaming or self-referential uniqueness. The numerical example serves only as illustration, leaving the analytic chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger cannot be populated with concrete free parameters or invented entities. The approach implicitly relies on standard domain assumptions from LQ game theory (existence of stabilizing saddle points for the GARE) and online learning (bounded noise, stabilizability).

pith-pipeline@v0.9.0 · 5386 in / 1115 out tokens · 52362 ms · 2026-05-13T20:44:01.313431+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We apply a shrinkage step at each episode to identify a surrogate model in the region where the generalized algebraic Riccati equation admits a stabilizing saddle point solution.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1]

Distributed nash equilib- rium seeking in games with partial decision information: A survey,

M. Ye, Q.-L. Han, L. Ding, and S. Xu, “Distributed nash equilib- rium seeking in games with partial decision information: A survey,” Proceedings of the IEEE, vol. 111, no. 2, pp. 140–157, 2023

work page 2023
[2]

When does selfishness align with team goals? A structural analysis of equilibrium and optimality,

G. Xu, T. Parisini, and A. A. Malikopoulos, “When does selfishness align with team goals? A structural analysis of equilibrium and optimality,”arXiv preprint arXiv:2508.13450, 2025

work page arXiv 2025
[3]

F. L. Lewis, D. Vrabie, and V . L. Syrmos,Optimal control. John Wiley & Sons, 2012

work page 2012
[4]

Chance- constrained linear quadratic gaussian games for multi-robot interaction under uncertainty,

K. Ren, G. Salizzoni, M. E. G ¨ursoy, and M. Kamgarpour, “Chance- constrained linear quadratic gaussian games for multi-robot interaction under uncertainty,”IEEE Control Systems Letters, 2025

work page 2025
[5]

Multi-Robot Cooperative Navigation in Crowds: A Game-Theoretic Learning-Based Model Predictive Control Approach,

V .-A. Le, B. Chalaki, V . Tadiparthi, H. N. Mahjoub, J. D’sa, E. Moradi- Pari, and A. A. Malikopoulos, “Multi-Robot Cooperative Navigation in Crowds: A Game-Theoretic Learning-Based Model Predictive Control Approach,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 4834–4840

work page 2024
[6]

Potential game-based decision-making for autonomous driving,

M. Liu, I. Kolmanovsky, H. E. Tseng, S. Huang, D. Filev, and A. Girard, “Potential game-based decision-making for autonomous driving,”IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 8, pp. 8014–8027, 2023

work page 2023
[7]

Game- theoretic learning-based mitigation of insider threats,

G. Xu, K. Chen, T. Parisini, and A. A. Malikopoulos, “Game- theoretic learning-based mitigation of insider threats,”arXiv preprint arXiv:2512.03222, 2025

work page arXiv 2025
[8]

Mechanism design theory in control engineering: A tutorial and overview of applications in communication, power grid, transportation, and security systems,

I. V . Chremos and A. A. Malikopoulos, “Mechanism design theory in control engineering: A tutorial and overview of applications in communication, power grid, transportation, and security systems,”IEEE Control Systems, vol. 44, no. 1, pp. 20–45, 2024

work page 2024
[9]

Bas ¸ar and P

T. Bas ¸ar and P. Bernhard,H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, 2008

work page 2008
[10]

Separation of learning and control for cyber- physical systems,

A. A. Malikopoulos, “Separation of learning and control for cyber- physical systems,”Automatica, vol. 151, no. 110912, 2023

work page 2023
[11]

Combining learning and control in linear systems,

——, “Combining learning and control in linear systems,”European Journal of Control, vol. 80, no. Part A, p. 101043, 2024

work page 2024
[12]

Combined learning and control: A new paradigm for optimal control with unknown dynamics,

P. Kounatidis and A. A. Malikopoulos, “Combined learning and control: A new paradigm for optimal control with unknown dynamics,” in65th American Control Conference (ACC), 2025, in review

work page 2025
[13]

P. A. Ioannou and J. Sun,Robust adaptive control. PTR Prentice-Hall Upper Saddle River, NJ, 1996, vol. 1

work page 1996
[14]

Regret bounds for robust adaptive control of the linear quadratic regulator,

S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,”Advances in Neural Information Processing Systems, vol. 31, 2018

work page 2018
[15]

Certainty equivalence is efficient for linear quadratic control,

H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,”Advances in neural information processing systems, vol. 32, 2019

work page 2019
[16]

Naive exploration is optimal for online lqr,

M. Simchowitz and D. Foster, “Naive exploration is optimal for online lqr,” inInternational Conference on Machine Learning. PMLR, 2020, pp. 8937–8948

work page 2020
[17]

Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games,

K. Zhang, Z. Yang, and T. Basar, “Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games,” Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019
[18]

Model-free q-learning designs for linear discrete-time zero-sum games with application to h-infinity control,

A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Model-free q-learning designs for linear discrete-time zero-sum games with application to h-infinity control,”Automatica, vol. 43, no. 3, pp. 473–481, 2007

work page 2007
[19]

Nash equilibria for linear quadratic discrete-time dynamic games via iterative and data-driven algorithms,

B. Nortmann, A. Monti, M. Sassano, and T. Mylvaganam, “Nash equilibria for linear quadratic discrete-time dynamic games via iterative and data-driven algorithms,”IEEE Transactions on Automatic Control, vol. 69, no. 10, pp. 6561–6575, 2024

work page 2024
[20]

Improved algorithms for linear stochastic bandits,

Y . Abbasi-Yadkori, D. P´al, and C. Szepesv´ari, “Improved algorithms for linear stochastic bandits,”Advances in neural information processing systems, vol. 24, 2011

work page 2011
[21]

Regret bounds for the adaptive control of linear quadratic systems,

Y . Abbasi-Yadkori and C. Szepesv´ari, “Regret bounds for the adaptive control of linear quadratic systems,” inProceedings of the 24th annual conference on learning theory. JMLR Workshop and Conference Proceedings, 2011, pp. 1–26. APPENDIX A. Proof of Theorem 1 Proof:The proof proceeds in three steps: (i) we show that DP F(P ⋆, θ⋆) reduces to an invertibl...

work page 2011

[1] [1]

Distributed nash equilib- rium seeking in games with partial decision information: A survey,

M. Ye, Q.-L. Han, L. Ding, and S. Xu, “Distributed nash equilib- rium seeking in games with partial decision information: A survey,” Proceedings of the IEEE, vol. 111, no. 2, pp. 140–157, 2023

work page 2023

[2] [2]

When does selfishness align with team goals? A structural analysis of equilibrium and optimality,

G. Xu, T. Parisini, and A. A. Malikopoulos, “When does selfishness align with team goals? A structural analysis of equilibrium and optimality,”arXiv preprint arXiv:2508.13450, 2025

work page arXiv 2025

[3] [3]

F. L. Lewis, D. Vrabie, and V . L. Syrmos,Optimal control. John Wiley & Sons, 2012

work page 2012

[4] [4]

Chance- constrained linear quadratic gaussian games for multi-robot interaction under uncertainty,

K. Ren, G. Salizzoni, M. E. G ¨ursoy, and M. Kamgarpour, “Chance- constrained linear quadratic gaussian games for multi-robot interaction under uncertainty,”IEEE Control Systems Letters, 2025

work page 2025

[5] [5]

Multi-Robot Cooperative Navigation in Crowds: A Game-Theoretic Learning-Based Model Predictive Control Approach,

V .-A. Le, B. Chalaki, V . Tadiparthi, H. N. Mahjoub, J. D’sa, E. Moradi- Pari, and A. A. Malikopoulos, “Multi-Robot Cooperative Navigation in Crowds: A Game-Theoretic Learning-Based Model Predictive Control Approach,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 4834–4840

work page 2024

[6] [6]

Potential game-based decision-making for autonomous driving,

M. Liu, I. Kolmanovsky, H. E. Tseng, S. Huang, D. Filev, and A. Girard, “Potential game-based decision-making for autonomous driving,”IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 8, pp. 8014–8027, 2023

work page 2023

[7] [7]

Game- theoretic learning-based mitigation of insider threats,

G. Xu, K. Chen, T. Parisini, and A. A. Malikopoulos, “Game- theoretic learning-based mitigation of insider threats,”arXiv preprint arXiv:2512.03222, 2025

work page arXiv 2025

[8] [8]

Mechanism design theory in control engineering: A tutorial and overview of applications in communication, power grid, transportation, and security systems,

I. V . Chremos and A. A. Malikopoulos, “Mechanism design theory in control engineering: A tutorial and overview of applications in communication, power grid, transportation, and security systems,”IEEE Control Systems, vol. 44, no. 1, pp. 20–45, 2024

work page 2024

[9] [9]

Bas ¸ar and P

T. Bas ¸ar and P. Bernhard,H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, 2008

work page 2008

[10] [10]

Separation of learning and control for cyber- physical systems,

A. A. Malikopoulos, “Separation of learning and control for cyber- physical systems,”Automatica, vol. 151, no. 110912, 2023

work page 2023

[11] [11]

Combining learning and control in linear systems,

——, “Combining learning and control in linear systems,”European Journal of Control, vol. 80, no. Part A, p. 101043, 2024

work page 2024

[12] [12]

Combined learning and control: A new paradigm for optimal control with unknown dynamics,

P. Kounatidis and A. A. Malikopoulos, “Combined learning and control: A new paradigm for optimal control with unknown dynamics,” in65th American Control Conference (ACC), 2025, in review

work page 2025

[13] [13]

P. A. Ioannou and J. Sun,Robust adaptive control. PTR Prentice-Hall Upper Saddle River, NJ, 1996, vol. 1

work page 1996

[14] [14]

Regret bounds for robust adaptive control of the linear quadratic regulator,

S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,”Advances in Neural Information Processing Systems, vol. 31, 2018

work page 2018

[15] [15]

Certainty equivalence is efficient for linear quadratic control,

H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,”Advances in neural information processing systems, vol. 32, 2019

work page 2019

[16] [16]

Naive exploration is optimal for online lqr,

M. Simchowitz and D. Foster, “Naive exploration is optimal for online lqr,” inInternational Conference on Machine Learning. PMLR, 2020, pp. 8937–8948

work page 2020

[17] [17]

Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games,

K. Zhang, Z. Yang, and T. Basar, “Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games,” Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019

[18] [18]

Model-free q-learning designs for linear discrete-time zero-sum games with application to h-infinity control,

A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Model-free q-learning designs for linear discrete-time zero-sum games with application to h-infinity control,”Automatica, vol. 43, no. 3, pp. 473–481, 2007

work page 2007

[19] [19]

Nash equilibria for linear quadratic discrete-time dynamic games via iterative and data-driven algorithms,

B. Nortmann, A. Monti, M. Sassano, and T. Mylvaganam, “Nash equilibria for linear quadratic discrete-time dynamic games via iterative and data-driven algorithms,”IEEE Transactions on Automatic Control, vol. 69, no. 10, pp. 6561–6575, 2024

work page 2024

[20] [20]

Improved algorithms for linear stochastic bandits,

Y . Abbasi-Yadkori, D. P´al, and C. Szepesv´ari, “Improved algorithms for linear stochastic bandits,”Advances in neural information processing systems, vol. 24, 2011

work page 2011

[21] [21]

Regret bounds for the adaptive control of linear quadratic systems,

Y . Abbasi-Yadkori and C. Szepesv´ari, “Regret bounds for the adaptive control of linear quadratic systems,” inProceedings of the 24th annual conference on learning theory. JMLR Workshop and Conference Proceedings, 2011, pp. 1–26. APPENDIX A. Proof of Theorem 1 Proof:The proof proceeds in three steps: (i) we show that DP F(P ⋆, θ⋆) reduces to an invertibl...

work page 2011