pith. sign in

arxiv: 2604.02619 · v1 · submitted 2026-04-03 · 📡 eess.SY · cs.SY

An Online Learning Approach for Two-Player Zero-Sum Linear Quadratic Games

Pith reviewed 2026-05-13 20:44 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords online learningzero-sum gameslinear quadraticregret analysismodel estimationconfidence setsgeneralized algebraic Riccati equationsurrogate model
0
0 comments X

The pith

An online algorithm learns stabilizing policies for two-player zero-sum linear quadratic games by selecting surrogate models inside high-probability confidence sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an online learning method for two-player zero-sum linear quadratic games with unknown dynamics. It combines regularized least squares model estimation with high-probability confidence sets to bound uncertainty around the estimated system. A shrinkage step at each episode then selects a surrogate model inside that set for which the generalized algebraic Riccati equation admits a stabilizing saddle-point solution. This surrogate supports policy updates while preserving stability and enables a regret analysis that bounds the cumulative performance loss. A numerical example illustrates convergence to equilibrium under the derived bounds.

Core claim

By applying a shrinkage step at each episode to identify a surrogate model in the region where the generalized algebraic Riccati equation admits a stabilizing saddle point solution, the framework maintains a regular model for policy updates, establishes regret bounds on algorithm convergence, and guarantees stability for two-player zero-sum linear quadratic games with unknown dynamics.

What carries the argument

The shrinkage step that selects a surrogate model inside the high-probability confidence set around the estimated dynamics such that the generalized algebraic Riccati equation yields a stabilizing saddle-point solution.

If this is right

  • Policy updates remain stabilizing at every episode.
  • The algorithm achieves bounded regret relative to the optimal equilibrium.
  • The learned policies converge to the Nash equilibrium of the game.
  • The approach extends model-based learning to adversarial zero-sum settings while preserving closed-loop stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on physical platforms such as competitive robot navigation where dynamics are learned on the fly.
  • Relaxing the surrogate existence assumption might allow broader classes of uncertain dynamics.
  • The confidence-set plus shrinkage pattern may transfer to other robust control problems with saddle-point objectives.

Load-bearing premise

A surrogate model always exists inside the high-probability confidence set for which the generalized algebraic Riccati equation admits a stabilizing saddle-point solution, and the shrinkage step can reliably locate it.

What would settle it

A simulation or experiment in which no model inside the constructed confidence set admits a stabilizing saddle-point solution for the generalized algebraic Riccati equation, causing the shrinkage step to fail and the closed-loop system to become unstable.

Figures

Figures reproduced from arXiv: 2604.02619 by Andreas A. Malikopoulos, Shanting Wang, Weihao Sun.

Figure 2
Figure 2. Figure 2: Comparison of θe and θb (left), and regret convergence (right). confidence failure probability δ = 0.2. The initial state is chosen as x0 = [1.2, −0.90, 0.70]⊤. The initial candidate model is selected as a perturbed version of the true parameter matrix and is accepted only if it satisfies the regularity test. A benchmark value in the regret analysis is computed as J⋆ = tr(P⋆Σw) where P⋆ is the Riccati solu… view at source ↗
Figure 1
Figure 1. Figure 1: Theta estimation error (left) and gain error for both players (right). [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
read the original abstract

In this paper, we present an online learning approach for two-player zero-sum linear quadratic games with unknown dynamics. We develop a framework combining regularized least squares model estimation, high probability confidence sets, and surrogate model selection to maintain a regular model for policy updates. We apply a shrinkage step at each episode to identify a surrogate model in the region where the generalized algebraic Riccati equation admits a stabilizing saddle point solution. We then establish regret analysis on algorithm convergence, followed by a numerical example to illustrate the convergence performance and verify the regret analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops an online learning algorithm for two-player zero-sum linear-quadratic games with unknown dynamics. It combines regularized least-squares estimation, high-probability confidence ellipsoids, and a shrinkage step that selects a surrogate model inside the ellipsoid for which the generalized algebraic Riccati equation admits a stabilizing saddle-point solution. Regret bounds are stated for the resulting policy sequence, and a numerical example is provided to illustrate convergence.

Significance. If the central guarantee that a stabilizing surrogate always exists inside the confidence set holds and the regret analysis is free of circular dependence on post-selection quantities, the work would extend existing single-player adaptive LQR results to the zero-sum game setting and supply a concrete template for maintaining well-posed Riccati solutions under uncertainty. The numerical illustration is consistent with the claimed convergence rate but does not yet constitute independent verification of the high-probability claims.

major comments (2)
  1. [Algorithm description and main theorem] The manuscript assumes without proof that the intersection of the high-probability confidence ellipsoid with the set of models admitting a stabilizing saddle-point solution for the generalized algebraic Riccati equation is nonempty at every episode (see the description of the shrinkage step following the model-estimation phase). No explicit argument or lemma establishes that the shrinkage operator is guaranteed to land inside this intersection on the high-probability event; if the intersection is empty for some realizations, the policy-update step is undefined and the subsequent regret bound does not apply.
  2. [Regret analysis] The regret analysis appears to condition on the event that a valid surrogate model is always selected. It is unclear whether the high-probability bound accounts for the probability that the shrinkage step fails to produce such a model, or whether an additional union bound or restart mechanism is introduced. This conditioning must be made explicit before the regret statement can be considered unconditional.
minor comments (2)
  1. [Preliminaries] Notation for the generalized algebraic Riccati equation and the associated saddle-point solution should be introduced with a self-contained definition before the shrinkage step is described.
  2. [Numerical example] The numerical example would benefit from reporting the empirical frequency with which the shrinkage step succeeds across Monte-Carlo runs, together with the observed regret slope on a log-log scale to allow direct visual comparison with the stated bound.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. We address the two major comments point by point below. Both points identify gaps in the current exposition that we will close in revision.

read point-by-point responses
  1. Referee: The manuscript assumes without proof that the intersection of the high-probability confidence ellipsoid with the set of models admitting a stabilizing saddle-point solution for the generalized algebraic Riccati equation is nonempty at every episode (see the description of the shrinkage step following the model-estimation phase). No explicit argument or lemma establishes that the shrinkage operator is guaranteed to land inside this intersection on the high-probability event; if the intersection is empty for some realizations, the policy-update step is undefined and the subsequent regret bound does not apply.

    Authors: We agree that the current manuscript does not contain an explicit lemma establishing non-emptiness of the intersection on the high-probability event. In the revision we will insert a new lemma (placed immediately before the algorithm description) that proves the following: on the event that the true dynamics lie inside the confidence ellipsoid (which holds with probability at least 1-δ), the ellipsoid necessarily intersects the set of models for which the GARE admits a stabilizing saddle-point solution. The argument relies on (i) the true model being stabilizing by assumption, (ii) continuity of the stabilizing solution of the GARE with respect to the system matrices (standard in the LQ literature), and (iii) the radius of the ellipsoid shrinking to zero. The shrinkage operator will then be formally defined as any selection rule that returns a point inside this intersection (e.g., a convex combination between the least-squares estimate and the boundary of the ellipsoid). revision: yes

  2. Referee: The regret analysis appears to condition on the event that a valid surrogate model is always selected. It is unclear whether the high-probability bound accounts for the probability that the shrinkage step fails to produce such a model, or whether an additional union bound or restart mechanism is introduced. This conditioning must be made explicit before the regret statement can be considered unconditional.

    Authors: The regret theorem is currently stated under the event that a valid surrogate is selected at every episode. In the revision we will (a) make this conditioning explicit in the theorem statement, (b) invoke the new lemma above to show that the probability of this event is at least 1-δ, and (c) absorb the complementary probability into the final high-probability regret bound via a simple union bound over the finite number of episodes. No restart mechanism is introduced; the failure probability is controlled directly by the choice of δ and the sample-size lower bound already present in the analysis. The revised theorem will therefore read as an unconditional high-probability statement. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's framework uses regularized least-squares estimation to build high-probability confidence sets, followed by an explicit shrinkage step to select a surrogate model inside the set where the generalized algebraic Riccati equation has a stabilizing saddle-point solution. The regret bound is then derived from this construction and standard online-learning arguments. No quoted step reduces a claimed prediction or result to a fitted input by definition, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the paper. The central modeling assumption is stated outright rather than smuggled in via renaming or self-referential uniqueness. The numerical example serves only as illustration, leaving the analytic chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger cannot be populated with concrete free parameters or invented entities. The approach implicitly relies on standard domain assumptions from LQ game theory (existence of stabilizing saddle points for the GARE) and online learning (bounded noise, stabilizability).

pith-pipeline@v0.9.0 · 5386 in / 1115 out tokens · 52362 ms · 2026-05-13T20:44:01.313431+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

  1. [1]

    Distributed nash equilib- rium seeking in games with partial decision information: A survey,

    M. Ye, Q.-L. Han, L. Ding, and S. Xu, “Distributed nash equilib- rium seeking in games with partial decision information: A survey,” Proceedings of the IEEE, vol. 111, no. 2, pp. 140–157, 2023

  2. [2]

    When does selfishness align with team goals? A structural analysis of equilibrium and optimality,

    G. Xu, T. Parisini, and A. A. Malikopoulos, “When does selfishness align with team goals? A structural analysis of equilibrium and optimality,”arXiv preprint arXiv:2508.13450, 2025

  3. [3]

    F. L. Lewis, D. Vrabie, and V . L. Syrmos,Optimal control. John Wiley & Sons, 2012

  4. [4]

    Chance- constrained linear quadratic gaussian games for multi-robot interaction under uncertainty,

    K. Ren, G. Salizzoni, M. E. G ¨ursoy, and M. Kamgarpour, “Chance- constrained linear quadratic gaussian games for multi-robot interaction under uncertainty,”IEEE Control Systems Letters, 2025

  5. [5]

    Multi-Robot Cooperative Navigation in Crowds: A Game-Theoretic Learning-Based Model Predictive Control Approach,

    V .-A. Le, B. Chalaki, V . Tadiparthi, H. N. Mahjoub, J. D’sa, E. Moradi- Pari, and A. A. Malikopoulos, “Multi-Robot Cooperative Navigation in Crowds: A Game-Theoretic Learning-Based Model Predictive Control Approach,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 4834–4840

  6. [6]

    Potential game-based decision-making for autonomous driving,

    M. Liu, I. Kolmanovsky, H. E. Tseng, S. Huang, D. Filev, and A. Girard, “Potential game-based decision-making for autonomous driving,”IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 8, pp. 8014–8027, 2023

  7. [7]

    Game- theoretic learning-based mitigation of insider threats,

    G. Xu, K. Chen, T. Parisini, and A. A. Malikopoulos, “Game- theoretic learning-based mitigation of insider threats,”arXiv preprint arXiv:2512.03222, 2025

  8. [8]

    Mechanism design theory in control engineering: A tutorial and overview of applications in communication, power grid, transportation, and security systems,

    I. V . Chremos and A. A. Malikopoulos, “Mechanism design theory in control engineering: A tutorial and overview of applications in communication, power grid, transportation, and security systems,”IEEE Control Systems, vol. 44, no. 1, pp. 20–45, 2024

  9. [9]

    Bas ¸ar and P

    T. Bas ¸ar and P. Bernhard,H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, 2008

  10. [10]

    Separation of learning and control for cyber- physical systems,

    A. A. Malikopoulos, “Separation of learning and control for cyber- physical systems,”Automatica, vol. 151, no. 110912, 2023

  11. [11]

    Combining learning and control in linear systems,

    ——, “Combining learning and control in linear systems,”European Journal of Control, vol. 80, no. Part A, p. 101043, 2024

  12. [12]

    Combined learning and control: A new paradigm for optimal control with unknown dynamics,

    P. Kounatidis and A. A. Malikopoulos, “Combined learning and control: A new paradigm for optimal control with unknown dynamics,” in65th American Control Conference (ACC), 2025, in review

  13. [13]

    P. A. Ioannou and J. Sun,Robust adaptive control. PTR Prentice-Hall Upper Saddle River, NJ, 1996, vol. 1

  14. [14]

    Regret bounds for robust adaptive control of the linear quadratic regulator,

    S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,”Advances in Neural Information Processing Systems, vol. 31, 2018

  15. [15]

    Certainty equivalence is efficient for linear quadratic control,

    H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,”Advances in neural information processing systems, vol. 32, 2019

  16. [16]

    Naive exploration is optimal for online lqr,

    M. Simchowitz and D. Foster, “Naive exploration is optimal for online lqr,” inInternational Conference on Machine Learning. PMLR, 2020, pp. 8937–8948

  17. [17]

    Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games,

    K. Zhang, Z. Yang, and T. Basar, “Policy optimization provably converges to nash equilibria in zero-sum linear quadratic games,” Advances in Neural Information Processing Systems, vol. 32, 2019

  18. [18]

    Model-free q-learning designs for linear discrete-time zero-sum games with application to h-infinity control,

    A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Model-free q-learning designs for linear discrete-time zero-sum games with application to h-infinity control,”Automatica, vol. 43, no. 3, pp. 473–481, 2007

  19. [19]

    Nash equilibria for linear quadratic discrete-time dynamic games via iterative and data-driven algorithms,

    B. Nortmann, A. Monti, M. Sassano, and T. Mylvaganam, “Nash equilibria for linear quadratic discrete-time dynamic games via iterative and data-driven algorithms,”IEEE Transactions on Automatic Control, vol. 69, no. 10, pp. 6561–6575, 2024

  20. [20]

    Improved algorithms for linear stochastic bandits,

    Y . Abbasi-Yadkori, D. P´al, and C. Szepesv´ari, “Improved algorithms for linear stochastic bandits,”Advances in neural information processing systems, vol. 24, 2011

  21. [21]

    Regret bounds for the adaptive control of linear quadratic systems,

    Y . Abbasi-Yadkori and C. Szepesv´ari, “Regret bounds for the adaptive control of linear quadratic systems,” inProceedings of the 24th annual conference on learning theory. JMLR Workshop and Conference Proceedings, 2011, pp. 1–26. APPENDIX A. Proof of Theorem 1 Proof:The proof proceeds in three steps: (i) we show that DP F(P ⋆, θ⋆) reduces to an invertibl...