Beyond Pessimism: Offline Learning in KL-regularized Games
Pith reviewed 2026-05-11 00:54 UTC · model grok-4.3
The pith
A new algorithm achieves the first pessimism-free offline learning guarantee for KL-regularized zero-sum games at O(1/n) rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields the first pessimism-free offline learning guarantee for KL-regularized games, with a fast O(1/n) sample complexity bound. We further propose an efficient self-play policy optimization algorithm that replaces exact equilibrium computation with iterative KL-regularized policy updates, and prove that its last iterate preserves the same pessimism-free statistical guarantee up to a controlled optimization error.
What carries the argument
Smoothness of KL-regularized best responses combined with the stability property of the Nash equilibrium induced by skew symmetry, which together control distribution shift directly without pessimistic estimates.
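In schematic form, assuming the standard Gibbs form of the KL-regularized best response at temperature η (a sketch for orientation only; the paper's exact operators, norms, and constants may differ), the two ingredients are
$$\pi_f(a\mid x)\;\propto\;\pi_{\mathrm{ref}}(a\mid x)\,e^{\eta f(x,a)},\qquad \|\pi_f(\cdot\mid x)-\pi_{f'}(\cdot\mid x)\|_1\;\le\;2\eta\,\|f(x,\cdot)-f'(x,\cdot)\|_\infty\quad\text{(smoothness)},$$
$$\langle F(\pi)-F(\pi'),\,\pi-\pi'\rangle\;\ge\;\tfrac{1}{2\eta}\,\|\pi-\pi'\|_1^2\quad\text{(stability: skew symmetry cancels the bilinear payoff term, leaving the strongly convex KL part)}.$$
Combining the equilibrium conditions of the plug-in solution $\hat\pi$ and the true regularized Nash $\pi^\star$ with the stability inequality gives $\|\hat\pi-\pi^\star\|_1\le 2\eta\,\|\hat F-F\|_\infty$, so a value gap that is quadratic in this distance scales with the squared estimation error: $O(1/n)$ rather than $O(1/\sqrt n)$.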
If this is right
- The algorithm provides the first pessimism-free offline learning guarantee for KL-regularized games.
- It attains a fast O(1/n) sample complexity bound.
- An efficient self-play procedure preserves the statistical guarantee up to controlled optimization error.
- Pessimistic value estimation is unnecessary in this regularized game setting.
Where Pith is reading between the lines
- Regularization may sometimes substitute for pessimism when handling distribution shift in other offline multi-agent problems.
- The same smoothness and stability arguments could be tested for faster rates in regularized variants of cooperative or general-sum games.
- Practitioners might obtain more efficient training from logged game data by adopting KL regularization and this approach.
Load-bearing premise
The smoothness of KL-regularized best responses and the stability property of the Nash equilibrium induced by skew symmetry must hold to control distribution shift without pessimism.
What would settle it
An empirical test on offline data from KL-regularized games in which the algorithm's observed error decays at only O(1/sqrt(n)), or in which learned policies degrade without added pessimism, would contradict the claim; a sketch of such a rate check follows.
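A minimal sketch of such a rate check on a toy KL-regularized zero-sum matrix game (all names, sizes, and constants below are illustrative choices, not the paper's setup; the plug-in equilibrium is computed by damped Gibbs best responses, which contract for small η):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.2                                  # temperature: the objective uses (1/eta) * KL penalties
A = rng.uniform(0.0, 1.0, size=(3, 3))     # true payoff matrix (player 1 maximizes, player 2 minimizes)
p_ref = np.full(3, 1.0 / 3)                # reference policies; also the offline behavior policies
q_ref = np.full(3, 1.0 / 3)

def gibbs(ref, logits):
    """KL-regularized best response: proportional to ref * exp(logits)."""
    z = np.log(ref) + logits
    z -= z.max()
    w = np.exp(z)
    return w / w.sum()

def solve_equilibrium(M, iters=3000):
    """Damped simultaneous Gibbs best responses; a contraction for small eta."""
    p, q = p_ref.copy(), q_ref.copy()
    for _ in range(iters):
        p_new = gibbs(p_ref, eta * (M @ q))
        q_new = gibbs(q_ref, -eta * (M.T @ p))
        p, q = 0.5 * (p + p_new), 0.5 * (q + q_new)
    return p, q

def kl(p, ref):
    return float(np.sum(p * np.log(p / ref)))

def duality_gap(p, q):
    """Regularized exploitability of (p, q) in the true game; zero at the regularized Nash."""
    v1 = np.log(p_ref @ np.exp(eta * (A @ q))) / eta + kl(q, q_ref) / eta
    v2 = -np.log(q_ref @ np.exp(-eta * (A.T @ p))) / eta - kl(p, p_ref) / eta
    return v1 - v2

def plug_in_gap(n):
    """Draw n offline samples under uniform behavior, estimate A, re-solve, measure the gap."""
    a = rng.integers(0, 3, size=n)
    b = rng.integers(0, 3, size=n)
    r = A[a, b] + 0.1 * rng.standard_normal(n)
    sums, counts = np.zeros((3, 3)), np.zeros((3, 3))
    np.add.at(sums, (a, b), r)
    np.add.at(counts, (a, b), 1)
    A_hat = np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)  # 0.5 = prior for unseen cells
    return duality_gap(*solve_equilibrium(A_hat))

ns = [250, 500, 1000, 2000, 4000, 8000]
gaps = [np.mean([plug_in_gap(n) for _ in range(20)]) for n in ns]
slope = np.polyfit(np.log(ns), np.log(gaps), 1)[0]
print("empirical rate exponent:", round(slope, 2))
```

A fitted exponent near −1 over this range would be consistent with the claimed Õ(1/n) guarantee; an exponent near −0.5 would point back toward the pessimistic-baseline rate. The regularized duality gap is used as the error metric because it is zero exactly at the regularized Nash equilibrium and requires no knowledge of the true equilibrium beyond the true payoff matrix.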
Original abstract
We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized with respect to a fixed reference policy through KL regularization. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields, to our knowledge, the first pessimism-free offline learning guarantee for KL-regularized games, with a fast $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound. We further propose an efficient self-play policy optimization algorithm that replaces exact equilibrium computation with iterative KL-regularized policy updates, and prove that its last iterate preserves the same pessimism-free statistical guarantee up to a controlled optimization error.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a pessimism-free offline learning algorithm for KL-regularized two-player zero-sum games. It relies on the smoothness of KL-regularized best-response maps together with a stability property of the regularized Nash equilibrium induced by skew symmetry of the payoff to obtain an O(1/n) sample-complexity guarantee without pessimistic value corrections. The work also introduces an efficient self-play procedure that replaces exact equilibrium computation with iterative KL-regularized updates and proves that the last iterate retains the same statistical guarantee up to a controlled optimization error.
Significance. If the central claims hold, the result would be significant: it supplies the first pessimism-free offline guarantee for this class of games and improves the statistical rate from the typical O(1/sqrt(n)) to O(1/n). The self-play algorithm adds practical value by avoiding exact equilibrium solves. The paper states that proofs exist for the key bounds, which is a positive feature, but the fast rate hinges on the stability property providing a sufficiently strong global contraction.
major comments (3)
- [Abstract and §3] Abstract and §3 (main analytical framework): the O(1/n) bound is obtained by claiming that smoothness plus the skew-symmetry-induced stability cancels the linear distribution-shift term. The stress-test concern is valid here: if stability is only local around the regularized Nash, the argument implicitly requires an unstated concentrability condition on the offline data distribution. The manuscript must explicitly delineate the region in which the quadratic contraction holds and confirm that no coverage assumption is hidden in the global bound.
- [Theorem 1] Theorem 1 (pessimism-free sample-complexity result): the derivation reduces the statistical error via the stability property to achieve the claimed 1/n rate. Because the full expansion of the smoothness and stability steps is not reproduced in the provided text, it is impossible to verify whether the linear error terms are indeed squared or whether an extra 1/sqrt(n) factor remains after plugging in the estimation error. A complete, self-contained proof sketch or appendix derivation is required.
- [§5] §5 (self-play algorithm and last-iterate guarantee): the claim that the iterative KL-regularized updates preserve the pessimism-free O(1/n) guarantee up to optimization error is load-bearing for the practical contribution. The optimization-error term must be shown to be o(1/n) or absorbed without degrading the overall rate; otherwise the end-to-end bound reverts to the slower rate of prior methods.
minor comments (2)
- [Preliminaries] Notation for the KL-regularized best-response operator could be introduced earlier and used consistently to improve readability of the smoothness arguments.
- [Related work] The related-work discussion of pessimistic offline RL methods is brief; adding one or two sentences contrasting the exact assumptions avoided here would help readers situate the contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help improve the clarity of our contributions. We address each major comment below, providing clarifications and committing to revisions where appropriate.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (main analytical framework): the O(1/n) bound is obtained by claiming that smoothness plus the skew-symmetry-induced stability cancels the linear distribution-shift term. The stress-test concern is valid here: if stability is only local around the regularized Nash, the argument implicitly requires an unstated concentrability condition on the offline data distribution. The manuscript must explicitly delineate the region in which the quadratic contraction holds and confirm that no coverage assumption is hidden in the global bound.
Authors: The stability property derived from skew-symmetry of the payoff matrix holds globally for the KL-regularized equilibrium. This global contraction, combined with the smoothness of the best-response maps, allows the linear distribution-shift terms to be squared, yielding the O(1/n) rate without requiring additional concentrability conditions beyond those implicit in the offline dataset. We will revise §3 to include an explicit statement delineating that the quadratic contraction is global and that no hidden coverage assumption is present in the bound. revision: yes
-
Referee: [Theorem 1] Theorem 1 (pessimism-free sample-complexity result): the derivation reduces the statistical error via the stability property to achieve the claimed 1/n rate. Because the full expansion of the smoothness and stability steps is not reproduced in the provided text, it is impossible to verify whether the linear error terms are indeed squared or whether an extra 1/sqrt(n) factor remains after plugging in the estimation error. A complete, self-contained proof sketch or appendix derivation is required.
Authors: We agree that a more self-contained derivation would aid verification. The full proof is in the appendix, but we will add a detailed proof sketch to the main text following Theorem 1. This sketch will expand the key steps: applying smoothness to bound the best-response deviation, then using the global stability to square the linear terms from distribution shift and estimation error, confirming no residual 1/sqrt(n) factor remains. revision: yes
-
Referee: [§5] §5 (self-play algorithm and last-iterate guarantee): the claim that the iterative KL-regularized updates preserve the pessimism-free O(1/n) guarantee up to optimization error is load-bearing for the practical contribution. The optimization-error term must be shown to be o(1/n) or absorbed without degrading the overall rate; otherwise the end-to-end bound reverts to the slower rate of prior methods.
Authors: In §5, we bound the optimization error of the iterative updates by O(1/T), where T is the number of iterations. Choosing T = Θ(n) makes the optimization error O(1/n), matching the statistical term, so the leading rate is preserved and the end-to-end bound remains O(1/n). We will revise §5 to make this explicit, including the choice of T and the resulting end-to-end guarantee. revision: yes
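Written out schematically, with a hypothetical statistical constant C and an optimization term of order η²/T (matching the appendix excerpt quoted in the reference graph below), the end-to-end bound the authors describe would read
$$\mathrm{gap}\big(\pi^{(T)}\big)\;\lesssim\;\frac{C}{n}\;+\;\frac{16\eta^{2}}{T+1},\qquad T=\Theta(n)\;\Longrightarrow\;\mathrm{gap}\big(\pi^{(T)}\big)=\widetilde{\mathcal{O}}(1/n).$$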
Circularity Check
No circularity: derivation relies on external properties of KL-regularization and skew symmetry
Full rationale
The provided abstract and outline present the O(1/n) guarantee as following from smoothness of KL-regularized best responses and a stability property induced by skew symmetry of the zero-sum payoff. These are invoked as standard or assumable properties rather than being defined in terms of the target bound or fitted from the offline data. No equations, self-citations, or steps in the visible text reduce the claimed prediction to a fitted parameter or to the result itself by construction. For the purpose of this circularity check, the argument therefore rests on properties external to the claimed result rather than on the result itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: KL-regularized best responses are smooth
- domain assumption: the Nash equilibrium in KL-regularized games has a stability property induced by skew symmetry
Forward citations
Cited by 8 Pith papers
-
Offline Two-Player Zero-Sum Markov Games with KL Regularization
KL regularization enables Õ(1/n) convergence for offline Nash equilibria in zero-sum Markov games under unilateral concentrability via the ROSE framework and SOS-MD algorithm.
-
Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning
Risk-sensitive preference games retain monotonicity via translation-invariant risk measures, enabling convergent self-play algorithms with stability bounds and empirical robustness across data strata.
-
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...
-
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
-
Fast Rates in $\alpha$-Potential Games via Regularized Mirror Descent
OPMD achieves the first fast Õ(1/n) rate for offline Nash equilibrium learning in α-potential games via a new reference-anchored coverage framework.
-
Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning
Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.
-
On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
Offline KL-regularized MABs require sample complexity scaling as O(η S A C^π*/ε) for large regularization and Ω(S A C^π*/ε²) for small regularization, with matching lower bounds across the full range.
-
Pessimism-Free Offline Learning in General-Sum Games via KL Regularization
KL regularization enables pessimism-free offline learning in general-sum games by recovering regularized Nash equilibria at rate O(1/n) via GANE and converging to coarse correlated equilibria at O(1/sqrt(n) + 1/T) via GAMD.
Reference graph
Works this paper leans on
-
[1]
The operator $F_{g,x}$ is strongly monotone due to the $\eta^{-1}$-strong convexity of the KL regularization. Since the linear part $M_{g,x}$ is skew-symmetric (i.e., $\langle M_{g,x}v, v\rangle = 0$ for any vector $v$), we have:
$$\langle F_{g,x}(\pi) - F_{g,x}(\pi'),\, \pi - \pi'\rangle = \eta^{-1}\sum_{i=1}^{2}\langle \nabla \mathrm{KL}_i(\pi_i) - \nabla \mathrm{KL}_i(\pi'_i),\, \pi_i - \pi'_i\rangle \ge \frac{1}{2\eta}\,\|\pi - \pi'\|_1^2.$$
Applying this to $\hat\pi_x$ and $\pi^\star_x$ with the operator $F_{g^\star,x}$: $\frac{1}{2\eta}\,\|\hat\pi_x - \pi^\star_x\|_1^2\ \ldots$
2024
-
[2]
Pointwise Analysis. Fix an arbitrary context $x$. Let $V_t(x) = \mathrm{KL}\big(\hat\pi_1(\cdot\mid x)\,\|\,\pi^{(t)}_1(\cdot\mid x)\big) + \mathrm{KL}\big(\hat\pi_2(\cdot\mid x)\,\|\,\pi^{(t)}_2(\cdot\mid x)\big)$. We verify that the update rules in Algorithm 2 correspond to the OMD update in Lemma 12 with the following directions $\delta_{i,t}(x,a)$:
$$\delta_{1,t}(x,a) = \alpha_t \hat f^{(t)}_1(x,a) - \alpha_t\eta^{-1}\left(\log\frac{\pi^{(t)}_1(a\mid x)}{\pi^{\mathrm{ref}}_1(a\mid x)} + 1\right), \qquad \delta_{2,t}(x,a) = -\alpha_t \hat f^{(t)}_2(x,a) - \alpha_t\eta^{-1}\ldots$$
-
[3]
Solving the Recursion. With $\alpha_t = \frac{2\eta}{t+2}$, Eq. (8) simplifies to:
$$V_{t+1}(x) \le \frac{t}{t+2}\,V_t(x) + \frac{16\eta^2}{(t+2)^2}.$$
We prove $V_t(x) \le \frac{16\eta^2}{t+1}$ by induction. For $t = 0$, $V_1(x) \le \frac{16\eta^2}{4} = 4\eta^2 \le 8\eta^2$. Assume $V_t(x) \le \frac{16\eta^2}{t+1}$. Then:
$$V_{t+1}(x) \le \frac{t}{t+2}\cdot\frac{16\eta^2}{t+1} + \frac{16\eta^2}{(t+2)^2} = 16\eta^2\,\frac{t(t+2) + (t+1)}{(t+1)(t+2)^2} = 16\eta^2\,\frac{t^2 + 3t + 1}{(t+1)(t+2)^2}.$$
Since $t^2 + 3t + 1 \le (t+1)(t+2)$, ...
-
[4]
Aggregation. Taking the expectation over $x \sim \rho$ yields $\mathbb{E}_{x\sim\rho}[V_T(x)] \le \frac{16\eta^2}{T+1}$.
C.2 Proof of Lemma 7. Proof. The stability result for the regularized Nash equilibrium established in Lemma 7 follows directly from the perturbation analysis and concentrability arguments developed in Section 3. Specifically, from the stability bound established in Lemma 4, the sq...
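A quick numeric check of the recursion quoted in excerpt [3] above, assuming the recursion, step size, and base case are exactly as displayed (illustrative only):

```python
# Iterate the recursion V_{t+1} = t/(t+2) * V_t + 16*eta^2/(t+2)^2 with equality
# (the worst case) from the base case V_1 <= 4*eta^2, and verify the claimed
# bound V_t <= 16*eta^2/(t+1).
eta = 0.7
V = 4 * eta ** 2                     # V_1 from the base case
for t in range(1, 10_000):
    assert V <= 16 * eta ** 2 / (t + 1) + 1e-12, f"bound fails at t={t}"
    V = t / (t + 2) * V + 16 * eta ** 2 / (t + 2) ** 2
print("claimed bound holds along the worst-case recursion for t = 1 .. 9999")
```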