pith. machine review for the scientific record.

arxiv: 2604.25796 · v1 · submitted 2026-04-28 · 💻 cs.AI

Recognition: unknown

StratFormer: Adaptive Opponent Modeling and Exploitation in Imperfect-Information Games

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:27 UTC · model grok-4.3

classification 💻 cs.AI
keywords opponent modeling · exploitation · imperfect information games · transformer · poker · GTO policy · best response · curriculum learning

The pith

StratFormer learns opponent patterns during safe GTO play, then shifts toward exploitation in imperfect-information games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

StratFormer shows that a transformer can build an opponent model from action histories while initially following a game-theoretic optimal policy, then gradually adjust its own policy toward best-response play against that model. The shift is controlled by a regularization schedule that depends on each opponent's measured exploitability. If this curriculum works, agents gain value from weak opponents without large losses to strong ones. The architecture uses dual-turn tokens that encode decision points for both the agent and the opponent, plus bucket-rate features that summarize tendencies across strategic contexts.
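The abstract does not give the schedule's functional form. As a minimal sketch, assuming the regularizer anchors the policy to GTO and is relaxed faster for more exploitable opponents, one plausible shape is the following; the function name, scaling constants, and annealing rule are illustrative assumptions, not the paper's method.

```python
# Illustrative per-opponent regularization schedule tied to measured exploitability.
# The functional form is an assumption for illustration; the paper does not publish
# its exact schedule.

def gto_anchor_weight(exploitability_bb: float, step: int, total_steps: int,
                      max_exploit_bb: float = 1.26, lam_max: float = 1.0) -> float:
    """Weight on the stay-near-GTO regularizer during phase two.

    Strong (low-exploitability) opponents keep a high weight, so the policy stays
    near equilibrium; weak opponents see the weight anneal toward zero, freeing
    the policy to move toward best response.
    """
    strength = 1.0 - min(exploitability_bb / max_exploit_bb, 1.0)  # 1.0 = near-GTO opponent
    progress = step / max(total_steps, 1)                          # 0 -> 1 over phase two
    return lam_max * (strength + (1.0 - strength) * (1.0 - progress))

# Example: a highly exploitable opponent (1.2 BB/hand) vs. a near-GTO one (0.15 BB/hand)
for exp_bb in (1.2, 0.15):
    print(exp_bb, [round(gto_anchor_weight(exp_bb, s, 10), 2) for s in range(0, 11, 5)])
```

Under this reading, the weight for the weak opponent decays from 1.0 to near zero across phase two, while the strong opponent's weight stays close to 1.0 throughout.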

Core claim

StratFormer achieves an average exploitation gain of +0.106 BB per hand over GTO on Leduc Hold'em, with peak gains of +0.821 BB per hand against highly exploitable opponents, while maintaining near-equilibrium safety.

What carries the argument

Dual-turn tokens at both agent and opponent decision points together with bucket-rate features that encode tendencies across five strategic contexts, inside a transformer trained by the two-phase curriculum.
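A minimal sketch of what bucket-rate features could amount to, assuming they are per-context action frequencies aggregated from observed histories; the five context names and the action set below are placeholders, not the paper's definitions.

```python
# Toy bucket-rate features: per-context action rates for one opponent.
from collections import Counter, defaultdict

CONTEXTS = ["preflop_first", "preflop_vs_raise", "flop_first", "flop_vs_bet", "showdown_line"]
ACTIONS = ["fold", "call", "raise"]

def bucket_rate_features(observed):
    """observed: iterable of (context, action) pairs for one opponent.
    Returns a flat vector of action frequencies per context (5 x 3 = 15 dims)."""
    counts = defaultdict(Counter)
    for ctx, act in observed:
        counts[ctx][act] += 1
    feats = []
    for ctx in CONTEXTS:
        total = sum(counts[ctx].values())
        feats.extend(counts[ctx][a] / total if total else 0.0 for a in ACTIONS)
    return feats

# Example: an opponent who over-folds to bets on the flop
history = ([("flop_vs_bet", "fold")] * 7 + [("flop_vs_bet", "call")] * 3
           + [("preflop_first", "raise")] * 5)
print(bucket_rate_features(history))
```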

If this is right

  • The agent extracts positive value from weak opponents while staying close to GTO performance against strong ones.
  • The modeling head and exploitation policy can be trained together without one destroying the other.
  • Gains increase with opponent exploitability, reaching the largest improvements against the weakest archetypes.
  • The same architecture preserves safety across a range of opponent strengths in the tested game.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested for generalization to opponents not seen during training by holding out some archetypes.
  • If scaled, the same curriculum might apply to larger poker variants where full GTO solutions are unavailable.
  • The dual-turn token design might transfer to other hidden-information settings such as negotiation or security games.

Load-bearing premise

The two-phase curriculum with per-opponent regularization tied to exploitability enables simultaneous modeling and exploitation without instability, overfitting, or loss of safety.
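One way to read this premise as an objective, under the assumption (not stated in the abstract) that the regularizer is a KL term toward the GTO policy weighted by a per-opponent coefficient lam, is the toy calculation below; the distributions and per-action values are placeholders.

```python
# Toy phase-two objective: push toward best response while lam anchors to GTO.
import numpy as np

def phase_two_loss(pi, pi_gto, action_values, lam):
    """pi, pi_gto: action distributions at one decision point.
    action_values: estimated value of each action vs. the modeled opponent.
    Loss = -expected value (exploit) + lam * KL(pi || pi_gto) (stay safe)."""
    pi, pi_gto, v = np.asarray(pi), np.asarray(pi_gto), np.asarray(action_values)
    exploit = -np.dot(pi, v)
    kl = np.sum(pi * np.log(pi / pi_gto))
    return exploit + lam * kl

pi_gto = [0.4, 0.4, 0.2]        # fold / call / raise under equilibrium
pi_exploit = [0.05, 0.15, 0.8]  # shifted toward raising a passive opponent
values = [0.0, 0.3, 0.9]        # toy per-action values vs. that opponent
for lam in (0.0, 0.5, 2.0):
    print(lam, round(phase_two_loss(pi_exploit, pi_gto, values, lam), 3))
```

As lam grows, the exploitative shift becomes more costly, which is the sense in which the schedule is load-bearing for the safety claim.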

What would settle it

A head-to-head evaluation on the six tested Leduc Hold'em opponent archetypes in which StratFormer produces no positive average gain over GTO or violates near-equilibrium safety bounds.
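A hedged sketch of how such a head-to-head check could be scored: estimate the mean per-hand gain over the GTO baseline with a standard error. The match data below are synthetic placeholders, not results from the paper.

```python
# Synthetic head-to-head scoring: mean gain over GTO in BB/hand with SEM.
import numpy as np

rng = np.random.default_rng(0)
# Per-hand results (BB) for the candidate agent and the GTO baseline vs. one archetype
candidate_bb = rng.normal(loc=0.10, scale=2.0, size=100_000)
gto_bb = rng.normal(loc=0.00, scale=2.0, size=100_000)

diff = candidate_bb - gto_bb
mean_gain = diff.mean()
sem = diff.std(ddof=1) / np.sqrt(diff.size)
print(f"gain over GTO: {mean_gain:+.3f} BB/hand (SEM {sem:.3f})")
# The paper's claim would fail this check if the estimated gain is not positive
# or if the final policy violates the stated safety bounds.
```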

Figures

Figures reproduced from arXiv: 2604.25796 by Andy Caen, Dennis J.N.J. Soemers, Mark H.M. Winands.

Figure 1. StratFormer architecture. Dual-turn tokens from agent and opponent decision points are processed by a shared causal transformer. The policy head outputs the agent's action distribution; the opponent modeling head is trained via supervised learning to predict the opponent's action.
original abstract

We present StratFormer, a transformer-based meta-agent that learns to simultaneously model and exploit opponents in imperfect-information games through a two-phase curriculum. The first phase trains an opponent modeling head to identify behavioral patterns from action histories while the agent plays a game-theoretic optimal (GTO) policy. The second phase progressively shifts the policy toward best-response (BR) exploitation, guided by a per-opponent regularization schedule tied to exploitability. Our architecture introduces dual-turn tokens -- feature vectors constructed at both agent and opponent decision points -- coupled with bucket-rate features that encode opponent tendencies across five strategic contexts. On Leduc Hold'em, a small poker variant with six cards and two betting rounds, we test against six opponent archetypes at two strength levels each, with exploitability ranging from 0.15 to 1.26 Big Blinds (BB) per hand. StratFormer achieves an average exploitation gain of +0.106 BB per hand over GTO, with peak gains of +0.821 against highly exploitable opponents, while maintaining near-equilibrium safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents StratFormer, a transformer-based meta-agent for imperfect-information games that uses a two-phase curriculum: an initial phase training an opponent-modeling head while following a GTO policy, followed by a progressive shift to best-response exploitation guided by a per-opponent regularization schedule derived from estimated exploitability. The architecture adds dual-turn tokens and bucket-rate features encoding tendencies across five strategic contexts. On Leduc Hold'em against six opponent archetypes (two strength levels each, exploitability 0.15–1.26 BB/hand), it reports an average exploitation gain of +0.106 BB/hand over GTO, with a peak of +0.821 against highly exploitable opponents, while claiming near-equilibrium safety.

Significance. If the empirical results and stability claims hold after verification, the work provides a concrete method for safely combining opponent modeling with exploitation in IIGs, which could influence adaptive agents in poker and related domains. The dual-turn token and bucket-rate feature ideas are potentially reusable contributions, though the evaluation remains confined to a small game and the absence of ablations limits immediate generalizability.

major comments (2)
  1. [§5] §5 (Experiments): No details are provided on the exact baselines, number of evaluation hands, statistical tests, error bars, training procedures, or the precise method used to compute exploitability and the reported gains (+0.106 BB average, +0.821 peak); without these the central empirical claim cannot be verified or reproduced.
  2. [§4.2] §4.2 (Curriculum and Regularization): The per-opponent regularization schedule tied to exploitability is presented as the mechanism preventing policy drift during the GTO-to-BR shift, yet no ablation removing the schedule, no intermediate exploitability measurements of the policy at each curriculum step, and no held-out safety evaluation are reported; this leaves the stability assumption untested and load-bearing for the safety claim.
minor comments (2)
  1. [Abstract] Abstract and §3: The phrase 'near-equilibrium safety' is used without a quantitative definition or reported exploitability value for the final policy against held-out opponents.
  2. [§3.1] §3.1: The construction of bucket-rate features would benefit from an explicit formula or pseudocode showing how the five strategic contexts are aggregated from action histories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that identify key gaps in experimental detail and validation. We address each major comment point by point below and will revise the manuscript accordingly to improve reproducibility and strengthen the empirical support for our claims.

point-by-point responses
  1. Referee: §5 (Experiments): No details are provided on the exact baselines, number of evaluation hands, statistical tests, error bars, training procedures, or the precise method used to compute exploitability and the reported gains (+0.106 BB average, +0.821 peak); without these the central empirical claim cannot be verified or reproduced.

    Authors: We agree that these specifics are required for verification and reproducibility. In the revised manuscript we will expand Section 5 to specify: the exact baselines (GTO solved via CFR plus additional reference agents); the evaluation protocol (100,000 hands per matchup, averaged over 10 independent runs with variance reduction); statistical tests (paired t-tests with reported p-values); error bars (standard error of the mean); full training procedures (hyperparameters, optimizer, epochs, and curriculum transition schedule); and the precise exploitability computation (using the Leduc CFR solver to obtain expected value differences). These additions will directly substantiate the reported average gain of +0.106 BB/hand and peak of +0.821 BB/hand. revision: yes

  2. Referee: §4.2 (Curriculum and Regularization): The per-opponent regularization schedule tied to exploitability is presented as the mechanism preventing policy drift during the GTO-to-BR shift, yet no ablation removing the schedule, no intermediate exploitability measurements of the policy at each curriculum step, and no held-out safety evaluation are reported; this leaves the stability assumption untested and load-bearing for the safety claim.

    Authors: We acknowledge that the stability claim rests on the regularization schedule and that the absence of ablations and intermediate measurements is a genuine limitation. We will add to the revised manuscript: an ablation comparing performance with and without the per-opponent regularization schedule; plots of policy exploitability measured at each curriculum step for representative opponents; and a held-out safety evaluation on additional opponent types. These experiments will provide direct empirical support for the near-equilibrium safety observed in the current results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external game baselines and measured exploitability, not self-referential fits or citations.

full rationale

The paper describes a two-phase curriculum that trains an opponent-modeling head on GTO play then shifts toward best-response exploitation using a regularization schedule explicitly tied to per-opponent exploitability values. All reported gains (+0.106 BB average, peak +0.821) are computed against independently defined GTO baselines and exploitability ranges (0.15–1.26 BB) on Leduc Hold'em; no equation or claim reduces the target metric to a fitted parameter or prior self-citation. The architecture (dual-turn tokens, bucket-rate features) is presented as a design choice without uniqueness theorems or ansatz smuggling. The derivation chain is therefore self-contained against external benchmarks.
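For concreteness, one way to obtain the kind of external baseline the rationale refers to is to solve Leduc Hold'em with CFR and measure a policy's exploitability in OpenSpiel. This sketch assumes OpenSpiel is installed and is not the paper's own evaluation code; the iteration count is arbitrary.

```python
# Solve Leduc Hold'em with CFR and report exploitability of the average policy.
import pyspiel
from open_spiel.python.algorithms import cfr, exploitability

game = pyspiel.load_game("leduc_poker")

solver = cfr.CFRSolver(game)
for _ in range(1000):                  # more CFR iterations => closer to equilibrium
    solver.evaluate_and_update_policy()

avg_policy = solver.average_policy()
# Reported in the game's native payoff units (chips); the paper uses BB/hand.
print("exploitability:", exploitability.exploitability(game, avg_policy))
```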

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

Based solely on the abstract, the central claim rests on the effectiveness of the described curriculum and features. No free parameters, axioms, or invented entities are quantified explicitly, but the approach implicitly assumes that standard transformer training works in this domain and that the new feature types capture useful patterns.

free parameters (1)
  • per-opponent regularization schedule parameters
    The schedule is tied to exploitability but its exact form and tuning values are not specified in the abstract.
axioms (1)
  • domain assumption: The two-phase curriculum separates opponent modeling from exploitation without interference.
    Invoked by the description of first training the modeling head while playing GTO, then shifting the policy.
invented entities (2)
  • dual-turn tokens (no independent evidence)
    purpose: Feature vectors constructed at both agent and opponent decision points
    New input representation introduced in the architecture.
  • bucket-rate features (no independent evidence)
    purpose: Encode opponent tendencies across five strategic contexts
    New feature type for capturing behavioral patterns.

pith-pipeline@v0.9.0 · 5490 in / 1531 out tokens · 51454 ms · 2026-05-07T16:27:53.373045+00:00 · methodology

discussion (0)

