Minimax Strikes Back
Pith reviewed 2026-05-24 14:18 UTC · model grok-4.3
The pith
Athénan uses Descent search and no policy to generate training data 296 times cheaper than Polygames while matching performance with far less compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Athénan reaches high-level play in several games by combining Descent minimax search with modified value targets and the absence of any policy, producing state data at approximately 296 times lower cost than Polygames and training at least seven times faster on identical resources.
What carries the argument
Descent search, a minimax-based algorithm that drives self-play without a policy network and supplies the altered learning targets used for training.
If this is right
- Training runs that previously required large GPU clusters become feasible on modest hardware.
- The same wall-clock budget yields either higher final strength or the ability to train on additional games.
- Methods that avoid policy heads and Monte Carlo rollouts can still produce strong agents when paired with efficient minimax search.
- Data-generation cost becomes a more central design variable when comparing learning algorithms for games.
Where Pith is reading between the lines
- The efficiency pattern may appear in other domains where self-play data is expensive to label, such as certain planning or control tasks.
- Altering the precise targets used for value learning could be tested independently of the search method to isolate their contribution.
- If the advantage survives controlled reimplementation, Descent-style search offers a practical route to scaling game-learning experiments without proportional increases in compute.
Load-bearing premise
The measured speed and data-cost advantages come from the algorithmic differences in search, targets, and policy use rather than from implementation details, hyper-parameter choices, or undisclosed game-specific tuning.
What would settle it
A side-by-side reimplementation of both Athénan and Polygames inside one shared codebase with matched hyper-parameters that eliminates the reported differences in data-generation cost and training speed.
Figures
read the original abstract
Deep Reinforcement Learning reaches a superhuman level of play in many complete information games. The state of the art algorithm for learning with zero knowledge is AlphaZero. We take another approach, Ath\'enan, which uses a different, Minimax-based, search algorithm called Descent, as well as different learning targets and that does not use a policy. We show that for multiple games it is much more efficient than the reimplementation of AlphaZero: Polygames. It is even competitive with Polygames when Polygames uses 100 times more GPU (at least for some games). One of the keys to the superior performance is that the cost of generating state data for training is approximately 296 times lower with Ath\'enan. With the same reasonable ressources, Ath\'enan without reinforcement heuristic is at least 7 times faster than Polygames and much more than 30 times faster with reinforcement heuristic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Athénan, a zero-knowledge learning method for complete-information games that replaces AlphaZero's MCTS with a Descent minimax search, omits the policy head, and uses altered learning targets. It reports that Athénan is substantially more efficient than Polygames (a reimplementation of AlphaZero), achieving an approximately 296-fold reduction in the cost of generating state data for training, 7–30× faster training under comparable resources, and competitive performance against Polygames even when the latter is allocated 100× more GPU resources (at least in some games).
Significance. If the reported efficiency gains are robustly attributable to the algorithmic distinctions rather than implementation or tuning differences, the result would be significant: it would provide concrete evidence that minimax-based search can outperform MCTS-based methods on resource efficiency in multi-game settings, potentially broadening the design space for self-play learning algorithms beyond the AlphaZero paradigm.
major comments (2)
- [Abstract and experimental section] Abstract and experimental section: the headline quantitative claims (296× data-generation cost ratio, 7–30× speed-up, competitiveness at 100× GPU) are stated without an experimental protocol, error bars, explicit game list, or any description of how the Polygames reimplementation was configured or optimized (hyper-parameters, batch sizes, move-generation caching, etc.). This leaves open whether the reported ratios arise from the listed algorithmic changes (Descent search, absence of policy head, altered targets) or from unstated engineering choices.
- [Experimental section] No derivation, pseudocode, or precise definition is supplied for the 'cost of generating state data' metric that yields the 296× figure; without this, it is impossible to assess whether the metric fairly compares the two systems or incorporates game-specific implementation speed-ups.
minor comments (2)
- [Abstract] The abstract refers to 'reinforcement heuristic' without defining the term or indicating where it is introduced in the method.
- [Experimental section] No mention of statistical significance testing or variance across random seeds for the speed and performance comparisons.
Simulated Author's Rebuttal
Thank you for the referee's insightful comments. We agree that the manuscript would benefit from more detailed descriptions of the experimental setup and metrics. Below we respond to each major comment and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Abstract and experimental section] Abstract and experimental section: the headline quantitative claims (296× data-generation cost ratio, 7–30× speed-up, competitiveness at 100× GPU) are stated without an experimental protocol, error bars, explicit game list, or any description of how the Polygames reimplementation was configured or optimized (hyper-parameters, batch sizes, move-generation caching, etc.). This leaves open whether the reported ratios arise from the listed algorithmic changes (Descent search, absence of policy head, altered targets) or from unstated engineering choices.
Authors: We thank the referee for pointing this out. The current manuscript indeed presents the quantitative results in the abstract and experimental section without sufficient accompanying details on the protocol. In the revised version, we will expand the experimental section to include: the full list of games tested, error bars or variance measures for the reported ratios, a step-by-step description of the experimental protocol, and comprehensive details on the Polygames reimplementation including all hyperparameters, batch sizes, and any optimizations such as move-generation caching. This will allow readers to better assess the source of the efficiency gains. revision: yes
-
Referee: [Experimental section] No derivation, pseudocode, or precise definition is supplied for the 'cost of generating state data' metric that yields the 296× figure; without this, it is impossible to assess whether the metric fairly compares the two systems or incorporates game-specific implementation speed-ups.
Authors: We agree that a precise definition and derivation of the 'cost of generating state data' metric is essential for reproducibility and fair comparison. We will add this to the revised manuscript, including a mathematical definition, pseudocode for how it is computed, and an explanation of why it is a relevant and fair metric that does not rely on game-specific implementation tricks but rather on the fundamental differences in search and learning. revision: yes
Circularity Check
No circularity: efficiency claims are direct empirical measurements with no derivation chain or self-referential inputs
full rationale
The paper reports observed performance ratios (296x data cost, 7-30x speed) between Athénan (Descent search, no policy head) and a Polygames reimplementation of AlphaZero. These are presented as experimental outcomes rather than quantities derived from equations, fitted parameters renamed as predictions, or load-bearing self-citations. No mathematical derivation chain exists in the provided text; the central claims rest on direct runtime and data-generation measurements, which are externally falsifiable and do not reduce to the paper's own inputs by construction. This is the normal case of an empirical comparison paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
, " * write output.state after.block = add.period write newline
ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn issn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47(2-3): 235--256
work page 2002
-
[4]
Browne, C.; Powley, E.; Whitehouse, D.; Lucas, S.; Cowling, P.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A Survey of M onte C arlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1): 1--43
work page 2012
-
[5]
Browne, C.; Stephenson, M.; Piette, \'E .; and Soemers, D. J. 2019. A Practical Introduction to the Ludii General Game System. Advances in Computer Games. Springer
work page 2019
-
[6]
Cazenave, T. 2015. Generalized Rapid Action Value Estimation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015 , 754--760
work page 2015
-
[7]
Cazenave, T. 2016. Playout policy adaptation with move features. Theor. Comput. Sci. 644: 43--52
work page 2016
-
[8]
Cazenave, T.; Chen, Y.-C.; Chen, G.-W.; Chen, S.-Y.; Chiu, X.-D.; Dehos, J.; Elsa, M.; Gong, Q.; Hu, H.; Khalidov, V.; Cheng-Ling, L.; Lin, H.-I.; Lin, Y.-J.; Martinet, X.; Mella, V.; Rapin, J.; Roziere, B.; Synnaeve, G.; Teytaud, F.; Teytaud, O.; Ye, S.-C.; Ye, Y.-J.; Yen, S.-J.; and Zagoruyko, S. 2020. Polygames: Improved Zero Learning. ICGA Journal 42(4)
work page 2020
-
[9]
Cazenave, T.; and Saffidine, A. 2009. Utilisation de la recherche arborescente Monte-Carlo au Hex. Revue d'Intelligence Artificielle 23(2-3): 183--202
work page 2009
-
[10]
Chaslot, G. M.-B.; Winands, M. H.; and van Den Herik, H. J. 2008. Parallel monte-carlo tree search. In International Conference on Computers and Games, 60--71. Springer
work page 2008
- [11]
-
[12]
Coulom, R. 2006. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In Computers and Games, volume 4630 of Lecture Notes in Computer Science, 72--83. Springer
work page 2006
-
[13]
Emslie, R. 2019. Galvanise Zero. https://github.com/richemslie/galvanise_zero
work page 2019
-
[14]
Fink, W. 1982. An Enhancement to the Iterative, Alpha-Beta, Minimax Search Procedure. ICGA Journal 5(1): 34--35
work page 1982
-
[15]
Gelly, S.; and Silver, D. 2011. Monte-Carlo tree search and rapid action value estimation in computer Go . Artif. Intell. 175(11): 1856--1875
work page 2011
-
[16]
Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In The Fourteenth International Conference on Artificial Intelligence and Statistics, 315--323
work page 2011
-
[17]
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, 770--778
work page 2016
-
[18]
B.; M \"u ller, M.; and Pawlewicz, J
Huang, S.-C.; Arneson, B.; Hayward, R. B.; M \"u ller, M.; and Pawlewicz, J. 2013. MoHex 2.0: a pattern-based MCTS Hex player. In International Conference on Computers and Games, 60--71. Springer
work page 2013
-
[19]
Knuth, D. E.; and Moore, R. W. 1975. An Analysis of Alpha-Beta Pruning. Artif. Intell. 6(4): 293--326. doi:10.1016/0004-3702(75)90019-3. ://dx.doi.org/10.1016/0004-3702(75)90019-3
-
[20]
Kocsis, L.; and Szepesv\'ari, C. 2006. Bandit based M onte- C arlo planning. In 17th European Conference on Machine Learning (ECML'06), volume 4212 of LNCS, 282--293. Springer
work page 2006
-
[21]
Korf, R. E.; and Chickering, D. M. 1996. Best-first minimax search. Artificial intelligence 84(1-2): 299--337
work page 1996
-
[22]
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. nature 521(7553): 436--444
work page 2015
-
[23]
Marsland, T. A. 1987. Computer chess methods. Encyclopedia of Artificial Intelligence 1: 159--171
work page 1987
-
[24]
Pascutto, G.-C. 2017. Leela Zero. https://github.com/leela-zero/leela-zero
work page 2017
-
[25]
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tre...
work page 2016
-
[26]
Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419): 1140--1144
work page 2018
-
[27]
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of go without human knowledge. nature 550(7676): 354--359
work page 2017
- [28]
-
[29]
Wang, Y.; and Gelly, S. 2007. Modifications of UCT and sequence-like simulations for Monte-Carlo Go. In 2007 IEEE Symposium on Computational Intelligence and Games, 175--182. IEEE
work page 2007
- [30]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.