Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

Chun Kai Ling; Leyang Shen; Tat-Seng Chua; Xiaoyan Zhao; Yang Zhang

arXiv: 2606.19308 · v1 · pith:VZ2BWPELnew · submitted 2026-06-17 · 💻 cs.CL · cs.MA

Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua

Pith reviewed 2026-06-26 20:56 UTC · model grok-4.3

keywords: multi-agent systems, large language models, fictitious play, decision-making, stance entanglement, game theory, equilibrium seeking

The pith

MAFP lets LLM agents resolve stance entanglement by iteratively best-responding to the empirical mixture of each other's past decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that decision tasks involving mutually dependent stakeholder positions cannot be handled by dividing work across agents and instead require an equilibrium-seeking process. It models each stance as a separate agent and applies the fictitious-play rule so that every agent updates its choice to the best response against the average of the others' historical decisions. This repeated adjustment is claimed to surface weaknesses and produce joint decisions that score higher on tournament strength and robustness than either single-pass or ordinary multi-round prompting. A reader would care because many practical choices, such as competitive strategy or negotiation, involve exactly this kind of interdependence. If the updates remain consistent, the method supplies a game-theoretic mechanism for improving LLM outputs on such tasks without external solvers.

Core claim

Multi-Agent Fictitious Play (MAFP) represents each stakeholder stance as an agent and formulates the decision problem as an equilibrium-seeking process in which every agent repeatedly replaces its decision with the best response to the empirical mixture of all other agents' previous decisions, thereby addressing stance entanglement and yielding decisions that outperform both single-round and multi-round baselines on tournament strength and robustness metrics.
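In symbols, the update described above can be written roughly as follows. The notation is assumed here for illustration and is not taken from the paper: $a_i^{t}$ is agent $i$'s decision at iteration $t$, $A_i$ its set of candidate decisions, $u_i$ its utility, $\delta_x$ a point mass on $x$, and $\bar{\sigma}_{-i}^{t}$ the empirical mixture of the other agents' decisions up to iteration $t$.

```latex
\bar{\sigma}_{-i}^{t} = \frac{1}{t} \sum_{s=1}^{t} \delta_{a_{-i}^{s}},
\qquad
a_i^{t+1} \in \arg\max_{a \in A_i} \; u_i\!\left(a, \bar{\sigma}_{-i}^{t}\right)
```

In MAFP, as described, the maximization is not computed exactly; it is approximated by prompting an LLM, which is why the consistency of that step matters.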

What carries the argument

The fictitious-play update rule, in which each agent selects its next decision as the best response to the historical average of the other agents' decisions, carried out through LLM prompting.
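For readers unfamiliar with the rule, here is a minimal sketch of classical two-player fictitious play on rock-paper-scissors, the game used in the paper's Figure 2. It illustrates the textbook procedure with an exact best response, not the paper's LLM-based implementation; the function names, starting actions, and tie-breaking order are assumptions made for this example.

```python
from collections import Counter

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(a, b):
    """Return +1 if a beats b, -1 if b beats a, 0 on a tie."""
    if a == b:
        return 0
    return 1 if BEATS[a] == b else -1

def best_response(opponent_counts):
    """Pick the action with the highest expected payoff against the
    empirical mixture implied by the opponent's action counts."""
    total = sum(opponent_counts.values())

    def expected(a):
        return sum(payoff(a, b) * n for b, n in opponent_counts.items()) / total

    return max(ACTIONS, key=expected)  # ties broken by order in ACTIONS

def fictitious_play(rounds, first_a="rock", first_b="paper"):
    """Run two-player fictitious play and return each player's
    empirical action frequencies."""
    hist_a, hist_b = Counter([first_a]), Counter([first_b])
    for _ in range(rounds - 1):
        # Both players respond simultaneously to the other's history so far.
        next_a = best_response(hist_b)
        next_b = best_response(hist_a)
        hist_a[next_a] += 1
        hist_b[next_b] += 1

    def frequencies(hist):
        return {a: hist[a] / rounds for a in ACTIONS}

    return frequencies(hist_a), frequencies(hist_b)

freq_a, freq_b = fictitious_play(3000)
print(freq_a)
print(freq_b)
```

In zero-sum games such as this one, the empirical frequencies are known to approach an equilibrium mixture, although the individual actions keep cycling. Whether the same behavior carries over when the best response is produced by an LLM prompt is exactly the open question raised under "Load-bearing premise".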

If this is right

Agents expose and correct one another's decision weaknesses through successive best-response updates.
The resulting decisions score higher on both tournament strength and robustness than single-round or multi-round baselines.
The method applies directly to tasks that require simultaneous reasoning across interdependent stances.
Decision quality improves progressively rather than in a single forward pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting-based best-response loop could be applied to negotiation or resource-allocation settings that are not framed as tournaments.
Scalability questions arise once the number of agents grows beyond the small numbers tested, because each iteration requires additional LLM calls.
If the best-response step proves reliable, the approach could be layered on top of existing multi-agent debate or voting procedures.
The framework supplies a concrete test for whether current LLMs can maintain coherent best-response behavior across multiple rounds without external stabilization.

Load-bearing premise

That LLM prompting will produce consistent best responses to the empirical mixture of past decisions and that the sequence of updates will converge to higher-quality joint decisions rather than cycle or amplify inconsistencies.

What would settle it

Running MAFP on a fixed collection of decision tasks and observing either no measurable gain in tournament strength after several iterations or repeated cycling between the same small set of decisions.
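One way to check for the cycling failure mode is to look for a repeating block at the end of an agent's decision sequence. The sketch below is a hypothetical helper, not something from the paper, and it assumes decisions have already been mapped to comparable labels (free-text LLM outputs would need some canonicalization first).

```python
def trailing_period(decisions, max_period=None):
    """Return the smallest p such that the last 2*p decisions are one block
    of length p repeated twice, or None if no such p exists.

    p == 1 means the decision has stabilized; p >= 2 indicates cycling.
    """
    n = len(decisions)
    limit = n // 2 if max_period is None else min(max_period, n // 2)
    for p in range(1, limit + 1):
        if decisions[n - p:] == decisions[n - 2 * p:n - p]:
            return p
    return None

print(trailing_period(["A", "B", "A", "B"]))  # 2: alternating between A and B
print(trailing_period(["A", "B", "C", "C"]))  # 1: settled on C
```

A period of 2 or more across many tasks, together with flat tournament strength, would be evidence against the convergence premise.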

Figures

Figures reproduced from arXiv: 2606.19308 by Chun Kai Ling, Leyang Shen, Tat-Seng Chua, Xiaoyan Zhao, Yang Zhang.

**Figure 1.** Existing MAS address execution complexity, as in software engineering or research (left), by dividing a task …

**Figure 2.** Illustration of the MAFP algorithm. Fictitious play in game theory finds equilibrium through an iteratively convergent process in which each player best responds to the empirical average of others’ past actions, here converging to the Nash equilibrium of rock–paper–scissors. Inspired by this, multi-agent fictitious play (MAFP) decomposes stances into agents and finds policies through multi-agent co-evolution: …

**Figure 3.** Per-iteration quality of policies produced by each iterative method. For each method, we run an internal tournament among its four iterations and report each iteration’s average utility against the other three. The shaded band shows the standard error of the mean.

**Figure 4.** Target-profile utility under adversarial evolution during robustness evaluation. Each curve shows a method’s per-iteration utility against an evolving attacker, averaged across scenarios. The star marks each method’s worst-case round. The shaded band shows the standard error of the mean.

**Figure 5.** Per-method results, visualized with error bars.

**Figure 6.** Prompt template for the aggregation operator.

**Figure 7.** Prompt template for the best-response operator.

**Figure 8.** Prompt template for the action model M_act.
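The quantities named in the captions of Figures 3 and 4 (average utility against the other iterations, standard error of the mean, worst-case round) can be illustrated with a small sketch. The function names and the layout of the utility matrix are assumptions made for illustration, and the numbers in the usage lines are invented; none of this is the paper's evaluation code.

```python
from math import sqrt
from statistics import mean, stdev

def average_utility_vs_others(utility):
    """utility[i][j] is policy i's utility when playing against policy j.
    Return each policy's mean utility against all other policies
    (the diagonal, i.e. self-play, is ignored)."""
    n = len(utility)
    return [mean(utility[i][j] for j in range(n) if j != i) for i in range(n)]

def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(samples) / sqrt(len(samples))

def worst_case_round(utilities_per_round):
    """Return (round_index, utility) for the round with the lowest utility."""
    idx = min(range(len(utilities_per_round)), key=utilities_per_round.__getitem__)
    return idx, utilities_per_round[idx]

# Invented toy numbers, only to show the call pattern.
toy_matrix = [
    [0.0, 2.0, 4.0],
    [1.0, 0.0, 3.0],
    [5.0, 1.0, 0.0],
]
print(average_utility_vs_others(toy_matrix))
print(standard_error([1.0, 2.0, 3.0]))
print(worst_case_round([0.6, 0.4, 0.5]))
```

Under this reading, a tournament among four iterations would use a 4×4 matrix, and the star in Figure 4 would sit at the index returned by `worst_case_round`.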

Original abstract

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MAFP applies fictitious play to LLM agents for stance entanglement, but the performance claims rest on thin experimental reporting.

The paper's main move is to treat interdependent decision tasks as a game where LLM agents represent stakeholder stances and iteratively best-respond to the empirical distribution of everyone else's past choices. This is a direct lift of fictitious play, positioned against the usual divide-and-conquer MAS setups.

It does a clean job naming stance entanglement as the distinct difficulty and showing why single-round or fixed multi-round prompting falls short when decisions are mutually dependent. The iterative update rule is a natural fit for the setting and draws on a well-understood equilibrium concept.

The evaluation claims are the weak point. The abstract states that MAFP beats baselines on tournament strength and robustness, yet supplies no task descriptions, metric definitions, iteration counts, prompting details, or variance numbers. Without those, it is impossible to judge whether the gains come from the fictitious-play structure or simply from more rounds of interaction. The assumption that LLM best responses are consistent enough to improve joint quality rather than cycle is left unexamined.

This is aimed at researchers working on LLM agents for negotiation or competitive strategy. Someone already thinking about game-theoretic MAS would find the framing useful as a starting point.

I would send it to peer review. The idea is distinct enough and the problem is real, but the methods and results sections will need substantial expansion before the claims can be assessed.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Multi-Agent Fictitious Play (MAFP), a multi-agent LLM paradigm that models stakeholder stances as agents and casts decision-making as an iterative equilibrium-seeking process based on the game-theoretic fictitious-play update rule. Each agent best-responds to the empirical mixture of other agents' historical decisions, with the goal of exposing and mitigating stance entanglement. The central empirical claim is that MAFP outperforms both single-round and multi-round baselines on tournament strength and robustness metrics across competitive decision-making tasks.

Significance. If the reported gains hold under rigorous controls, the work supplies a concrete, principle-driven mechanism for improving joint decision quality in LLM-based MAS where decisions are interdependent. It directly imports an established convergence concept from game theory rather than inventing new dynamics, which is a strength.

major comments (2)

[Abstract, §4] Abstract and §4 (Evaluation): the central claim of outperformance on tournament strength and robustness is stated without definitions of either metric, without the number of runs, without statistical tests, and without controls for prompt variability or LLM stochasticity. These omissions make the quantitative results impossible to interpret or reproduce and are load-bearing for the main contribution.
[§3] §3 (Method): the fictitious-play update is described at a high level via LLM prompting, but no concrete prompt templates, temperature settings, or stopping criteria are supplied. Without these, it is impossible to assess whether the iterative best-response step reliably produces consistent improvements or merely cycles, which directly affects the weakest assumption identified in the stress test.

minor comments (2)

[Introduction] The term 'stance entanglement' is introduced without a formal definition or citation to prior literature on interdependent decision problems.
[§3] No pseudocode or algorithmic listing of the MAFP loop is provided, which would clarify the exact sequence of best-response calls and mixture updates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of major revision. We address each major comment point by point below and will revise the manuscript to improve reproducibility and clarity.

Referee: [Abstract, §4] Abstract and §4 (Evaluation): the central claim of outperformance on tournament strength and robustness is stated without definitions of either metric, without the number of runs, without statistical tests, and without controls for prompt variability or LLM stochasticity. These omissions make the quantitative results impossible to interpret or reproduce and are load-bearing for the main contribution.

Authors: We agree that these details are essential for interpreting and reproducing the results. In the revised manuscript, we will add explicit definitions of tournament strength and robustness in the abstract and §4, report the exact number of runs, include appropriate statistical tests (such as paired t-tests or Wilcoxon signed-rank tests with p-values), and describe controls for prompt variability and LLM stochasticity (e.g., fixed random seeds, multiple prompt paraphrases, and temperature settings).

Revision: yes

Referee: [§3] §3 (Method): the fictitious-play update is described at a high level via LLM prompting, but no concrete prompt templates, temperature settings, or stopping criteria are supplied. Without these, it is impossible to assess whether the iterative best-response step reliably produces consistent improvements or merely cycles, which directly affects the weakest assumption identified in the stress test.

Authors: We acknowledge that the current description in §3 is insufficient for full reproducibility and evaluation of the update process. In the revised version, we will supply the complete prompt templates for the best-response and mixture steps, specify the LLM temperature and other generation parameters, and detail the stopping criteria (such as maximum iterations or convergence thresholds based on decision stability).

Revision: yes
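To make the requested listing concrete, the loop and the two stopping criteria mentioned in the response could look roughly like the sketch below. This is a hypothetical reconstruction, not the authors' algorithm: `aggregate` and `best_response` are placeholder callables standing in for the aggregation and best-response prompts, and `max_iters` and `patience` are assumed parameter names and defaults.

```python
from typing import Callable, Dict, List

def mafp_loop(
    initial: Dict[str, str],
    aggregate: Callable[[List[str]], str],
    best_response: Callable[[str, Dict[str, str]], str],
    max_iters: int = 4,
    patience: int = 1,
) -> Dict[str, List[str]]:
    """Iterate best responses until decisions stabilize or max_iters is hit.

    initial       -- one starting decision per stance (agent name -> text)
    aggregate     -- placeholder for the aggregation prompt: summarizes an
                     agent's past decisions into one description
    best_response -- placeholder for the best-response prompt: given an agent
                     name and the other agents' summaries, returns that
                     agent's next decision
    """
    history = {name: [decision] for name, decision in initial.items()}
    stable_rounds = 0
    for _ in range(max_iters):
        # Summarize every agent's history once, before anyone updates.
        mixtures = {name: aggregate(past) for name, past in history.items()}
        updated = {
            name: best_response(
                name, {o: m for o, m in mixtures.items() if o != name}
            )
            for name in history
        }
        changed = any(updated[n] != history[n][-1] for n in history)
        for name, decision in updated.items():
            history[name].append(decision)
        # Stop once no agent has changed its decision for `patience` rounds.
        stable_rounds = 0 if changed else stable_rounds + 1
        if stable_rounds >= patience:
            break
    return history

# Toy stubs (not LLM calls): each agent simply copies the other's last decision.
copy_last = lambda past: past[-1]
mirror = lambda name, others: next(iter(others.values()))
print(mafp_loop({"a": "x", "b": "y"}, copy_last, mirror, max_iters=3))
```

With these toy stubs the two agents swap decisions every round and never stabilize, so the loop runs to `max_iters`. That is a small instance of the cycling concern the referee raises, and it shows why a stability-based criterion alone would not guarantee termination.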

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents MAFP as a direct application of the established game-theoretic fictitious play process to LLM agents, without any equations, parameter fitting, or derivations that reduce outputs to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the iterative best-response mechanism is described as an external principle rather than an internally defined or renamed construct. The evaluation claims are empirical (outperformance on tournament strength and robustness) and do not rely on self-referential steps. This matches the default case of a self-contained application of prior theory.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate concrete free parameters, axioms, or invented entities; the method description implies standard LLM prompting and game-theoretic iteration but does not specify any fitted values or new postulates.


Reference graph

Works this paper leans on

66 extracted references · 10 linked inside Pith

[1]

Playing repeated games with large language models.Nature Human Behaviour, 9(7):1380–1390, 2025

Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models.Nature Human Behaviour, 9(7):1380–1390, 2025

2025
[2]

Vending-bench: A benchmark for long-term coherence of autonomous agents.arXiv preprint arXiv:2502.15840, 2025

Axel Backlund and Lukas Petersson. Vending-bench: A benchmark for long-term coherence of autonomous agents.arXiv preprint arXiv:2502.15840, 2025

arXiv 2025
[3]

Werewolf arena: A case study in llm evaluation via social deduction.arXiv preprint arXiv:2407.13943, 2024

Suma Bailis, Jane Friedhoff, and Feiyang Chen. Werewolf arena: A case study in llm evaluation via social deduction.arXiv preprint arXiv:2407.13943, 2024

arXiv 2024
[4]

How well can llms negotiate? negotiationarena platform and analysis

Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, and James Zou. How well can llms negotiate? negotiationarena platform and analysis. InInternational Conference on Machine Learning, pages 3935–3951. PMLR, 2024

2024
[5]

Iterative solution of games by fictitious play.Act

George W Brown. Iterative solution of games by fictitious play.Act. Anal. Prod Allocation, 13(1):374, 1951

1951
[6]

Put your money where your mouth is: Evaluating strategic planning and execution of llm agents in an auction arena.arXiv preprint arXiv:2310.05746, 2023

Jiangjie Chen, Siyu Yuan, Rong Ye, Bodhisattwa Prasad Majumder, and Kyle Richardson. Put your money where your mouth is: Evaluating strategic planning and execution of llm agents in an auction arena.arXiv preprint arXiv:2310.05746, 2023

arXiv 2023
[7]

Llmarena: Assessing capabilities of large language models in dynamic multi-agent environments

Junzhe Chen, Xuming Hu, Shuodi Liu, Shiyu Huang, Wei-Wei Tu, Zhaofeng He, and Lijie Wen. Llmarena: Assessing capabilities of large language models in dynamic multi-agent environments. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13055–13077, 2024

2024
[8]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. InInternational Conference on Learning Representations, volume 2024, pages 20094–20136, 2024

2024
[9]

Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, and Arjun Yadav. Gamebench: Evaluating strategic reasoning abilities of llm agents.arXiv preprint arXiv:2406.06613, 2024

arXiv 2024
[10]

Hypothetical minds: Scaffolding theory of mind for multi-agent tasks with large language models.arXiv preprint arXiv:2407.07086, 2024

Logan Cross, Violet Xiang, Agam Bhatia, Daniel LK Yamins, and Nick Haber. Hypothetical minds: Scaffolding theory of mind for multi-agent tasks with large language models.arXiv preprint arXiv:2407.07086, 2024

arXiv 2024
[11]

Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941, 2025

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941, 2025

Pith/arXiv arXiv 2025
[12]

Improving factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. InForty-first international conference on machine learning, 2024

2024
[13]

Gtbench: Uncovering the strategic reasoning capabilities of llms via game-theoretic evaluations.Advances in Neural Information Processing Systems, 37:28219–28253, 2024

Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. Gtbench: Uncovering the strategic reasoning capabilities of llms via game-theoretic evaluations.Advances in Neural Information Processing Systems, 37:28219–28253, 2024

2024
[14]

Human-level play in the game of diplomacy by combining language models with strategic reasoning.Science, 378(6624):1067–1074, 2022

Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning.Science, 378(6624):1067–1074, 2022

2022
[15]

Theory of mind.Current biology, 15(17):R644–R645, 2005

Chris Frith and Uta Frith. Theory of mind.Current biology, 15(17):R644–R645, 2005

2005
[16]

The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[17]

Agentgroupchat-v2: Divide-and-conquer is what llm-based multi-agent system need.arXiv preprint arXiv:2506.15451, 2025

Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen, Qingyi Wang, Jialin Li, Xiaoran Shi, Haoran Guo, Wenxuan Huang, et al. Agentgroupchat-v2: Divide-and-conquer is what llm-based multi-agent system need.arXiv preprint arXiv:2506.15451, 2025

arXiv 2025
[18]

Suspicion-agent: Playing imperfect information games with theory of mind aware gpt-4.arXiv preprint arXiv:2309.17277, 2023

Jiaxian Guo, Bo Yang, Paul Yoo, Bill Yuchen Lin, Yusuke Iwasawa, and Yutaka Matsuo. Suspicion-agent: Playing imperfect information games with theory of mind aware gpt-4.arXiv preprint arXiv:2309.17277, 2023

arXiv 2023
[19]

A general theory of equilibrium selection in games.MIT Press Books, 1, 1988

John C Harsanyi and Reinhard Selten. A general theory of equilibrium selection in games.MIT Press Books, 1, 1988. 10 APREPRINT

1988
[20]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, volume 2024, pages 23247–23275, 2024

2024
[21]

Fantom: A benchmark for stress-testing machine theory of mind in interactions

Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. Fantom: A benchmark for stress-testing machine theory of mind in interactions. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023

2023
[22]

Fast algorithms for finding randomized strategies in game trees

Daphne Koller, Nimrod Megiddo, and Bernhard von Stengel. Fast algorithms for finding randomized strategies in game trees. InSymposium on the Theory of Computing, 1994

1994
[23]

Extensive games and the problem of information.Contributions to the Theory of Games, 2(28):193–216, 1953

Harold W Kuhn. Extensive games and the problem of information.Contributions to the Theory of Games, 2(28):193–216, 1953

1953
[24]

Macm: Utilizing a multi-agent system for condition mining in solving complex mathematical problems.Advances in Neural Information Processing Systems, 37:53418–53437, 2024

Bin Lei, Yi Zhang, Shan Zuo, Ali Payani, and Caiwen Ding. Macm: Utilizing a multi-agent system for condition mining in solving complex mathematical problems.Advances in Neural Information Processing Systems, 37:53418–53437, 2024

2024
[25]

Lemke and Jr

Carlton E. Lemke and Jr. Joseph T. Howson. Equilibrium points of bimatrix games.Journal of The Society for Industrial and Applied Mathematics, 12:413–423, 1964

1964
[26]

Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008, 2023

2023
[27]

Encouraging divergent thinking in large language models through multi-agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 17889–17904, 2024

2024
[28]

Avalonbench: Evaluating llms playing the game of avalon.arXiv preprint arXiv:2310.05036, 2023

Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. Avalonbench: Evaluating llms playing the game of avalon.arXiv preprint arXiv:2310.05036, 2023

arXiv 2023
[29]

L-mtp: Leap multi-token prediction beyond adjacent context for large language models.Advances in Neural Information Processing Systems, 38:102569–102600, 2026

Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, and Tat-Seng Chua. L-mtp: Leap multi-token prediction beyond adjacent context for large language models.Advances in Neural Information Processing Systems, 38:102569–102600, 2026

2026
[30]

A dynamic llm-powered agent network for task-oriented agent collaboration

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A dynamic llm-powered agent network for task-oriented agent collaboration. InFirst Conference on Language Modeling, 2024

2024
[31]

Computing approximate equilibria in sequential adversarial games by exploitability descent.arXiv preprint arXiv:1903.05614, 2019

Edward Lockhart, Marc Lanctot, Julien Pérolat, Jean-Baptiste Lespiau, Dustin Morrill, Finbarr Timbers, and Karl Tuyls. Computing approximate equilibria in sequential adversarial games by exploitability descent.arXiv preprint arXiv:1903.05614, 2019

arXiv 1903
[32]

Introducing wide research, 2025

Manus. Introducing wide research, 2025. URLhttps://manus.im/blog/introducing-wide-research

2025
[33]

R Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D Hardy, and Thomas L Griffiths. Embers of autoregression show how large language models are shaped by the problem they are trained to solve.Proceedings of the National Academy of Sciences, 121(41):e2322420121, 2024

2024
[34]

When a language model is optimized for reasoning, does it still show embers of autoregression? an analysis of openai o1.arXiv preprint arXiv:2410.01792, 2024

R Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D Hardy, and Thomas L Griffiths. When a language model is optimized for reasoning, does it still show embers of autoregression? an analysis of openai o1.arXiv preprint arXiv:2410.01792, 2024

arXiv 2024
[35]

Mike A. Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Hyeob Shin, Thomas Walshe, Estefany Kelly Buchanan, Junhong Shen, Guanghao Ye, Hao Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen,...

Pith/arXiv arXiv 2026
[36]

Fictitious play property for games with identical interests.Journal of economic theory, 68 (1):258–265, 1996

Dov Monderer and Lloyd S Shapley. Fictitious play property for games with identical interests.Journal of economic theory, 68 (1):258–265, 1996

1996
[37]

Non-cooperative games.ANNALS OF MATHEMATICS, 54(2), 1951

John Nash. Non-cooperative games.ANNALS OF MATHEMATICS, 54(2), 1951

1951
[38]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174–15186, 2024

2024
[39]

Escalation risks from language models in military and diplomatic decision-making

Juan-Pablo Rivera, Gabriel Mukobi, Anka Reuel, Max Lamparth, Chandler Smith, and Jacquelyn Schneider. Escalation risks from language models in military and diplomatic decision-making. InProceedings of the 2024 ACM conference on fairness, accountability, and transparency, pages 836–898, 2024. 11 APREPRINT

2024
[40]

Mixed-integer programming methods for finding nash equilibria

Tuomas Sandholm, Andrew Gilpin, and Vincent Conitzer. Mixed-integer programming methods for finding nash equilibria. In AAAI Conference on Artificial Intelligence, 2005

2005
[41]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023
[42]

Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

Pith/arXiv arXiv 2025
[43]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

Pith/arXiv arXiv 2026
[44]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/ blog?id=qwen3.5

2026
[45]

Multi-agent collaboration mechanisms: A survey of llms.arXiv preprint arXiv:2501.06322, 2025

Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms.arXiv preprint arXiv:2501.06322, 2025

Pith/arXiv arXiv 2025
[46]

Mixture-of-agents enhances large language model capabilities

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Y Zou. Mixture-of-agents enhances large language model capabilities. InInternational Conference on Learning Representations, volume 2025, pages 33944–33963, 2025

2025
[47]

A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

2024
[48]

Boosting llm agents with recursive contemplation for effective deception handling

Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, and Gao Huang. Boosting llm agents with recursive contemplation for effective deception handling. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9909–9953, 2024

2024
[49]

From bits to boardrooms: A cutting-edge multi-agent llm framework for business excellence

Zihao Wang and Junming Zhang. From bits to boardrooms: A cutting-edge multi-agent llm framework for business excellence. arXiv preprint arXiv:2508.15447, 2025

arXiv 2025
[50]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022
[51]

Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv: 2504.12516, 2025

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeff Han, Isa Fulford, Hyung Won Chung, Alexandre Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv: 2504.12516, 2025

Pith/arXiv arXiv 2025
[52]

Think twice: Perspective-taking improves large language models’ theory-of-mind capabilities

Alex Wilf, Sihyun Lee, Paul Pu Liang, and Louis-Philippe Morency. Think twice: Perspective-taking improves large language models’ theory-of-mind capabilities. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8292–8308, 2024

2024
[53]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024

2024
[54]

Hi-tom: A benchmark for evaluating higher-order theory of mind reasoning in large language models

Yufan Wu, Yinghui He, Yilin Jia, Rada Mihalcea, Yulong Chen, and Naihao Deng. Hi-tom: A benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10691–10706, 2023

2023
[55]

Tacomas: Test-time co-evolution of topology and capability in llm-based multi-agent systems

Chen Xu, Yicheng Hu, Ruizi Wang, Xinyu Lin, Wenjie Wang, Dongrui Liu, and Fuli Feng. Tacomas: Test-time co-evolution of topology and capability in llm-based multi-agent systems. arXiv preprint arXiv:2605.09539, 2026

arXiv 2026
[56]

A comprehensive survey of deep research: Systems, methodologies, and applications

Renjun Xu and Jingwen Peng. A comprehensive survey of deep research: Systems, methodologies, and applications. arXiv preprint arXiv:2506.12594, 2025

arXiv 2025
[57]

Qwen3 technical report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

arXiv 2025
[58]

Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems

Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems. Advances in Neural Information Processing Systems, 38:107309–107336, 2026

2026
[59]

Quantevolve: Automating quantitative strategy discovery through multi-agent evolutionary framework

Junhyeog Yun, Hyoun Jun Lee, and Insu Jeon. Quantevolve: Automating quantitative strategy discovery through multi-agent evolutionary framework. arXiv preprint arXiv:2510.18569, 2025

arXiv 2025
[60]

Steering no-regret learners to a desired equilibrium

Brian Hu Zhang, Gabriele Farina, Ioannis Anagnostides, Federico Cacciamani, Stephen Marcus McAleer, Andreas Alexander Haupt, Andrea Celli, Nicola Gatti, Vincent Conitzer, and Tuomas Sandholm. Steering no-regret learners to a desired equilibrium. arXiv preprint arXiv:2306.05221, 2023

arXiv 2023
[61]

Retailbench: Evaluating long-horizon autonomous decision-making and strategy stability of llm agents in realistic retail environments

Linghua Zhang, Jun Wang, Jingtong Wu, and Zhisong Zhang. Retailbench: Evaluating long-horizon autonomous decision-making and strategy stability of llm agents in realistic retail environments. arXiv preprint arXiv:2603.16453, 2026

arXiv 2026
[62]

Llm as a mastermind: A survey of strategic reasoning with large language models

Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. Llm as a mastermind: A survey of strategic reasoning with large language models. arXiv preprint arXiv:2404.01230, 2024

arXiv 2024
[63]

K-level reasoning: Establishing higher order beliefs in large language models for strategic reasoning

Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Yan Xia, Man Lan, and Furu Wei. K-level reasoning: Establishing higher order beliefs in large language models for strategic reasoning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Paper...

2025
[64]

Chain of agents: Large language models collaborating on long-context tasks

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö Arık. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37:132208–132237, 2024

2024
[65]

An agentic system for rare disease diagnosis with traceable reasoning

Weike Zhao, Chaoyi Wu, Yanjie Fan, Pengcheng Qiu, Xiaoman Zhang, Yuze Sun, Xiao Zhou, Shuju Zhang, Yu Peng, Yanfeng Wang, et al. An agentic system for rare disease diagnosis with traceable reasoning. Nature, pages 1–10, 2026

2026
[66]

Gptswarm: Language agents as optimizable graphs

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning, 2024

Table 3: Scenarios Description. Scenario Description Strategic Games TicTacToe A two-player game on a 3×3 grid; players alter...

2024
