TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

Christopher Olsen; Paulo Salem; Prerit Saxena; Rafael Barcelos; Robert Sim; Yi Ding

arXiv: 2507.09788 · v2 · submitted 2025-07-13 · cs.MA · cs.AI · cs.CL · cs.HC

Pith reviewed 2026-05-19 04:50 UTC · model grok-4.3

keywords: multiagent simulation · persona modeling · LLM agents · behavioral simulation · social simulation · AI toolkit · agent-based modeling

The pith

TinyTroupe lets users define detailed personas and run LLM-driven simulations to solve individual or group behavioral problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TinyTroupe as a Python toolkit that supports fine-grained persona specifications covering attributes like nationality, age, occupation, personality, beliefs, and behaviors. These personas are then driven by multiple LLM-based mechanisms to generate actions and interactions in multiagent setups. The result is a practical way to formulate and explore behavioral questions at scale, such as brainstorming sessions or market research, without assembling real participants. The work also supplies experimentation support, population sampling, and basic validation tools to make such simulations usable for behavioral studies. The authors present the components through concrete examples and report both quantitative and qualitative assessments of their performance.
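
To make the persona idea concrete, here is a minimal sketch in plain Python. It does not use TinyTroupe's actual API: the `Persona` class, its field names, and `render_system_prompt` are hypothetical names chosen for illustration, under the assumption that a persona specification is eventually rendered into text that conditions an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    """Hypothetical persona record; not TinyTroupe's own data model."""
    name: str
    age: int
    nationality: str
    occupation: str
    personality: list[str] = field(default_factory=list)
    beliefs: list[str] = field(default_factory=list)
    behaviors: list[str] = field(default_factory=list)


def render_system_prompt(p: Persona) -> str:
    """Turn the structured attributes into a conditioning prompt."""
    lines = [f"You are {p.name}, a {p.age}-year-old {p.nationality} {p.occupation}."]
    sections = (
        ("Personality", p.personality),
        ("Beliefs", p.beliefs),
        ("Behaviors", p.behaviors),
    )
    for label, items in sections:
        if items:  # skip attribute groups that were left empty
            lines.append(f"{label}: " + "; ".join(items))
    return "\n".join(lines)


# Illustrative values only; they do not come from the paper.
lisa = Persona(
    name="Lisa",
    age=28,
    nationality="Canadian",
    occupation="data scientist",
    personality=["curious", "analytical"],
    beliefs=["evidence beats intuition"],
)
prompt = render_system_prompt(lisa)
print(prompt)
```

Keeping the persona as structured data rather than free text is what would let a toolkit sample, vary, and validate attributes programmatically before any LLM call is made.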

Core claim

TinyTroupe enables the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution through detailed persona definitions and LLM-driven mechanisms.

What carries the argument

Detailed persona specifications combined with LLM-powered control mechanisms that generate and steer agent behaviors and interactions.

If this is right

Users can model and iterate on scenarios such as brainstorming or market research sessions using only programmatic persona definitions.
The toolkit supplies built-in support for population sampling, experimentation, and integrated validation of simulation outputs.
The conceptual approach can be partially or fully adopted in other multiagent or simulation frameworks beyond the provided Python library.
Quantitative and qualitative evaluations demonstrate both the possibilities and the current limitations of the persona-driven approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the toolkit with real-world demographic data sources could allow more statistically representative population sampling.
The same persona mechanism might be applied to test hypotheses from social science by varying specific attributes across simulated groups.
Because the simulations rely on LLM conditioning, systematic bias audits on generated behaviors would be a natural next measurement step.
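
A rough sketch of the first two extensions, assuming personas are plain dictionaries and that attributes are sampled independently from marginal distributions. The distributions, the attribute names, and the helpers `sample_population` and `with_attribute` are all hypothetical; real demographic data would need joint distributions rather than independent marginals.

```python
import random
from collections import Counter

# Made-up marginal distributions, for illustration only.
AGE_BANDS = {"18-29": 0.2, "30-44": 0.3, "45-64": 0.3, "65+": 0.2}
HAS_CHILDREN = {True: 0.4, False: 0.6}


def sample_population(n, seed=0):
    """Draw n persona stubs, sampling each attribute independently."""
    rng = random.Random(seed)  # seeded so experiments are repeatable
    people = []
    for i in range(n):
        age = rng.choices(list(AGE_BANDS), weights=list(AGE_BANDS.values()))[0]
        kids = rng.choices(list(HAS_CHILDREN), weights=list(HAS_CHILDREN.values()))[0]
        people.append({"id": i, "age_band": age, "has_children": kids})
    return people


def with_attribute(population, key, value):
    """Return a copy of the population with one attribute forced to a value."""
    return [{**p, key: value} for p in population]


population = sample_population(200, seed=42)
treated = with_attribute(population, "has_children", True)
counts = Counter(p["age_band"] for p in population)
print(counts)
```

Comparing simulation outcomes between `population` and `treated` isolates the effect of a single attribute, which is the kind of controlled variation the second bullet has in mind.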

Load-bearing premise

That LLM outputs conditioned on the supplied persona attributes will produce sufficiently realistic and consistent human-like behavior for the intended simulation use cases.

What would settle it

Run a controlled scenario with TinyTroupe personas and compare their responses side-by-side with actual human participants performing the same task; significant divergence in consistency, realism, or decision patterns would undermine the central claim.
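
One simple way to quantify such a side-by-side comparison is the total variation distance between the two answer distributions. The sketch below assumes categorical responses; the response lists are invented for illustration and are not data from the paper, and the choice of metric is this page's assumption rather than the authors' method.

```python
from collections import Counter


def distribution(responses):
    """Convert a list of categorical answers into relative frequencies."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Invented example answers to the same purchase question.
human = ["buy", "buy", "skip", "skip", "skip", "unsure"]
simulated = ["buy", "buy", "buy", "buy", "skip", "skip"]

tvd = total_variation(distribution(human), distribution(simulated))
print(f"total variation distance: {tvd:.3f}")
```

A distance near zero would support the realism premise for that task; what counts as "significant divergence" still has to be fixed in advance, for example by comparing against the distance between two independent human samples.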

Figures

Figures reproduced from arXiv: 2507.09788 by Christopher Olsen, Paulo Salem, Prerit Saxena, Rafael Barcelos, Robert Sim, Yi Ding.

**Figure 2.** As expected, families with children largely reject ...

Original abstract

Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation -- with its distinctive challenges and opportunities -- remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe's components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, highlighting possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TinyTroupe is a practical open-source toolkit that integrates detailed personas, sampling, and some validation for LLM multi-agent simulations, but its usefulness for real behavioral problems rests on untested assumptions about output realism.

The main point for you is that this paper ships a usable Python library that lets you define agents with fine-grained attributes like age, nationality, personality, beliefs, and behaviors, then run them in groups with LLM control and some built-in experimentation features. It targets gaps the authors identify in other MAS tools, such as population sampling and integrated validation support, and backs that up with working examples for brainstorming sessions and market research scenarios plus selected quantitative and qualitative checks on output properties like coherence and coverage. The open-source release on GitHub makes it straightforward to inspect and extend the code, which is a concrete plus for anyone who wants to experiment with persona-driven simulations rather than start from scratch.

What the work does well is the engineering integration: it combines those elements into one package with programmatic hooks, and the examples clarify how the pieces fit together for practical tasks at individual or group level. The evaluations, while limited, at least surface some trade-offs instead of claiming perfect results.

The soft spot is the realism assumption. The paper's claim that this setup provides effective means to solve behavioral problems depends on persona-conditioned LLM outputs producing sufficiently consistent and human-like behavior, yet the reported checks appear to stay internal to the system rather than comparing outputs against empirical human data or external benchmarks. That leaves open the possibility that model artifacts like bias amplification or low variance could limit reliability for the stated use cases, even if the API itself is clean.

This is aimed at applied folks doing early product testing, social simulations, or behavioral prototyping who need a ready tool rather than a theoretical advance. A reader building or evaluating simulation setups would find the examples and code useful. It deserves a serious referee because the contribution is grounded in implementation details and addresses documented gaps in existing libraries, even if revisions should strengthen the external validation angle.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TinyTroupe, a Python-based open-source toolkit for LLM-powered multi-agent simulations that supports detailed persona definitions (nationality, age, occupation, personality, beliefs, behaviors) along with population sampling, programmatic control mechanisms, experimentation facilities, and validation support. It positions the toolkit as addressing gaps in prior MAS libraries and demonstrates its use via working examples such as brainstorming and market research sessions. The central claim is that these features enable concise formulation of behavioral problems at individual or group levels and provide effective means for their solution, supported by quantitative and qualitative evaluations of selected aspects.

Significance. If the realism of persona-conditioned LLM outputs holds for the target use cases, TinyTroupe would offer a meaningful advance over existing MAS tools by enabling finer-grained, controllable behavioral simulations suitable for applications in market research, social simulation, and group problem-solving. Notable strengths include the open-source release, the dual use of working examples to illustrate components while demonstrating practical utility, and the framing as a conceptual contribution adaptable to other contexts.

major comments (1)

[Evaluations section] The quantitative and qualitative evaluations focus on internal properties such as output coherence and scenario coverage. This is load-bearing for the abstract's claim of providing 'effective means for their solution' in practical settings (e.g., market research), because without external grounding against empirical human behavior data, LLM artifacts (recency bias, stereotype amplification, low variance) could render the simulations ineffective even if the API and persona mechanisms are well-designed.
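
As an illustration of how the low-variance concern could be checked, the sketch below computes the Shannon entropy of simulated answers, normalized by the maximum possible for the answer set. The function name `normalized_entropy`, the option labels, and the sample answers are hypothetical; this is one possible diagnostic under those assumptions, not a procedure taken from the manuscript.

```python
import math
from collections import Counter


def normalized_entropy(responses, options):
    """Shannon entropy of the answers, scaled to [0, 1] by log2(len(options)).

    Assumes a non-empty response list and at least two possible options.
    """
    counts = Counter(responses)
    n = len(responses)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(options))


OPTIONS = ["yes", "no", "maybe", "never"]  # invented answer set

collapsed = ["yes"] * 10  # every agent gives the same answer
spread = ["yes", "no", "maybe", "never"] * 3  # answers evenly spread

print(normalized_entropy(collapsed, OPTIONS))
print(normalized_entropy(spread, OPTIONS))
```

A score close to zero across many differently specified personas would suggest that the persona conditioning is being ignored, which is the failure mode this comment worries about.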

minor comments (2)

[Abstract] The phrase 'among other key capabilities' is imprecise; explicitly listing the full set of addressed deficiencies would improve clarity.
[Examples section] Ensure all code snippets and figures are numbered and directly referenced in the surrounding text to aid readers in following the component descriptions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and recommendation of minor revision. We address the single major comment below and have revised the manuscript to better contextualize our evaluation claims and limitations.

Referee: [Evaluations section] The quantitative and qualitative evaluations focus on internal properties such as output coherence and scenario coverage. This is load-bearing for the abstract's claim of providing 'effective means for their solution' in practical settings (e.g., market research), because without external grounding against empirical human behavior data, LLM artifacts (recency bias, stereotype amplification, low variance) could render the simulations ineffective even if the API and persona mechanisms are well-designed.

Authors: We agree that the evaluations focus on internal properties and that the absence of direct comparison to empirical human behavior data limits strong claims about practical effectiveness in settings such as market research. LLM artifacts are a genuine concern. In the revised manuscript we have expanded the Evaluations section with an explicit discussion of these limitations, including recency bias, stereotype amplification, and output variance. We have also revised the abstract and introduction to state that TinyTroupe provides mechanisms for concise formulation and exploration of behavioral problems, with effectiveness subject to the underlying LLM and to user-led validation. These changes temper the original wording while preserving the toolkit's contribution. Comprehensive external grounding against human data remains an important open research direction beyond the scope of this paper.

Revision: yes

Circularity Check

0 steps flagged

No circularity: toolkit paper with no derivations or self-referential reductions

The paper presents TinyTroupe as an open-source Python toolkit for LLM-powered multiagent persona simulations, focusing on implementation details, persona definitions, control mechanisms, working examples (e.g., brainstorming and market research), and selected quantitative/qualitative evaluations of internal properties like coherence. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described content. Claims rest on the toolkit's design and demonstrations rather than any mathematical structure that could reduce to its inputs by construction. No self-citations are load-bearing for a central premise, as the contribution is conceptual and practical implementation rather than a theorem or predictive model. The paper is self-contained against external benchmarks in the sense that its value is in provided code and examples, not in any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a software toolkit paper, the work introduces no mathematical axioms, free parameters fitted to data, or new physical entities; the central contribution is engineering and API design rather than derivation from postulates.

pith-pipeline@v0.9.0 · 5798 in / 1042 out tokens · 31484 ms · 2026-05-19T04:50:44.507240+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: TinyTroupe enables the concise formulation of behavioral problems... through detailed persona definitions and LLM-driven mechanisms.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: Persona-based: enables rich, fine-grained definitions of personas (age, occupation, personality...); Action generation, monitoring and correction.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles
cs.AI 2026-05 unverdicted novelty 7.0

ScioMind combines anchoring-based belief updates, hierarchical memory, and dynamic profiles in LLM multi-agent systems to produce more stable, diverse, and psychologically aligned opinion trajectories than prior fixed...
PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models
cs.AI 2026-05 unverdicted novelty 6.0

PersonaArena is a dynamic simulation framework that constructs persona banks from social data and uses multi-agent debating judges to evaluate and enhance persona-level role-playing in LLMs.
CHORUS: An Agentic Framework for Generating Realistic Deliberation Data
cs.AI 2026-04 unverdicted novelty 6.0

Chorus generates realistic deliberation discussions via LLM agents with memory and Poisson-timed participation, validated by 30 experts on realism, coherence, and utility.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 3 Pith papers · 6 internal anchors

[1] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.
[2] Rafael Bordini, Lars Braubach, Mehdi Dastani, Amal El Fallah Seghrouchni, Jorge Gomez-Sanz, Joao Leite, Gregory O’Hare, Alexander Pokahr, and Alessandro Ricci. A survey of programming languages and platforms for multi-agent systems. Informatica, 30(1):33–44, 2006.
[3] Michael E. Bratman. Intention, Plans, and Practical Reason. 1987.
[4] Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent llm systems fail? arXiv e-prints, pages arXiv–2503, 2025.
[5] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
[6] crewAI Inc. Crewai. https://github.com/crewAIInc/crewAI, 2024. Accessed: 2024-11-15.
[7] Joshua M. Epstein and Robert Axtell. Growing Artificial Societies: Social Science from the Bottom Up. The MIT Press, 1996.
[8] Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xi, Fei Huang, and Jingren Zhou. Ai hospital: Interactive evaluation and collaboration of llms as intern doctors for clinical diagnosis. arXiv preprint arXiv:2402.09742, 2024.
[9] Jacques Ferber and Gerhard Weiss. Multi-agent systems: an introduction to distributed artificial intelligence, volume 1. Addison-Wesley Reading, 1999.
[10] Nigel Gilbert and Steven Bankes. Platforms and methods for agent-based modeling. Proceedings of the National Academy of Sciences of the United States, 99(Supplement 3), 2002.
[11] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.
[12] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: a survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI ’24, 2024.
[13] Junkai Li, Siyu Wang, Meng Zhang, Weitao Li, Yunghwei Lai, Xinhui Kang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024.
[14] Sean Luke, Claudio Cioffi-Revilla, Liviu Panait, and Keith Sullivan. MASON: A new multi-agent simulation toolkit. 2004. http://cs.gmu.edu/~eclab/projects/mason/
[15] Fabien Michel, Jacques Ferber, and Alexis Drogoul. Multi-agent systems and simulation: A survey from the agent community’s perspective. In Multi-Agent Systems, pages 17–66. CRC Press, 2018.
[16] N. Minar, R. Burkhart, C. Langton, and M. Askenazi. The Swarm simulation system: A toolkit for building multi-agent simulations. 1996. Working Paper 96-06-042.
[17] Michael North, Nick Collier, and Jerry R. Vos. Experiences creating three implementations of the Repast agent modeling toolkit. ACM Transactions on Modeling and Computer Simulation, 16(1):1–25, 2006. http://repast.sourceforge.net/
[18] OpenAI. ChatGPT (Nov 2024 version), 2024. [Large language model]
[19] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.
[20] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, New York, NY, USA, 2023. Association for Computing Machinery.
[21] Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109, 2024.
[22] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186, 2024.
[23] Anand S Rao, Michael P Georgeff, et al. Bdi agents: from theory to practice. In Icmas, volume 95, pages 312–319, 1995.
[24] Paulo Salem. The case for experiment-oriented computing. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 719–723, 2018.
[25] Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, and Vishvak Murahari. Personagym: Evaluating persona agents and llms. arXiv preprint arXiv:2407.18416, 2024.
[26] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
[27] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
[28] Sandie Taylor and Lance Workman. Cognitive Psychology: The Basics. Routledge, 2021.
[29] Jean-Pierre Treuil, Alexis Drogoul, and Jean-Daniel Zucker. Modélisation et simulation à base d’agents: exemples commentés, outils informatiques et questions théoriques. Dunod, 2008.
[30] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.
[31] U. Wilensky. NetLogo. 1999.
[32] Michael Wooldridge. An Introduction to MultiAgent Systems. Wiley Publishing, 2nd edition, 2009.
[33] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
[34] Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, et al. Oasis: Open agents social interaction simulations on one million agents. arXiv preprint arXiv:2411.11581, 2024.
[35] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
[36] Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. Personality alignment of large language models. arXiv preprint arXiv:2408.11779, 2024.

[1] [1]

Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint:2503.08679, 2025

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025

work page internal anchor Pith review arXiv 2025

[2] [2]

A survey of programming languages and platforms for multi-agent systems

Rafael Bordini, Lars Braubach, Mehdi Dastani, Amal El Fallah Seghrouchni, Jorge Gomez-Sanz, Joao Leite, Gregory O’Hare, Alexander Pokahr, and Alessandro Ricci. A survey of programming languages and platforms for multi-agent systems. Informatica, 30(1):33–44, 2006

work page 2006

[3] [3]

Michael E. Bratman. Intention, Plans, and Practical Reason . 1987

work page 1987

[4] [4]

Why do multi-agent llm systems fail? arXiv e-prints, pages arXiv–2503, 2025

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ram- chandran, et al. Why do multi-agent llm systems fail? arXiv e-prints, pages arXiv–2503, 2025

work page 2025

[5] [5]

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023. 8Being an open-source project, the community is also welcome to propose such ad- ditions, as well as new use cases to challenge the current li...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

crewAI Inc. Crewai. https://github.com/crewAIInc/crewAI, 2024. Accessed: 2024-11-15

work page 2024

[7] [7]

Epistein and Robert Axtell

Joshua M. Epistein and Robert Axtell. Growing Artificial Societies: Social Science from the Bottom Up . The MIT Press, 1996

work page 1996

[8] [8]

Ai hospital: Benchmarking large language models in a multi-agent medical interaction simulator,

Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xi, Fei Huang, and Jingren Zhou. Ai hospital: Interactive evaluation and collaboration of llms as intern doctors for clinical diagnosis. arXiv preprint arXiv:2402.09742, 2024

work page arXiv 2024

[9] [9]

Multi-agent systems: an introduction to distributed artificial intelligence, volume 1

Jacques Ferber and Gerhard Weiss. Multi-agent systems: an introduction to distributed artificial intelligence, volume 1. Addison-wesley Reading, 1999

work page 1999

[10] [10]

Platforms and methods for agent-based modeling

Nigel Gilbert and Steven Bankers. Platforms and methods for agent-based modeling. Proceedings of the National Academy of Sciences of the United States , 99(Supplement 3), 2002

work page 2002

[11] [11]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a- judge. arXiv preprint arXiv:2411.15594, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Chawla, Olaf Wiest, and Xiangliang Zhang

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi- agents: a survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence , IJCAI ’24, 2024

work page 2024

[13] [13]

Agent hospital: A simulacrum of hospital with evolvable medical agents,

Junkai Li, Siyu Wang, Meng Zhang, Weitao Li, Yunghwei Lai, Xinhui Kang, Weizhi Ma, and Yang Liu. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024

work page arXiv 2024

[14] [14]

MASON: A new multi-agent simulation toolkit

Sean Luke, Claudio Cioffi-Revilla, Liviu Panait, and Keith Sullivan. MASON: A new multi-agent simulation toolkit. 2004. http://cs.gmu.edu/~eclab/projects/ mason/

work page 2004

[15] [15]

Multi-agent systems and simulation: A survey from the agent commu-nity’s perspective

Fabien Michel, Jacques Ferber, and Alexis Drogoul. Multi-agent systems and simulation: A survey from the agent commu-nity’s perspective. In Multi-Agent Systems, pages 17–66. CRC Press, 2018

work page 2018

[16] [16]

Minar, R

N. Minar, R. Burkhart, C. Langton, and M. Askenazi. The Swarm simulation system: A toolkit for building multi-agent simulations. 1996. Working Paper 96-06-042

work page 1996

[17] [17]

Michael North, Nick Collier, and Jerry R. Vos. Experiences creating three imple- mentations of the Repast agent modeling toolkit. ACM Transactions on Modeling and Computer Simulation, 16(1):1–25, 2006. http://repast.sourceforge.net/

work page 2006

[18] [18]

ChatGPT (Nov 2024 version), 2024

OpenAI. ChatGPT (Nov 2024 version), 2024. [Large language model]

work page 2024

[19] [19]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Bernstein

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User In- terface Software and Technology, UIST ’23, New York, NY, USA, 2023. Association for Computing Machinery

[21]

LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals

Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109, 2024

[22]

ChatDev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. ChatDev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186, 2024

[23]

BDI agents: From theory to practice

Anand S Rao, Michael P Georgeff, et al. BDI agents: From theory to practice. In ICMAS, volume 95, pages 312–319, 1995

[24]

The case for experiment-oriented computing

Paulo Salem. The case for experiment-oriented computing. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 719–723, 2018

[25]

PersonaGym: Evaluating persona agents and LLMs

Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, and Vishvak Murahari. PersonaGym: Evaluating persona agents and LLMs. arXiv preprint arXiv:2407.18416, 2024

[26]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

[27]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

[28]

Cognitive Psychology: The Basics

Sandie Taylor and Lance Workman. Cognitive Psychology: The Basics. Routledge, 2021

[29]

Modélisation et simulation à base d’agents: exemples commentés, outils informatiques et questions théoriques

Jean-Pierre Treuil, Alexis Drogoul, and Jean-Daniel Zucker. Modélisation et simulation à base d’agents: exemples commentés, outils informatiques et questions théoriques. Dunod, 2008

[30]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

[31]

NetLogo

U. Wilensky. NetLogo. 1999

[32]

An Introduction to MultiAgent Systems

Michael Wooldridge. An Introduction to MultiAgent Systems. Wiley Publishing, 2nd edition, 2009

[33]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023

[34]

Oasis: Open agents social interaction simulations on one million agents

Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, et al. Oasis: Open agents social interaction simulations on one million agents. arXiv preprint arXiv:2411.11581, 2024

[35]

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

[36]

Personality alignment of large language models

Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. Personality alignment of large language models. arXiv preprint arXiv:2408.11779, 2024
