pith. machine review for the scientific record.

arxiv: 2605.06320 · v1 · submitted 2026-05-07 · 💻 cs.MA · cs.AI · cs.CL


Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs

Elizabeth Mieczkowski, Alexander Ku, Tiwalayo Eisape, Dilip Arumugam, John Matters, Katherine M. Collins, Ilia Sucholutsky, Thomas L. Griffiths


Pith reviewed 2026-05-08 03:40 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.CL
keywords LLM agent teams · task graphs · multi-agent coordination · adaptive task decomposition · language model collaboration · coordination efficiency · distributed systems

The pith

LLM agent teams maintain a shared evolving task graph to reduce token use, time, and conflicts while matching accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LATTE as a coordination approach for teams of language model agents that jointly build and update a shared graph tracking task dependencies, assignments, and progress. This replaces both rigid pre-set structures and fully open interactions that often lead to duplicated work or clashes. The method draws on ideas from distributed computing to let agents operate with only partial views of the overall state yet keep the plan consistent. Experiments across tasks and models show drops in resources spent and fewer failures such as file conflicts, without any drop in the quality of the final results. A reader would care because many practical uses of agent teams involve ongoing collaboration where fixed plans break and free-form talk wastes effort.

Core claim

In LATTE, a team of agents collaboratively constructs and maintains a shared, evolving coordination graph that encodes sub-task dependencies, individual agent assignments, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, the approach reduces token usage, wall-clock time, communication, and coordination failures such as file conflicts and redundant outputs, while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.

What carries the argument

The shared evolving coordination graph that records sub-task dependencies, agent assignments, and progress states so agents can update it together under partial information.
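The review does not reproduce the graph's exact schema. A minimal sketch of such a shared structure, with hypothetical field and method names standing in for whatever LATTE actually uses, might look like:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"

@dataclass
class SubTask:
    name: str
    assignee: Optional[str] = None               # which agent owns this sub-task
    status: Status = Status.PENDING
    deps: set = field(default_factory=set)       # names of prerequisite sub-tasks

class TaskGraph:
    """Shared coordination graph: nodes are sub-tasks, edges are dependencies."""
    def __init__(self):
        self.nodes: dict = {}

    def add_task(self, name, deps=()):
        self.nodes[name] = SubTask(name=name, deps=set(deps))

    def ready(self):
        """Sub-tasks whose dependencies are all done and that are unassigned."""
        return [t for t in self.nodes.values()
                if t.status is Status.PENDING and t.assignee is None
                and all(self.nodes[d].status is Status.DONE for d in t.deps)]

    def claim(self, name, agent):
        """An agent atomically claims a ready, unassigned sub-task."""
        t = self.nodes[name]
        if t.assignee is None and t in self.ready():
            t.assignee, t.status = agent, Status.IN_PROGRESS
            return True
        return False
```

On such a structure, "discovering new tasks" is just `add_task` called mid-run, and the `ready()` queue is what lets workers self-assign without a fixed plan.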

If this is right

  • Agents can discover new tasks and reallocate work dynamically instead of following a fixed plan.
  • Coordination failures such as file conflicts and redundant outputs become less frequent.
  • Overall token consumption and wall-clock time decrease across varied base models and tasks.
  • Final accuracy stays at or above the level of fixed hierarchies and unstructured teams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar shared-state mechanisms could help agent teams that must incorporate external tools or data sources mid-task.
  • The approach may scale best on longer projects where the cost of early miscoordination grows large.
  • Explicit dependency tracking might reduce the need for verbose natural-language messages between agents.

Load-bearing premise

Language model agents can reliably and consistently collaborate to build and maintain the shared task graph without creating new inconsistencies or adding too much extra overhead.

What would settle it

Run LATTE on a multi-step collaborative coding task and check whether the maintained graph ever allows two agents to edit the same file at once or repeat the same sub-task output.
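That settling experiment could be instrumented with a simple audit over a log of file-write events. The function below is a hypothetical checker, not part of LATTE; it flags any round in which two agents wrote the same file:

```python
from collections import defaultdict

def find_conflicts(write_log):
    """write_log: list of (round, agent, path) file-write events.
    Returns (round, path, agents) triples where two different agents
    wrote the same file in the same round."""
    conflicts = []
    by_round = defaultdict(set)
    for rnd, agent, path in write_log:
        for other, other_path in by_round[rnd]:
            if other != agent and other_path == path:
                conflicts.append((rnd, path, sorted({agent, other})))
        by_round[rnd].add((agent, path))
    return conflicts
```

An empty return list over many trials would support the load-bearing premise; any hit would falsify the claim that the maintained graph prevents concurrent writes.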

Figures

Figures reproduced from arXiv: 2605.06320 by Alexander Ku, Dilip Arumugam, Elizabeth Mieczkowski, Ilia Sucholutsky, John Matters, Katherine M. Collins, Thomas L. Griffiths, Tiwalayo Eisape.

Figure 1: LATTE. Most existing LLM team designs are either highly structured (a. pipeline systems; b. Leader-Worker hierarchies) or unstructured (c. decentralized teams). (d) LATTE provides teams with a dynamic coordination graph that they collectively maintain and adapt. For example, in a data analysis task, the Lead initializes G0 and assigns Worker 1 to preprocess. As Worker 1 learns about the data, it spawns par…

Figure 2: Efficiency-accuracy tradeoff. A) LATTE achieves greater efficiency than alternative frameworks. We measure expected cost (total tokens or wall-clock time weighted by trial completion rate) to account for runs in which teams fail to terminate. B) LATTE achieves higher task success with lower token consumption (normalized across tasks) on the accuracy-vs-token-cost Pareto frontier. Task 1: Exploratory Data A…

Figure 3: LLM teams successfully utilize LATTE. A) LATTE teams emergently call all graph operators across rounds, demonstrating full utilization of the coordination toolkit. B) Dynamic coordination graphs grow larger than static ones over time. This reflects richer and more fine-grained understanding of which subtasks need to be executed, offering more opportunities for Workers to be deployed. In contrast, a smaller…

Figure 4: LATTE improves coordination. (A) Overwrites: agents overwriting a prior agent’s work in a later round. (B) Concurrent writes: two agents simultaneously writing to the same function. (C) Wasted output: characters written that do not appear in the final output. (D) Communication overhead: number and volume of messages exchanged. (E) Inactivity: proportion of rounds with an agent suppressed. LATTE reduces A–D…
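The expected-cost metric in Figure 2 (raw cost weighted by trial completion rate) admits a simple formalization. One plausible reading, dividing raw cost by the completion rate so that failed runs inflate effective cost, is:

```python
def expected_cost(total_cost, completion_rate):
    """Expected cost per successful run: raw cost (tokens or seconds)
    inflated by the fraction of trials that fail to terminate.
    E.g. 10,000 tokens at an 80% completion rate -> 12,500 effective tokens."""
    if completion_rate <= 0:
        return float("inf")          # a team that never finishes has unbounded cost
    return total_cost / completion_rate
```

This is an assumed formula, not the paper's stated definition; the manuscript may weight runs differently.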
read the original abstract

Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints. In LATTE, a team of agents collaboratively construct and maintain a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, we demonstrate how LATTE reduces token usage, wall-clock time, communication, and coordination failures (e.g. file conflicts and redundant outputs) while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LATTE (Language Agent Teams for Task Evolution), a coordination framework for LLM-based agent teams. Agents collaboratively build and maintain a shared, evolving task graph encoding sub-task dependencies, assignments, and progress states. Inspired by distributed-systems principles for partial observability, the protocol aims to enable dynamic work allocation and adaptation. Empirical results across collaborative tasks and base models claim that LATTE reduces token usage, wall-clock time, communication volume, and coordination failures (e.g., file conflicts, redundant outputs) while matching or exceeding accuracy of baselines including MetaGPT, fully decentralized teams, Leader-Worker hierarchies, and static decompositions.

Significance. If the empirical claims hold under rigorous controls, LATTE offers a practical middle path between rigid a-priori structures and unstructured teams, potentially improving scalability and resource efficiency in multi-agent LLM deployments. The distributed-systems analogy and focus on measurable coordination failures provide a concrete, falsifiable protocol that could influence subsequent work on agent orchestration.

major comments (2)
  1. [Abstract / Experimental Evaluation] The central efficiency claims (reductions in tokens, time, communication, and failures) rest on experimental demonstrations, yet the abstract and framing provide no quantitative results, error bars, statistical tests, or controls. This leaves the magnitude and reliability of the reported improvements unassessable from the given material; the full manuscript must supply detailed tables, ablation studies, and significance testing to support the efficiency-accuracy tradeoff.
  2. [LATTE Protocol Description] The protocol's correctness hinges on agents reliably constructing, updating, and maintaining a consistent shared task graph under partial observability. The manuscript should provide a precise description (with pseudocode or state-transition rules) of conflict resolution, consistency guarantees, and overhead measurements for graph maintenance; without this, the weakest assumption identified in the review cannot be evaluated.
minor comments (2)
  1. [Introduction / Framework Overview] Define the precise data structure of the adaptive task graph (nodes, edges, state fields) with an early figure or formal notation to aid readability.
  2. [Experimental Setup] Clarify how baseline implementations (MetaGPT, Leader-Worker, etc.) were reproduced or adapted to ensure fair comparison of communication and failure metrics.
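The second major comment asks for explicit conflict-resolution and consistency rules. As an illustration of what such pseudocode might specify (a standard optimistic compare-and-swap scheme, not necessarily the protocol LATTE actually uses), a versioned update rule for the shared graph could look like:

```python
import threading

class VersionedGraph:
    """Optimistic concurrency for a shared task graph: an update is applied
    only if the version it was computed against is still current; otherwise
    the agent re-reads and retries. Illustrative only, not the authors' design."""
    def __init__(self):
        self.version = 0
        self.state = {}
        self._lock = threading.Lock()

    def read(self):
        """Return a (version, snapshot) pair for an agent's partial view."""
        with self._lock:
            return self.version, dict(self.state)

    def try_update(self, base_version, updates):
        """Apply updates atomically iff no one else wrote in between."""
        with self._lock:
            if base_version != self.version:
                return False          # stale view: caller must re-read and retry
            self.state.update(updates)
            self.version += 1
            return True
```

Under this rule two agents can never both succeed in claiming the same node from the same snapshot, which is the kind of guarantee the referee is asking the manuscript to state and measure.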

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address the major comments point-by-point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Abstract / Experimental Evaluation] The central efficiency claims (reductions in tokens, time, communication, and failures) rest on experimental demonstrations, yet the abstract and framing provide no quantitative results, error bars, statistical tests, or controls. This leaves the magnitude and reliability of the reported improvements unassessable from the given material; the full manuscript must supply detailed tables, ablation studies, and significance testing to support the efficiency-accuracy tradeoff.

    Authors: We agree that the abstract should include quantitative highlights to make the efficiency claims more concrete and assessable. The full manuscript provides detailed tables, ablation studies, and comparisons across tasks and models in the experimental section. In the revision, we will update the abstract to report key quantitative results such as average reductions in token usage and time, and we will ensure that error bars and statistical significance are explicitly discussed and visualized in the main text. revision: yes

  2. Referee: [LATTE Protocol Description] The protocol's correctness hinges on agents reliably constructing, updating, and maintaining a consistent shared task graph under partial observability. The manuscript should provide a precise description (with pseudocode or state-transition rules) of conflict resolution, consistency guarantees, and overhead measurements for graph maintenance; without this, the weakest assumption identified in the review cannot be evaluated.

    Authors: We acknowledge the need for a more precise and formal description of the LATTE protocol to allow evaluation of its correctness under partial observability. The manuscript currently describes the protocol in natural language with examples of graph evolution. To strengthen this, we will include pseudocode for the core procedures of task graph construction, update, and conflict resolution, along with a discussion of consistency mechanisms drawn from distributed systems principles. We will also add experimental measurements of the computational overhead for maintaining the shared graph. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with external validation

full rationale

The paper introduces LATTE as a practical coordination protocol for LLM agent teams, drawing inspiration from distributed systems concepts like partial observability but presenting it as an implemented design rather than a mathematical derivation. No equations, fitted parameters, or predictions appear that reduce by construction to inputs. Claims of efficiency gains (reduced tokens, time, conflicts) rest on empirical comparisons to external baselines (MetaGPT, decentralized teams, Leader-Worker hierarchies, static decompositions) across multiple tasks and models. The graph maintenance protocol is described as a design choice to handle consistency under partial views, not derived from self-cited uniqueness theorems or ansatzes. This is a self-contained empirical systems contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that current LLMs possess sufficient reasoning and instruction-following ability to jointly maintain a consistent shared graph under partial observability; no free parameters or invented physical entities are described.

axioms (1)
  • domain assumption LLM agents can collaboratively construct and maintain a consistent shared task graph without introducing new coordination failures
    Invoked as the core operating premise of the LATTE protocol.
invented entities (1)
  • Adaptive Task Graph no independent evidence
    purpose: Shared data structure encoding sub-task dependencies, agent assignments, and progress state
    New coordination artifact introduced by the framework; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5538 in / 1312 out tokens · 57433 ms · 2026-05-08T03:40:30.442693+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

105 extracted references · 27 canonical work pages · 4 internal anchors

  1. [1]

    Executing task graphs using work-stealing

Kunal Agrawal, Charles E Leiserson, and Jim Sukha. Executing task graphs using work-stealing. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–12. IEEE, 2010

  2. [2]

    How we built our multi-agent research system

Anthropic. How we built our multi-agent research system. https://www.anthropic.com/engineering/multi-agent-research-system, June 2025. Anthropic Engineering Blog

  3. [3]

Can AI agents agree?

    Frédéric Berdoz, Leonardo Rugli, and Roger Wattenhofer. Can AI agents agree? arXiv preprint arXiv:2603.01213, 2026

  4. [4]

Graph of thoughts: Solving elaborate problems with Large Language Models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, 2024. doi: 10.1609/aaai.v38i16.29720

  5. [5]

Social agents: Collective intelligence improves LLM predictions

    Aanisha Bhattacharyya, Abhilekh Borah, Yaman Kumar Singla, Rajiv Ratn Shah, Changyou Chen, and Balaji Krishnamurthy. Social agents: Collective intelligence improves LLM predictions. In The Fourteenth International Conference on Learning Representations, 2026

  6. [6]

The Mythical Man-Month: Essays on Software Engineering

    Frederick P. Brooks. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, Reading, MA, 1975. ISBN 0-201-00650-2

  7. [7]

Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity

    Marcelo Cataldo, James D. Herbsleb, and Kathleen M. Carley. Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 2–11, Kaiserslautern, Germany, 2...

  8. [8]

    Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2025

  9. [9]

Melvin E. Conway. How do committees invent? Datamation, 14(4):28–31, April 1968

  10. [10]

The tail at scale

    Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2): 74–80, 2013. doi: 10.1145/2408776.2408794

  11. [11]

    MapReduce: Simplified data processing on large clusters

    Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, January 2008. doi: 10.1145/1327452.1327492

  12. [12]

Hierarchical reinforcement learning with the MAXQ value function decomposition

    Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000

  13. [13]

    A survey on in-context learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, 2024

  14. [14]

Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2024

  15. [15]

On the nature of merge conflicts: A study of 2,731 open source Java projects hosted by GitHub

    Gleiph Ghiotto, Leonardo Murta, Márcio Barros, and André van der Hoek. On the nature of merge conflicts: A study of 2,731 open source Java projects hosted by GitHub. IEEE Transactions on Software Engineering, 46(8):892–915, 2020. doi: 10.1109/TSE.2018.2871083

  16. [16]

    Planning with abstract Markov decision processes

Nakul Gopalan, Michael Littman, James MacGlashan, Shawn Squire, Stefanie Tellex, John Winder, and Lawson Wong. Planning with abstract Markov decision processes. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 27, pages 480–488, 2017

  17. [17]

    Doing more with less: Meta-reasoning and meta-learning in humans and machines

    Thomas L Griffiths, Frederick Callaway, Michael B Chang, Erin Grant, Paul M Krueger, and Falk Lieder. Doing more with less: Meta-reasoning and meta-learning in humans and machines. Current Opinion in Behavioral Sciences, 29:24–30, 2019

  18. [18]

    W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970. doi: 10.1093/biomet/57.1.97

  19. [19]

Selecting computations: Theory and applications

    Nicholas Hay, Stuart Russell, David Tolpin, and Solomon Eyal Shimony. Selecting computations: Theory and applications. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 346–355, 2012

  20. [20]

The Secret of Our Success: How Culture is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter

    Joseph Henrich. The Secret of Our Success: How Culture is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter. Princeton University Press, Princeton, NJ,

  21. [21]

An empirical study of speed and communication in globally distributed software development

    James D. Herbsleb and Audris Mockus. An empirical study of speed and communication in globally distributed software development. IEEE Transactions on Software Engineering, 29(6): 481–494, 2003. doi: 10.1109/TSE.2003.1205177

  22. [22]

    MetaGPT: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023

  23. [23]

On the resilience of LLM-based multi-agent collaboration with faulty agents

    Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. On the resilience of LLM-based multi-agent collaboration with faulty agents. arXiv preprint arXiv:2408.00989, 2024

  24. [24]

Byzantine-robust decentralized coordination of LLM agents

    Yongrae Jo and Chanik Park. Byzantine-robust decentralized coordination of LLM agents. arXiv preprint arXiv:2507.14928, 2025

  25. [25]

    A concurrent dynamic task graph

Theodore Johnson. A concurrent dynamic task graph. In 1993 International Conference on Parallel Processing (ICPP ’93), volume 2, pages 223–230. IEEE, 1993

  26. [26]

    Towards a Science of Scaling Agent Systems

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, et al. Towards a science of scaling agent systems. arXiv preprint arXiv:2512.08296, 2025

  27. [27]

Metareasoning structures, problems, and modes for multiagent systems: A survey

    Samuel T Langlois, Oghenetekevwe Akoroda, Estefany Carrillo, Jeffrey W Herrmann, Shapour Azarm, Huan Xu, and Michael Otte. Metareasoning structures, problems, and modes for multiagent systems: A survey. IEEE Access, 8:183080–183089, 2020

  28. [28]

    Agent-oriented planning in multi-agent systems

Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems. In The Thirteenth International Conference on Learning Representations, 2025

  29. [29]

    More agents is all you need

    Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. More agents is all you need. arXiv preprint arXiv:2402.05120, 2024

  30. [30]

Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  31. [31]

    Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Shuo Liu, Tianle Chen, Ryan Amiri, and Christopher Amato. Learning decentralized LLM collaboration with multi-agent actor critic. arXiv preprint arXiv:2601.21972, 2026

  32. [32]

Exploring the duality between product and organizational architectures: A test of the “mirroring” hypothesis

    Alan MacCormack, Carliss Baldwin, and John Rusnak. Exploring the duality between product and organizational architectures: A test of the “mirroring” hypothesis. Research Policy, 41(8): 1309–1324, 2012. doi: 10.1016/j.respol.2012.04.011

  33. [33]

Pregel: A system for large-scale graph processing

    Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10), pages 135–146, Indianapolis, Indiana, USA, 2010. Association for Computing Machinery...

  34. [34]

    Learning scheduling algorithms for data processing clusters

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM), pages 270–288. ACM. doi: 10.1145/3341302.3342080

  36. [36]

Language model teams as distributed systems

    Elizabeth Mieczkowski, Katherine M Collins, Ilia Sucholutsky, Natalia Vélez, and Thomas L Griffiths. Language model teams as distributed systems. arXiv preprint arXiv:2603.12229, 2026

  37. [37]

    Ray: A distributed framework for emerging AI applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577. USENIX Association, 2018

  38. [38]

Multi-agent teams hold experts back

    Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, and James Zou. Multi-agent teams hold experts back. arXiv preprint arXiv:2602.01011, 2026

  39. [39]

Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023. doi: 10.1145/3586183.3606763

  40. [40]

Guided self-scheduling: A practical scheduling scheme for parallel supercomputers

    Constantine D. Polychronopoulos and David J. Kuck. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Transactions on Computers, C-36(12): 1425–1439, December 1987. doi: 10.1109/TC.1987.5009495

  41. [41]

ChatDev: Communicative agents for software development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. ChatDev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 15174–15186, 2024

  42. [42]

    A framework for meta-level control in multi-agent systems

    Anita Raja and Victor Lesser. A framework for meta-level control in multi-agent systems. Autonomous Agents and Multi-Agent Systems, 15(2):147–196, 2007

  43. [43]

    Emergent Coordination in Multi-Agent Language Models

Christoph Riedl. Emergent coordination in multi-agent language models. arXiv preprint arXiv:2510.05174, 2025

  44. [44]

Benefits and limitations of communication in multi-agent reasoning

    Michael Rizvi-Martel, Satwik Bhattamishra, Neil Rathi, Guillaume Rabusseau, and Michael Hahn. Benefits and limitations of communication in multi-agent reasoning. arXiv preprint arXiv:2510.13903, 2025

  45. [45]

Principles of metareasoning

    Stuart Russell and Eric Wefald. Principles of metareasoning. Artificial Intelligence, 49(1-3): 361–395, 1991

  46. [46]

Exploratory experimental studies comparing online and offline programming performance

    H. Sackman, W. J. Erikson, and E. E. Grant. Exploratory experimental studies comparing online and offline programming performance. Communications of the ACM, 11(1):3–11, 1968. doi: 10.1145/362851.362858

  47. [47]

Agents of chaos

    Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, et al. Agents of chaos. arXiv preprint arXiv:2602.20021, 2026

  48. [48]

    HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023

  49. [49]

    Multiagent metareasoning through organizational design

    Jason Sleight and Edmund Durfee. Multiagent metareasoning through organizational design. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014

  50. [50]

The virtual lab of AI agents designs new SARS-CoV-2 nanobodies

    Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of AI agents designs new SARS-CoV-2 nanobodies. Nature, 646(8085):716–723, 2025

  51. [51]

Understanding and sharing intentions: The origins of cultural cognition

    Michael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding and sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28(5):675–691, 2005

  52. [52]

Performance-effective and low-complexity task scheduling for heterogeneous computing

    Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, 2002

  53. [53]

Distributed Systems

    Maarten Van Steen and Andrew S Tanenbaum. Distributed Systems. distributed-systems.net, 2023

  54. [54]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  55. [55]

Strengthening the case for pair programming

    Laurie Williams, Robert R Kessler, Ward Cunningham, and Ron Jeffries. Strengthening the case for pair programming. IEEE Software, 17(4):19–25, 2000

  56. [56]

    Task scheduling in distributed computing systems with a genetic algorithm

Sung-Ho Woo, Sung-Bong Yang, Shin-Dug Kim, and Tack-Don Han. Task scheduling in distributed computing systems with a genetic algorithm. In Proceedings High Performance Computing on the Information Superhighway (HPC Asia ’97), pages 301–305. IEEE, 1997

  57. [57]

AutoGen: Enabling next-gen LLM applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024

  58. [58]

Understanding agent scaling in LLM-based multi-agent systems via diversity

    Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, and Shangding Gu. Understanding agent scaling in LLM-based multi-agent systems via diversity. arXiv preprint arXiv:2602.03794, 2026

  59. [59]

Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023

  60. [60]

Cut the crap: An economical communication pipeline for LLM-based multi-agent systems

Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, and Tianlong Chen. Cut the crap: An economical communication pipeline for LLM-based multi-agent systems. arXiv preprint arXiv:2410.02506, 2024

  61. [61]

    Position: Science is collaborative—LLM for science should be too

Terry Jingchen Zhang, Wenyuan Jiang, Yongjin Yang, Sirui Lu, Bernhard Schölkopf, and Zhijing Jin. Position: Science is collaborative—LLM for science should be too. In ICLR 2026 Workshop on Foundation Models for Science: Real-World Impact, 2026. Oral

  62. [62]

Chain of agents: Large language models collaborating on long-context tasks

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Arik. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37:132208–132237, 2024

  63. [63]

    Least-to-most prompting enables complex reasoning in large language models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023

  64. [69]

    I can’t start B until A’s output exists

    Synthesize results and ensure successful project completion Work efficiently and delegate appropriately. Trust your teammates to handle their assignments, but provide guidance when needed. Keep communication clear and actionable. Parallelism: Teammates can self-assign from the ready queue — they do not need to wait for you. Your job is to keep the graph c...

  65. [75]

    fix-index

    Graph updates. The task graph is a living document. Use <discover_task> to add new tasks whenever: (a) A teammate reports that tests are still failing after completing their task, (b) you notice a dependency was missed or a prior task produced incorrect output, (c) the project needs a verification or integration pass that wasn’t planned upfront. Example: ...

  66. [76]

    The verifying agent will check correctness and fix any issues

    If a task is high-stakes — it is upstream of many other tasks, or its output is hard to validate later — you can request a verification pass by a second agent: <verify_task id="task-X" /> This inserts a lightweight review task into the graph that must complete before downstream tasks proceed. The verifying agent will check correctness and fix any issues
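The verification pass described above can be sketched as a graph splice: a review node is inserted between the high-stakes task and its dependents. This is a minimal illustration only; the dict layout, field names (`deps`, `status`, `owner`), and function name are assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): splicing a <verify_task> review node
# into a dict-based task graph so all downstream tasks wait on the review.

def verify_task(graph, task_id):
    """Insert a lightweight review task that gates every dependent of task_id."""
    review_id = f"verify-{task_id}"
    graph[review_id] = {"deps": [task_id], "status": "pending", "owner": None}
    for tid, node in graph.items():
        # Redirect each downstream task to depend on the review instead.
        if tid != review_id and task_id in node["deps"]:
            node["deps"] = [review_id if d == task_id else d for d in node["deps"]]
    return review_id

graph = {
    "task-A": {"deps": [], "status": "in_progress", "owner": "Dev1"},
    "task-B": {"deps": ["task-A"], "status": "pending", "owner": None},
}
verify_task(graph, "task-A")
# task-B now waits on verify-task-A, which in turn waits on task-A.
```

The splice preserves the invariant that downstream work cannot become ready until the review completes, which is exactly the gating behavior the snippet describes.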

  67. [77]

This clears the current owner and resets the task to pending

    Straggler mitigation. If a teammate has been assigned a task for several rounds without completing it, they may be stuck. Use this action to release the task back to pending so it can be reassigned: <release_task id="task-X" /> This clears the current owner and resets the task to pending. Then reassign it with <assign_task id="task-X" to="DevY" /> either ...

  68. [78]

    assigned

    If the test suite is passing but tasks are still marked "assigned" or "in_progress" (e.g. a teammate completed the work but forgot to emit <complete_task>), you can close them directly: <close_task id="task-X" /> Only use this after confirming with <run_tests /> that tests pass. This is the right action when: all tests are green, a task’s work is clearly ...
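The two recovery actions above, releasing a stuck task and closing a finished-but-unmarked one, amount to simple state transitions on a shared task record. The sketch below uses assumed field names and is illustrative only, not the paper's code; note that `close_task` is gated on a confirmed green test run, mirroring the rule in the snippet.

```python
# Illustrative sketch (assumed schema): straggler release and direct close
# as transitions on a shared task-graph record.

def release_task(graph, task_id):
    """Straggler mitigation: clear the owner and return the task to pending."""
    node = graph[task_id]
    node["owner"] = None
    node["status"] = "pending"

def close_task(graph, task_id, tests_green):
    """Close a task whose work is done but was never marked complete.
    Only permitted after the test suite has been confirmed green."""
    if not tests_green:
        raise ValueError("confirm tests pass before closing a task directly")
    graph[task_id]["status"] = "complete"

graph = {
    "task-X": {"status": "assigned", "owner": "Dev2"},      # stuck for rounds
    "task-Y": {"status": "in_progress", "owner": "Dev3"},   # done, never marked
}
release_task(graph, "task-X")                # ready to reassign to another dev
close_task(graph, "task-Y", tests_green=True)
```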

  69. [81]

    math_utils.py

    To read an existing file’s contents directly, use: <read_file path="math_utils.py" /> This returns the file contents immediately — no script needed. Always prefer this over writing a helper script to print a file. To execute a script and see its output, use: <run_script path="script.py" /> This runs the file and returns stdout/stderr to you. Use this to v...

  70. [83]

    Communicate with the team Lead when blocked or in need of clarification

  71. [84]

    I cannot start B until A’s output exists

    Complete tasks thoroughly before moving to the next one. Be proactive, collaborative, and detail-oriented. Focus on producing high-quality work. Discovering New Tasks Use <discover_task> whenever you uncover work that isn’t already in the task list. When possible, build a wide graph, not a deep one. Only use dependencies to express real implementation ord...

  72. [85]

ProductManager(Alice) translates the task description into a Product Requirements Document (PRD), user stories, and a competitive analysis

  73. [86]

3. ProjectManager(Eve) reads the system design and issues a task list to the Engineer

Architect(Bob) receives the PRD and produces a system design document, including the Python package name, file structure, and API specifications. 3. ProjectManager(Eve) reads the system design and issues a task list to the Engineer

  74. [87]

Engineer(Alex, n_borg=1) implements the assigned files sequentially, one file per action, emitting code blocks to shared memory

  75. [88]

    Agents communicate exclusively through a shared publish-subscribe message bus: each role watches a fixed set of upstream action types and acts only when a matching message arrives

QaEngineer(Edward, test_round_allowed=5) watches for Engineer output and iterates a write-test→run-code→debug-error loop up to the allowed round count. Agents communicate exclusively through a shared publish-subscribe message bus: each role watches a fixed set of upstream action types and acts only when a matching message arrives. The task decomposition...
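The fixed-topology bus described in this snippet can be sketched in a few lines: roles subscribe to upstream action types and fire only on a matching message. The class, action-type strings, and handlers below are toy assumptions for illustration, not MetaGPT's actual API.

```python
# Toy sketch of a publish-subscribe message bus with fixed subscriptions:
# each role registers for specific upstream action types and acts only
# when a matching message is published.
from collections import defaultdict

class Bus:
    def __init__(self):
        self.subs = defaultdict(list)   # action type -> list of handlers

    def watch(self, action_type, handler):
        self.subs[action_type].append(handler)

    def publish(self, action_type, payload):
        for handler in self.subs[action_type]:
            handler(payload)

bus = Bus()
log = []
# A QA-style role watches only for engineer output.
bus.watch("WriteCode", lambda msg: log.append(f"QA testing {msg}"))
bus.publish("WriteDesign", "design.md")  # no subscriber: silently ignored
bus.publish("WriteCode", "main.py")      # triggers the QA handler
```

Because subscriptions are fixed at construction time, the coordination topology cannot adapt at runtime, which is the rigidity LATTE's evolving task graph is designed to avoid.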

  76. [90]

    Break down work and strategically assign tasks to team members

  77. [91]

    Monitor progress and coordinate the team

  78. [92]

    Help unblock teammates when they face issues

  79. [93]

    Review work for quality and consistency

  80. [94]

    Trust your teammates to handle their assignments, but provide guidance when needed

    Synthesize results and ensure successful project completion Work efficiently and delegate appropriately. Trust your teammates to handle their assignments, but provide guidance when needed. Keep communication clear and actionable. Available Actions: Do NOT edit files yourself — focus on directing your team and verifying their work

Showing first 80 references.