Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
Pith reviewed 2026-05-16 18:45 UTC · model grok-4.3
The pith
An orchestrator-led multi-agent system achieves statistically competitive performance on three complex agent benchmarks without task-specific modifications to how agents operate or collaborate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Magentic-One is a generalist multi-agent system featuring an Orchestrator that plans and replans tasks while directing specialized agents to execute subtasks involving web browsers, local files, and Python code. It attains statistically competitive performance with existing state-of-the-art systems on the GAIA, AssistantBench, and WebArena benchmarks without any modifications to how the agents operate or interact. The design demonstrates progress toward generalist agentic systems by maintaining effectiveness across varied scenarios through modularity rather than specialization.
What carries the argument
The Orchestrator agent, which plans tasks, tracks progress, recovers from errors through replanning, and delegates to specialized agents for web operation, file navigation, and code writing and execution.
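The plan-delegate-replan loop described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Magentic-One's actual API: the class names, the `(success, result)` return convention, and the retry-as-replanning shortcut are all hypothetical.

```python
# Minimal sketch of an orchestrator-style control loop (hypothetical API,
# not Magentic-One's actual implementation).

class Agent:
    """A specialized worker that attempts one subtask and reports success."""
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler

    def run(self, subtask):
        return self.handler(subtask)  # returns (success: bool, result)


class Orchestrator:
    """Plans subtasks, delegates to agents, and replans on failure."""
    def __init__(self, agents, max_replans=2):
        self.agents = {a.name: a for a in agents}  # agents are pluggable by name
        self.max_replans = max_replans

    def plan(self, task):
        # A real system would ask an LLM to decompose the task; here we
        # assume the task is already a list of (agent_name, subtask) pairs.
        return list(task)

    def solve(self, task):
        plan, replans, results = self.plan(task), 0, []
        while plan:
            agent_name, subtask = plan.pop(0)
            ok, result = self.agents[agent_name].run(subtask)
            if ok:
                results.append(result)
            elif replans < self.max_replans:
                replans += 1
                plan.insert(0, (agent_name, subtask))  # retry as a stand-in for replanning
            else:
                return None  # budget exhausted: task failed
        return results
```

Note how the agent dictionary makes the modularity claim concrete: adding or removing an `Agent` changes only the team roster, not the control loop.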
If this is right
- The system integrates web, file, and code capabilities into unified task solving.
- Competitive benchmark results hold without task-specific agent adjustments.
- Agents can be swapped in or out to extend functionality without retraining.
- Rigorous evaluation is supported by AutoGenBench with controls for repetition and isolation.
Where Pith is reading between the lines
- Such systems may accelerate development of AI for new applications by allowing plug-in agents for specific domains.
- Real-world deployment could benefit from the error-recovery mechanisms in long-running tasks.
- Testing on additional benchmarks involving physical or multi-modal actions would reveal scalability limits.
Load-bearing premise
The modular multi-agent design with an orchestrator allows agents to be added or removed without additional prompt tuning or training while maintaining performance across tasks.
What would settle it
A demonstration that adding or removing a specialized agent requires prompt tuning or training to preserve competitive performance on the benchmarks would falsify the modularity claim.
read the original abstract
Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover from errors, to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic-One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic-One's modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open-source implementation of Magentic-One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner -- which is important when agents' actions have side-effects. Magentic-One, AutoGenBench and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis are available at https://aka.ms/magentic-one
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Magentic-One, a multi-agent system with an Orchestrator that plans, tracks progress, and directs specialized agents for web browsing, file navigation, and Python code execution. It reports statistically competitive performance to the state-of-the-art on the GAIA, AssistantBench, and WebArena benchmarks without core modifications to agent capabilities or collaboration mechanisms, while emphasizing a modular design that permits adding or removing agents without additional prompt tuning or training. The work also releases an open-source implementation and AutoGenBench, a tool for controlled, isolated agentic evaluations with built-in repetition support.
Significance. If the benchmark results are statistically robust, the work advances generalist agentic systems by demonstrating a flexible, extensible multi-agent architecture that maintains performance across diverse tasks. The open-source release and AutoGenBench tool strengthen reproducibility and provide a practical evaluation framework for future agent research.
major comments (2)
- [§4 (Experimental Results)] The abstract and introduction claim 'statistically competitive performance' on GAIA, AssistantBench, and WebArena, yet the reported results lack explicit error bars, the number of independent runs, and the precise statistical tests (e.g., paired t-tests or Wilcoxon) used for baseline comparisons. This information is load-bearing for the central performance claim and must be added with tables showing means, standard deviations, and p-values.
- [§3.2 (Modular Architecture)] The claim that agents can be added or removed 'without additional prompt tuning or training' while preserving performance requires an explicit ablation (e.g., Table X or Figure Y) that measures end-to-end success rates before and after team modifications on at least one benchmark. Without this, the generality assertion rests on architectural description rather than empirical evidence.
minor comments (3)
- [§1 (Introduction)] Add direct citations to the original GAIA, AssistantBench, and WebArena papers when first describing each benchmark.
- [Figure 1 (System Overview)] The diagram should include a legend clarifying the direction of control messages versus data flow between the Orchestrator and specialized agents.
- [AutoGenBench description] Specify the exact isolation mechanisms (e.g., containerization or sandboxing) used to prevent side-effects during repeated benchmark runs.
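The repetition-and-isolation pattern the last comment asks about can be approximated in a few lines. A real harness like AutoGenBench would typically use stronger mechanisms such as containerization; the sketch below (helper names are illustrative, not AutoGenBench's API) only gives each trial a fresh working directory and a timeout, which already contains file-system side-effects between repeated runs.

```python
# Sketch of repetition-with-isolation for agentic benchmark runs.
# Illustrative only; function names are hypothetical, and per-trial temp
# directories are a weaker stand-in for container-level isolation.
import subprocess
import sys
import tempfile

def run_isolated(code, timeout=30):
    """Run one trial in a fresh temp directory so file side-effects don't leak."""
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir, capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout

def run_benchmark(code, repetitions=3):
    """Repeat independent trials and report the success rate."""
    outcomes = [run_isolated(code)[0] for _ in range(repetitions)]
    return sum(outcomes) / repetitions
```

Because each trial starts in its own temporary directory, a task that writes files cannot contaminate the next repetition, which is the property the referee wants made explicit.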
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will revise the manuscript to incorporate the requested details and evidence.
read point-by-point responses
-
Referee [§4 (Experimental Results)]: The abstract and introduction claim 'statistically competitive performance' on GAIA, AssistantBench, and WebArena, yet the reported results lack explicit error bars, the number of independent runs, and the precise statistical tests (e.g., paired t-tests or Wilcoxon) used for baseline comparisons. This information is load-bearing for the central performance claim and must be added with tables showing means, standard deviations, and p-values.
Authors: We agree that explicit statistical details are needed to fully support the performance claims. The experiments underlying our results were conducted with 3 independent runs per benchmark to capture variability. In the revised manuscript, we will update Section 4 with tables that report means and standard deviations, include error bars in figures, explicitly state the number of runs, and provide the outcomes of paired t-tests (including p-values) for comparisons against baselines. Revision: yes.
-
Referee [§3.2 (Modular Architecture)]: The claim that agents can be added or removed 'without additional prompt tuning or training' while preserving performance requires an explicit ablation (e.g., Table X or Figure Y) that measures end-to-end success rates before and after team modifications on at least one benchmark. Without this, the generality assertion rests on architectural description rather than empirical evidence.
Authors: We appreciate the suggestion to strengthen the empirical support for modularity. While the current results use the full agent team and the architecture is designed for easy addition/removal without prompt changes, we acknowledge that a dedicated ablation would provide clearer evidence. In the revision, we will add an ablation study (new table in Section 3.2) reporting GAIA success rates for the full team versus modified teams (e.g., without the CodeExecutor or WebSurfer agent), with no prompt or training adjustments. Revision: yes.
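The paired per-task comparison discussed in these exchanges can be illustrated with a small standard-library sketch. It uses an exact two-sided sign test over paired 0/1 task outcomes as a stand-in; an actual revision would more likely use the paired t-tests or Wilcoxon tests the referee names (e.g., via scipy.stats), and nothing here reflects the paper's real data.

```python
# Hedged sketch: paired comparison of two agents' per-task success lists
# (0/1 per task) via an exact two-sided sign test. Stdlib only; a stand-in
# for the paired t-test / Wilcoxon analysis requested in the review.
from math import comb

def sign_test(outcomes_a, outcomes_b):
    """Exact two-sided sign test on paired binary outcomes; returns a p-value."""
    wins_a = sum(a > b for a, b in zip(outcomes_a, outcomes_b))
    wins_b = sum(b > a for a, b in zip(outcomes_a, outcomes_b))
    n = wins_a + wins_b  # ties carry no information and are dropped
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Two-sided tail probability under Binomial(n, 0.5)
    p_one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_one_tail)
```

With only a handful of tasks the test is underpowered, which is exactly why the referee asks for the number of runs and per-task variability to be reported alongside any p-value.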
Circularity Check
No significant circularity detected
full rationale
The paper presents an empirical system description and benchmark evaluation rather than any mathematical derivation chain. Performance claims rest on direct measurements against external public benchmarks (GAIA, AssistantBench, WebArena) using the provided AutoGenBench tool for controlled repetition; no equations, fitted parameters, or predictions are derived from internal data. The modular orchestrator architecture is described at the implementation level without self-definitional reductions, self-citation load-bearing premises, or ansatz smuggling. All load-bearing steps are externally falsifiable via the open-source release and benchmark results.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
LedgerCanonicality · ZeroParameterComparisonLedger (tagged: unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
"Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code."
-
HierarchyEmergence · hierarchy_emergence_forces_phi (tagged: unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
"Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
State-Centric Decision Process
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
-
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5...
-
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
-
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications
BONSAI introduces a four-layer architecture and four-phase workflow for human-AI co-development of visual analytics applications, shown in case studies to enable efficient novel tool creation and reconstruction from p...
-
Human-Guided Harm Recovery for Computer Use Agents
Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, reward model, and BackBench benchmark that shows improved recovery trajectories.
-
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
-
MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning
A single transformer model trained offline on expert trajectories from three distinct MARL environments achieves competitive performance against specialized baselines without per-task tuning.
-
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
-
Don't Trust Your Upstream: Exploiting LLM Multi-Agent System via Topology-Guided Adversarial Propagation
A topology-aware attack propagates adversarial contamination across LLM multi-agent systems to achieve 40-85% success rates on frameworks and real applications, revealing overlooked vulnerabilities.
-
Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.
-
Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System
AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.
Reference graph
Works this paper leans on
-
[1]
T. Abuelsaad, D. Akkil, P. Dey, A. Jagmohan, A. Vempaty, and R. Kokku. Agent-e: From autonomous web navigation to foundational design principles in agentic systems, 2024
work page 2024
-
[2]
BabyAGI. Github — babyagi. https://github.com/yoheinakajima/babyagi, 2023
work page 2023
-
[3]
R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, Y. Lu, J. Wagle, K. Koishida, A. Bucker, L. Jang, and Z. Hui. Windows agent arena: Evaluating multi-modal os agents at scale, 2024
work page 2024
-
[4]
R. Cao, F. Lei, H. Wu, J. Chen, Y. Fu, H. Gao, X. Xiong, H. Zhang, Y. Mao, W. Hu, T. Xie, H. Xu, D. Zhang, S. Wang, R. Sun, P. Yin, C. Xiong, A. Ni, Q. Liu, V. Zhong, L. Chen, K. Yu, and T. Yu. Spider2-v: How far are multimodal agents from automating data science and engineering workflows?, 2024
work page 2024
-
[5]
Z. Chen, M. White, R. Mooney, A. Payani, Y. Su, and H. Sun. When is tree search useful for llm planning? it depends on the discriminator, 2024
work page 2024
-
[6]
Y. Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F. Yin, J. Zhao, et al. Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv preprint arXiv:2401.03428, 2024
- [7]
-
[8]
X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2web: Towards a generalist agent for the web, 2023
work page 2023
-
[9]
V. Dibia, A. Fourney, G. Bansal, F. Poursabzi-Sangdeh, H. Liu, and S. Amershi. Aligning offline metrics and human judgments of value for code generation models. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023 , pages 8516–8528, Toronto, Canada, July 2023. Association for Computatio...
work page 2023
- [10]
-
[11]
Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
B. J. Grosz and S. Kraus. The evolution of sharedplans. In Proceedings of the International Conference on Multi-Agent Systems , 1999
work page 1999
-
[13]
T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024
work page 2024
-
[15]
S. Hong, X. Zheng, J. Chen, Y. Cheng, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
N. R. Jennings and M. Wooldridge. Applications of intelligent agents. In Proceedings of the International Conference on Autonomous Agents , 1998
work page 1998
-
[17]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024
work page 2024
- [18]
-
[19]
J. Y. Koh, S. McAleer, D. Fried, and R. Salakhutdinov. Tree search for language model agents, 2024
work page 2024
- [20]
-
[21]
G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society, 2023
work page 2023
- [22]
-
[23]
J. Liu, Y. Song, B. Y. Lin, W. Lam, G. Neubig, Y. Li, and X. Yue. Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding?, 2024
work page 2024
-
[24]
N. Liu, L. Chen, X. Tian, W. Zou, K. Chen, and M. Cui. From llm to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv e-prints, pages arXiv–2401, 2024
work page 2024
- [25]
-
[26]
C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
T. Masterman, S. Besen, M. Sawtell, and A. Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey. arXiv preprint arXiv:2404.11584, 2024
work page internal anchor Pith review arXiv 2024
-
[28]
B. Messing. An introduction to multiagent systems. Künstliche Intell., 17:58–, 2002
work page 2002
- [29]
-
[30]
GAIA: a benchmark for General AI Assistants
G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
WebGPT: Browser-assisted question-answering with human feedback
R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 , 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[32]
N. J. Nilsson. Stuart russell and peter norvig, artificial intelligence: A modern approach. Artificial Intelligence, 82:369–380, 1996.
work page 1996
- [33]
-
[34]
J. Pan, Y. Zhang, N. Tomlin, Y. Zhou, S. Levine, and A. Suhr. Autonomous evaluation and refinement of digital agents, 2024
work page 2024
-
[35]
Y. Pan, D. Kong, S. Zhou, C. Cui, Y. Leng, B. Jiang, H. Liu, Y. Shang, S. Zhou, T. Wu, and Z. Wu. Webcanvas: Benchmarking web agents in online environments, 2024
work page 2024
-
[36]
ART: Automatic multi-step reasoning and tool-use for large language models
B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023
work page internal anchor Pith review arXiv 2023
-
[37]
J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023
work page 2023
-
[38]
D. Paul, M. Ismayilzada, M. Peyrard, B. Borges, A. Bosselut, R. West, and B. Faltings. REFINER: Reasoning feedback on intermediate representations. In Y. Graham and M. Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1100–1126, St. Julian's, Malta...
work page 2024
- [39]
-
[40]
Y. Qin, S. Hu, Y. Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, Y. Huang, C. Xiao, C. Han, Y. R. Fung, Y. Su, H. Wang, C. Qian, R. Tian, K. Zhu, S. Liang, X. Shen, B. Xu, Z. Zhang, Y. Ye, B. Li, Z. Tang, J. Yi, Y. Zhu, Z. Dai, L. Yan, X. Cong, Y. Lu, W. Zhao, Y. Huang, J. Yan, X. Han, X. Sun, D. Li, J. Phang, C. Yang, T. Wu, H. Ji, Z. Liu, and M. Sun. Tool lear...
work page 2023
-
[41]
Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023
work page 2023
-
[42]
Trase tops gaia leaderboard, 2024
Red Cell Partners. Trase tops gaia leaderboard, 2024
work page 2024
-
[43]
S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
M. Russinovich, A. Salem, and R. Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack. arXiv preprint arXiv:2404.01833 , 2024
- [45]
- [46]
-
[47]
T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning. PMLR, 2017
work page 2017
- [48]
- [49]
-
[50]
Y. Song, D. Yin, X. Yue, J. Huang, S. Li, and B. Y. Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024
work page 2024
-
[51]
P. Stone and M. Veloso. Multiagent systems: A survey from a machine learning perspective. Auton. Robots, 8(3):345–383, June 2000
work page 2000
-
[52]
Y. Talebirad and A. Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents, 2023
work page 2023
-
[53]
M. Tambe. Implementing agent teams in dynamic multiagent environments. Appl. Artif. Intell., 12:189–210, 1998
work page 1998
-
[54]
L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Opendevin: An open platform for ai software developers as generalist agents, 2024
work page 2024
-
[56]
Y. Wang, T. Shen, L. Liu, and J. Xie. Sibyl: Simple yet effective agent framework for complex real-world reasoning, 2024
work page 2024
-
[57]
Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory, 2024
work page 2024
-
[58]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 , 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[59]
M. Wooldridge and N. R. Jennings. Intelligent agents: theory and practice. The Knowledge Engineering Review, 10:115 – 152, 1995
work page 1995
-
[60]
Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. In COLM, 2024
work page 2024
-
[61]
Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, and L. Kong. Os-copilot: Towards generalist computer agents with self-improvement, 2024
work page 2024
-
[62]
Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui. The rise and potential of large language model based agents: A survey, 2023
work page 2023
-
[63]
C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[64]
T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024
work page 2024
-
[65]
H. Yang, S. Yue, and Y. He. Auto-gpt for online decision making: Benchmarks and additional opinions, 2023.
work page 2023
-
[66]
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[68]
S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023
work page 2023
-
[69]
S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023
work page 2023
-
[70]
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023
work page 2023
- [71]
-
[72]
A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang. Agenttuning: Enabling generalized agent abilities for llms, 2023
work page 2023
- [73]
- [74]
- [75]
- [76]
-
[77]
Z. Zhang and A. Zhang. You only look at screens: Multimodal chain-of-action agents, 2024
work page 2024
- [78]
-
[79]
S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig. Webarena: A realistic web environment for building autonomous agents, 2024
work page 2024