Towards a Science of Scaling Agent Systems

Yubin Kim , Ken Gu , Chanwoo Park , Chunjong Park , Samuel Schmidgall , A. Ali Heydari , Yao Yan , Zhihan Zhang , Yuchen Zhuang , Yun Liu , Mark Malhotra , Paul Pu Liang , Hae Won Park , Yuzhe Yang , Xuhai Xu , Yilun Du , Shwetak Patel , Tim Althoff , Daniel McDuff , Xin Liu

classification 💻 cs.AI

keywords coordination, performance, systems, across, agent, centralized, model, single-agent

Agents, language model-based systems capable of reasoning, planning, and acting, are widely adopted in real-world tasks, yet how their performance changes as these systems scale across key dimensions remains underexplored. We introduce quantitative scaling principles for agent systems in the form of a predictive model that captures how performance varies with coordination, model capability, and measurable system and task factors. Across 260 configurations spanning six agentic benchmarks, five canonical architectures (Single-Agent and four Multi-Agent: Independent, Centralized, Decentralized, Hybrid), and three LLM families, we perform controlled evaluations, standardizing tools, prompts, and compute to isolate architectural effects. The resulting model achieves a cross-validated R^2=0.373 across all six benchmarks (R^2=0.413 with a task-grounded capability metric). We identify a robust capability-saturation effect and additional patterns: (1) coordination yields diminishing returns once single-agent baselines exceed a certain performance level; (2) tool-heavy tasks appear to incur multi-agent overhead; and (3) architectures without centralized verification tend to propagate errors more than those with centralized coordination. Relative performance change compared to the single-agent baseline ranges from +80.8% on decomposable financial reasoning to -70.0% on sequential planning, demonstrating that architecture-task alignment determines collaborative success. The framework identifies the best-performing architecture for 87% of held-out configurations and shows consistent relative architecture preferences on unseen frontier models. Agent effectiveness depends on alignment between coordination and task structure, and mismatched coordination degrades performance.
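
The relative performance change quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual percent-change definition against the single-agent baseline; the abstract does not spell out the formula, so treat that as an assumption. The function name `relative_change` and every score below are hypothetical, made-up values for illustration, not results from the paper.

```python
def relative_change(multi_agent_score: float, single_agent_score: float) -> float:
    """Percent change of a score relative to the single-agent baseline.

    Assumed definition: 100 * (multi - single) / single.
    """
    if single_agent_score <= 0:
        raise ValueError("single-agent baseline must be positive")
    return 100.0 * (multi_agent_score - single_agent_score) / single_agent_score


# Hypothetical scores for one task (made-up numbers, NOT results from the paper).
scores = {
    "Single-Agent": 0.40,
    "Independent": 0.30,
    "Centralized": 0.50,
    "Decentralized": 0.44,
    "Hybrid": 0.46,
}

baseline = scores["Single-Agent"]

# Relative change of each multi-agent architecture versus the baseline.
changes = {
    name: relative_change(score, baseline)
    for name, score in scores.items()
    if name != "Single-Agent"
}

# Architecture with the largest relative gain on this hypothetical task.
best = max(changes, key=changes.get)

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
print("best:", best)
```

A negative value means the architecture underperforms the single-agent baseline on that task, which is the situation the abstract describes for sequential planning.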

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...

TeamBench: Evaluating Agent Coordination under Enforced Role Separation
cs.AI 2026-05 unverdicted novelty 7.0

Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
cs.MA 2026-05 unverdicted novelty 7.0

LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.

CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
q-bio.NC 2026-04 unverdicted novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.

Trace-Level Analysis of Information Contamination in Multi-Agent Systems
cs.AI 2026-04 unverdicted novelty 6.0

Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment
cs.CL 2026-04 unverdicted novelty 6.0

TSAssistant is a human-in-the-loop multi-agent system that generates citable, evidence-grounded sections for target safety assessment reports by coordinating specialized subagents with interactive user refinement.

Evaluation-driven Scaling for Scientific Discovery
cs.LG 2026-04 unverdicted novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows
cs.MA 2026-04 unverdicted novelty 6.0

Complete cyclic subtask graphs offer a lens to measure when multi-agent revisitation aids recovery and exploration versus when it increases costs or is dominated by other bottlenecks in LLM agent workflows.

Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate
cs.MA 2026-04 unverdicted novelty 6.0

HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems
cs.MA 2026-04 unverdicted novelty 6.0

LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 conditional novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
cs.AI 2026-05 unverdicted novelty 5.0

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

When Independent Sampling Outperforms Agentic Reasoning
cs.LG 2026-05 unverdicted novelty 5.0

On Codeforces problems, independent k-shot sampling achieves better accuracy-cost and accuracy-query tradeoffs than agentic reasoning, even with prompt caching.

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment
cs.CL 2026-04 unverdicted novelty 5.0

TSAssistant is a modular, human-in-the-loop multi-agent system that generates citable, section-specific drafts for target safety assessment reports by coordinating specialized sub-agents with biomedical data sources a...

Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
cs.CR 2026-04 unverdicted novelty 5.0

Sovereign Agentic Loops decouple LLM reasoning from execution by emitting validated intents through a control plane with obfuscation and evidence chains, blocking 93% of unsafe actions in a cloud prototype while addin...

Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
cs.MA 2026-03 unverdicted novelty 5.0

LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.

Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...

The Inverse-Wisdom Law: Architectural Tribalism and the Consensus Paradox in Agentic Swarms
cs.AI 2026-04 unverdicted novelty 4.0

In kinship-dominant agent swarms, adding logical agents increases stability of erroneous trajectories, leading to logic saturation with zero internal entropy but unit factual error.

EMS: Multi-Agent Voting via Efficient Majority-then-Stopping
cs.AI 2026-04 unverdicted novelty 4.0

EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.