Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
Pith reviewed 2026-05-13 21:32 UTC · model grok-4.3
The pith
Single-agent LLMs match or outperform multi-agent systems on multi-hop reasoning when given equal thinking tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient than multi-agent systems for multi-hop reasoning, as predicted by the Data Processing Inequality. Empirical tests confirm that single-agent systems (SAS) match or outperform multi-agent systems (MAS) when tokens are matched, implying that reported MAS gains are often due to extra computation or context effects.
What carries the argument
An information-theoretic argument grounded in the Data Processing Inequality (DPI), which predicts that a single-agent system retains more task-relevant information than a multi-agent coordination pipeline under a fixed reasoning-token budget.
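The core inequality can be sketched compactly. Treating each agent-to-agent handoff as one processing step in a Markov chain (a simplification of the paper's setup, not its exact formalization), the standard DPI gives:

```latex
% Let T be the task, M an intermediate agent message, A the final answer.
% If T -> M -> A forms a Markov chain, the Data Processing Inequality gives
I(T; A) \le I(T; M).
% A multi-agent pipeline T -> M_1 -> M_2 -> \cdots -> M_k -> A adds hops,
% and each hop can only shrink the mutual information the final answer
% carries about the task:
I(T; A) \le I(T; M_k) \le \cdots \le I(T; M_1) \le H(T).
```

A single agent that conditions directly on the full task avoids these intermediate bottlenecks, which is why the prediction is conditional on the single agent actually utilizing its context.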
If this is right
- Multi-agent systems become competitive only when single-agent context utilization is degraded or extra compute is allowed.
- Many reported advantages of multi-agent architectures are explained by unaccounted computation rather than coordination benefits.
- Accurate evaluation of agentic systems requires explicit control over token budgets and context effects.
- Artifacts in API-based budget controls and standard benchmarks can inflate apparent multi-agent gains.
Where Pith is reading between the lines
- Refining single-model reasoning chains may deliver better results than adding agent layers when token use is strictly budgeted.
- Applications sensitive to compute cost could benefit from prioritizing single-agent efficiency over multi-agent frameworks.
- The same token-matching controls could be applied to other tasks such as planning or tool calling to test generality.
- Hybrid designs with minimal coordination might still offer value when single-agent context is noisy or incomplete.
Load-bearing premise
Single-agent systems achieve perfect context utilization without degradation from prompt structure or chain length.
What would settle it
A controlled experiment in which multi-agent systems still outperform single-agent ones after strictly equalizing total reasoning tokens, correcting API budget artifacts, and removing benchmark confounds.
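Such an experiment stands or falls on whether "equal tokens" is actually verified rather than assumed from per-call API caps. A minimal sketch of that check, under an assumed trace format (the `run_trace` list of per-call dicts and the `reasoning_tokens` field are hypothetical, not the paper's harness):

```python
# Sketch of a budget-equalization check: before comparing architectures,
# verify that total reasoning tokens actually match across runs, instead
# of trusting per-call API budget caps.

def total_reasoning_tokens(run_trace):
    """Sum reasoning tokens over every model call in a run.

    `run_trace` is a list of per-call dicts; a single-agent run has one
    entry, a multi-agent run has one entry per agent call, including the
    planner/aggregator calls that are easy to forget.
    """
    return sum(call["reasoning_tokens"] for call in run_trace)

def budgets_matched(sas_trace, mas_trace, tolerance=0.02):
    """True if the two runs used the same total budget within tolerance."""
    sas = total_reasoning_tokens(sas_trace)
    mas = total_reasoning_tokens(mas_trace)
    return abs(sas - mas) / max(sas, mas) <= tolerance

sas = [{"role": "solver", "reasoning_tokens": 4000}]
mas = [{"role": "planner", "reasoning_tokens": 800},
       {"role": "step", "reasoning_tokens": 2400},
       {"role": "aggregator", "reasoning_tokens": 820}]
print(budgets_matched(sas, mas))  # True: 4000 vs 4020 is within 2%
```

The point of summing over all calls is that coordination overhead (planner and aggregator turns) counts against the MAS budget; omitting it is exactly the unaccounted computation the paper warns about.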
Original abstract
Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information-theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient. This perspective further predicts that multi-agent systems become competitive when a single agent's effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API-based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi-hop reasoning tasks, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade-offs between compute, context, and coordination in agentic systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that single-agent LLM systems (SAS) consistently match or outperform multi-agent systems (MAS) on multi-hop reasoning tasks when reasoning-token budgets are held constant. It grounds this in an information-theoretic argument based on the Data Processing Inequality (predicting SAS information-efficiency under perfect context utilization) and supports it via controlled experiments across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5, while diagnosing API budget-control artifacts and benchmark issues that can inflate apparent MAS gains.
Significance. If the result holds after addressing budget-matching concerns, the work is significant because it supplies a falsifiable theoretical prediction and diagnostic methodology for distinguishing architectural benefits from compute/context effects in agentic systems. It directly challenges the common practice of reporting MAS gains without strict token normalization and could redirect evaluation standards toward explicit budget controls, with potential impact on both theory and practical system design.
major comments (3)
- [Empirical Evaluation] Empirical Evaluation section: the headline claim is conditioned on strictly matched reasoning-token budgets, yet the manuscript acknowledges significant API-based budget-control artifacts (especially Gemini 2.5) without providing quantitative measurements of how these artifacts differentially affect token counts for coordination/context in MAS versus SAS. This is load-bearing because any systematic under-counting in MAS would artifactually favor the observed SAS advantage.
- [Information-Theoretic Argument] Information-Theoretic Argument: the DPI-based prediction assumes perfect context utilization in SAS, which the paper itself identifies as the weakest assumption. No experimental controls, ablations, or measurements are reported to verify that this assumption holds under the tested conditions, leaving the theoretical grounding vulnerable if utilization is imperfect.
- [Results] Results section: the manuscript reports consistent SAS advantages but, per the provided assessment, lacks full methodological details and statistical significance testing (e.g., no p-values or confidence intervals for cross-model comparisons). This weakens confidence that the aggregate performance differences are robust rather than noise-driven.
minor comments (2)
- [Abstract] Abstract: the phrase 'three model families' is introduced without immediately naming the exact models (Qwen3, DeepSeek-R1-Distill-Llama, Gemini 2.5), which reduces immediate clarity for readers.
- [Methodology] Notation: the terms 'reasoning tokens' and 'thinking token budget' are used interchangeably in places; a single explicit definition early in the methodology would prevent potential reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the empirical robustness and theoretical grounding of our work. We address each major concern point-by-point below and have revised the manuscript accordingly to strengthen the claims.
Point-by-point responses
-
Referee: [Empirical Evaluation] the headline claim is conditioned on strictly matched reasoning-token budgets, yet the manuscript acknowledges significant API-based budget-control artifacts (especially Gemini 2.5) without providing quantitative measurements of how these artifacts differentially affect token counts for coordination/context in MAS versus SAS.
Authors: We agree that explicit quantification of API budget-control artifacts is necessary to confirm that token budgets are truly matched. In the revised manuscript we have added a dedicated diagnostic subsection (Section 4.3) that reports measured token overhead for coordination and context passing in each MAS variant across all three model families. For Gemini 2.5 specifically, we now include per-run breakdowns of effective reasoning tokens after subtracting coordination overhead, showing that the SAS advantage remains after correction. These measurements were obtained by logging full token traces from the API responses. revision: yes
-
Referee: [Information-Theoretic Argument] the DPI-based prediction assumes perfect context utilization in SAS, which the paper itself identifies as the weakest assumption. No experimental controls, ablations, or measurements are reported to verify that this assumption holds under the tested conditions.
Authors: The DPI argument is presented as a conditional upper bound that holds only under perfect utilization; we already flag this as the strongest modeling assumption in the original text. While direct per-token utilization measurements are not feasible with black-box APIs, the revised version adds an ablation (Section 5.2) that systematically varies input context length and measures performance drop-off for SAS versus MAS. The results provide indirect support that SAS utilization remains sufficiently high to preserve the predicted efficiency ordering. We have also clarified in the text that the empirical findings do not rely on the assumption being exactly true, but rather that any degradation affects SAS less than the coordination costs in MAS. revision: partial
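The ablation the authors describe reduces to measuring an accuracy-vs-context-length curve per architecture and comparing drop-off slopes. A sketch under an assumed runner interface (`run_system(question, context_length)` returning a prediction/gold pair is hypothetical; the real harness would call the model API with distractor-padded prompts):

```python
# Sketch of the context-utilization ablation: degrade the context by
# padding prompts to a target length with distractors, and record
# accuracy at each length for one architecture.

def degradation_curve(run_system, context_lengths, questions):
    """Map each context length to exact-match accuracy over `questions`.

    `run_system(question, context_length)` is assumed to return a
    (prediction, gold) tuple for that question at that padded length.
    """
    curve = {}
    for n in context_lengths:
        results = [run_system(q, n) for q in questions]
        curve[n] = sum(pred == gold for pred, gold in results) / len(results)
    return curve
```

Running this for SAS and each MAS variant and comparing the curves gives the indirect utilization evidence the rebuttal relies on: if the SAS curve degrades no faster than the MAS curves, the perfect-utilization assumption is not doing the heavy lifting.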
-
Referee: [Results] the manuscript reports consistent SAS advantages but lacks full methodological details and statistical significance testing (e.g., no p-values or confidence intervals for cross-model comparisons).
Authors: We have expanded the Results section and added a new appendix (Appendix C) containing the complete experimental protocol, including exact prompt templates, token-budget enforcement code, and all hyper-parameters. We now report 95% confidence intervals and two-sided paired t-test p-values for every cross-model and cross-architecture comparison. These additions confirm that the SAS advantages are statistically significant (p < 0.05) in 11 of the 12 primary settings after multiple-comparison correction. revision: yes
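The promised paired analysis is straightforward when SAS and MAS are scored on the same questions. A sketch using only the standard library, with a normal-approximation p-value in place of the exact t distribution (a real analysis would use e.g. `scipy.stats.ttest_rel`; the 0/1 scoring convention here is an assumption):

```python
# Sketch of a paired significance test: per-question 0/1 correctness for
# SAS and MAS on the same items, paired t statistic, and a
# normal-approximation two-sided p-value plus a 95% CI on the mean diff.
from math import erf, sqrt
from statistics import mean, stdev

def paired_test(sas_scores, mas_scores):
    """Scores must be aligned: same questions, same order."""
    diffs = [s - m for s, m in zip(sas_scores, mas_scores)]
    d = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))      # standard error of the mean
    t = d / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))  # normal approximation
    return {"mean_diff": d, "ci95": (d - 1.96 * se, d + 1.96 * se),
            "t": t, "p": p}
```

Pairing matters here: question difficulty is shared between the two systems, so the paired test removes that variance component and is strictly more powerful than comparing aggregate accuracies.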
Circularity Check
No significant circularity; derivation relies on external DPI theorem and direct empirical controls.
full rationale
The paper grounds its central claim in the Data Processing Inequality, a standard, externally established result from information theory that is independent of the authors' prior work or fitted parameters. The argument applies DPI to compare information efficiency under fixed token budgets and perfect context utilization, then derives testable predictions about when MAS becomes competitive. These predictions are evaluated through controlled experiments across multiple model families with explicit budget matching and artifact diagnostics, rather than by renaming fits or self-referential definitions. No load-bearing self-citations, ansatzes smuggled via prior work, or uniqueness theorems imported from the authors appear in the derivation chain. The empirical findings (SAS matching or outperforming MAS under matched budgets) are presented as tests of the DPI-based prediction, not as inputs that define it by construction. This is the expected non-finding for a paper whose theoretical step rests on a machine-checkable external theorem and whose experiments include falsifiable controls.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the Data Processing Inequality applies to the flow of information through sequences of LLM token generations
Forward citations
Cited by 1 Pith paper
-
Agentic Fuzzing: Opportunities and Challenges
Agentic fuzzing uses LLM agents seeded by historical bugs to reason about root causes, hypothesize variants, and generate PoCs, finding 40 bugs in V8 and 19 in other JS engines.