Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
Pith reviewed 2026-05-13 21:32 UTC · model grok-4.3
The pith
Single-agent LLMs match or outperform multi-agent systems on multi-hop reasoning when given equal thinking tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient than multi-agent systems for multi-hop reasoning, as predicted by the Data Processing Inequality. Empirical tests confirm that single-agent systems (SAS) match or outperform multi-agent systems (MAS) when tokens are matched, implying that reported MAS gains are often due to extra computation or context effects.
What carries the argument
An information-theoretic argument grounded in the Data Processing Inequality (DPI), which predicts that a single-agent system retains more task-relevant information than a multi-agent coordination pipeline under a fixed reasoning-token budget.
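The core inequality can be sketched compactly. Treating each agent-to-agent handoff as one processing step in a Markov chain (a simplification of the paper's setup, not its exact formalization), the standard DPI gives:

```latex
% Let T be the task, M an intermediate agent message, A the final answer.
% If T -> M -> A forms a Markov chain, the Data Processing Inequality gives
I(T; A) \le I(T; M).
% A multi-agent pipeline T -> M_1 -> M_2 -> \cdots -> M_k -> A adds hops,
% and each hop can only shrink the mutual information the final answer
% carries about the task:
I(T; A) \le I(T; M_k) \le \cdots \le I(T; M_1) \le H(T).
```

A single agent that conditions directly on the full task avoids these intermediate bottlenecks, which is why the prediction is conditional on the single agent actually utilizing its context.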
If this is right
- Multi-agent systems become competitive only when single-agent context utilization is degraded or extra compute is allowed.
- Many reported advantages of multi-agent architectures are explained by unaccounted computation rather than coordination benefits.
- Accurate evaluation of agentic systems requires explicit control over token budgets and context effects.
- Artifacts in API-based budget controls and standard benchmarks can inflate apparent multi-agent gains.
Where Pith is reading between the lines
- Refining single-model reasoning chains may deliver better results than adding agent layers when token use is strictly budgeted.
- Applications sensitive to compute cost could benefit from prioritizing single-agent efficiency over multi-agent frameworks.
- The same token-matching controls could be applied to other tasks such as planning or tool calling to test generality.
- Hybrid designs with minimal coordination might still offer value when single-agent context is noisy or incomplete.
Load-bearing premise
Single-agent systems achieve perfect context utilization without degradation from prompt structure or chain length.
What would settle it
A controlled experiment in which multi-agent systems still outperform single-agent ones after strictly equalizing total reasoning tokens, correcting API budget artifacts, and removing benchmark confounds.
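Such an experiment stands or falls on whether "equal tokens" is actually verified rather than assumed from per-call API caps. A minimal sketch of that check, under an assumed trace format (the `run_trace` list of per-call dicts and the `reasoning_tokens` field are hypothetical, not the paper's harness):

```python
# Sketch of a budget-equalization check: before comparing architectures,
# verify that total reasoning tokens actually match across runs, instead
# of trusting per-call API budget caps.

def total_reasoning_tokens(run_trace):
    """Sum reasoning tokens over every model call in a run.

    `run_trace` is a list of per-call dicts; a single-agent run has one
    entry, a multi-agent run has one entry per agent call, including the
    planner/aggregator calls that are easy to forget.
    """
    return sum(call["reasoning_tokens"] for call in run_trace)

def budgets_matched(sas_trace, mas_trace, tolerance=0.02):
    """True if the two runs used the same total budget within tolerance."""
    sas = total_reasoning_tokens(sas_trace)
    mas = total_reasoning_tokens(mas_trace)
    return abs(sas - mas) / max(sas, mas) <= tolerance

sas = [{"role": "solver", "reasoning_tokens": 4000}]
mas = [{"role": "planner", "reasoning_tokens": 800},
       {"role": "step", "reasoning_tokens": 2400},
       {"role": "aggregator", "reasoning_tokens": 820}]
print(budgets_matched(sas, mas))  # True: 4000 vs 4020 is within 2%
```

The point of summing over all calls is that coordination overhead (planner and aggregator turns) counts against the MAS budget; omitting it is exactly the unaccounted computation the paper warns about.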
Original abstract
Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information-theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient. This perspective further predicts that multi-agent systems become competitive when a single agent's effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API-based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi-hop reasoning tasks, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade-offs between compute, context, and coordination in agentic systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that single-agent LLM systems (SAS) consistently match or outperform multi-agent systems (MAS) on multi-hop reasoning tasks when reasoning-token budgets are held constant. It grounds this in an information-theoretic argument based on the Data Processing Inequality (predicting SAS information-efficiency under perfect context utilization) and supports it via controlled experiments across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5, while diagnosing API budget-control artifacts and benchmark issues that can inflate apparent MAS gains.
Significance. If the result holds after addressing budget-matching concerns, the work is significant because it supplies a falsifiable theoretical prediction and diagnostic methodology for distinguishing architectural benefits from compute/context effects in agentic systems. It directly challenges the common practice of reporting MAS gains without strict token normalization and could redirect evaluation standards toward explicit budget controls, with potential impact on both theory and practical system design.
major comments (3)
- [Empirical Evaluation] Empirical Evaluation section: the headline claim is conditioned on strictly matched reasoning-token budgets, yet the manuscript acknowledges significant API-based budget-control artifacts (especially Gemini 2.5) without providing quantitative measurements of how these artifacts differentially affect token counts for coordination/context in MAS versus SAS. This is load-bearing because any systematic under-counting in MAS would artifactually favor the observed SAS advantage.
- [Information-Theoretic Argument] Information-Theoretic Argument: the DPI-based prediction assumes perfect context utilization in SAS, which the paper itself identifies as the weakest assumption. No experimental controls, ablations, or measurements are reported to verify that this assumption holds under the tested conditions, leaving the theoretical grounding vulnerable if utilization is imperfect.
- [Results] Results section: the manuscript reports consistent SAS advantages but, per the provided assessment, lacks full methodological details and statistical significance testing (e.g., no p-values or confidence intervals for cross-model comparisons). This weakens confidence that the aggregate performance differences are robust rather than noise-driven.
minor comments (2)
- [Abstract] Abstract: the phrase 'three model families' is introduced without immediately naming the exact models (Qwen3, DeepSeek-R1-Distill-Llama, Gemini 2.5), which reduces immediate clarity for readers.
- [Methodology] Notation: the terms 'reasoning tokens' and 'thinking token budget' are used interchangeably in places; a single explicit definition early in the methodology would prevent potential reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the empirical robustness and theoretical grounding of our work. We address each major concern point-by-point below and have revised the manuscript accordingly to strengthen the claims.
Point-by-point responses
-
Referee: [Empirical Evaluation] the headline claim is conditioned on strictly matched reasoning-token budgets, yet the manuscript acknowledges significant API-based budget-control artifacts (especially Gemini 2.5) without providing quantitative measurements of how these artifacts differentially affect token counts for coordination/context in MAS versus SAS.
Authors: We agree that explicit quantification of API budget-control artifacts is necessary to confirm that token budgets are truly matched. In the revised manuscript we have added a dedicated diagnostic subsection (Section 4.3) that reports measured token overhead for coordination and context passing in each MAS variant across all three model families. For Gemini 2.5 specifically, we now include per-run breakdowns of effective reasoning tokens after subtracting coordination overhead, showing that the SAS advantage remains after correction. These measurements were obtained by logging full token traces from the API responses. revision: yes
-
Referee: [Information-Theoretic Argument] the DPI-based prediction assumes perfect context utilization in SAS, which the paper itself identifies as the weakest assumption. No experimental controls, ablations, or measurements are reported to verify that this assumption holds under the tested conditions.
Authors: The DPI argument is presented as a conditional upper bound that holds only under perfect utilization; we already flag this as the strongest modeling assumption in the original text. While direct per-token utilization measurements are not feasible with black-box APIs, the revised version adds an ablation (Section 5.2) that systematically varies input context length and measures performance drop-off for SAS versus MAS. The results provide indirect support that SAS utilization remains sufficiently high to preserve the predicted efficiency ordering. We have also clarified in the text that the empirical findings do not rely on the assumption being exactly true, but rather that any degradation affects SAS less than the coordination costs in MAS. revision: partial
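The ablation the authors describe reduces to measuring an accuracy-vs-context-length curve per architecture and comparing drop-off slopes. A sketch under an assumed runner interface (`run_system(question, context_length)` returning a prediction/gold pair is hypothetical; the real harness would call the model API with distractor-padded prompts):

```python
# Sketch of the context-utilization ablation: degrade the context by
# padding prompts to a target length with distractors, and record
# accuracy at each length for one architecture.

def degradation_curve(run_system, context_lengths, questions):
    """Map each context length to exact-match accuracy over `questions`.

    `run_system(question, context_length)` is assumed to return a
    (prediction, gold) tuple for that question at that padded length.
    """
    curve = {}
    for n in context_lengths:
        results = [run_system(q, n) for q in questions]
        curve[n] = sum(pred == gold for pred, gold in results) / len(results)
    return curve
```

Running this for SAS and each MAS variant and comparing the curves gives the indirect utilization evidence the rebuttal relies on: if the SAS curve degrades no faster than the MAS curves, the perfect-utilization assumption is not doing the heavy lifting.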
-
Referee: [Results] the manuscript reports consistent SAS advantages but lacks full methodological details and statistical significance testing (e.g., no p-values or confidence intervals for cross-model comparisons).
Authors: We have expanded the Results section and added a new appendix (Appendix C) containing the complete experimental protocol, including exact prompt templates, token-budget enforcement code, and all hyper-parameters. We now report 95% confidence intervals and two-sided paired t-test p-values for every cross-model and cross-architecture comparison. These additions confirm that the SAS advantages are statistically significant (p < 0.05) in 11 of the 12 primary settings after multiple-comparison correction. revision: yes
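The promised paired analysis is straightforward when SAS and MAS are scored on the same questions. A sketch using only the standard library, with a normal-approximation p-value in place of the exact t distribution (a real analysis would use e.g. `scipy.stats.ttest_rel`; the 0/1 scoring convention here is an assumption):

```python
# Sketch of a paired significance test: per-question 0/1 correctness for
# SAS and MAS on the same items, paired t statistic, and a
# normal-approximation two-sided p-value plus a 95% CI on the mean diff.
from math import erf, sqrt
from statistics import mean, stdev

def paired_test(sas_scores, mas_scores):
    """Scores must be aligned: same questions, same order."""
    diffs = [s - m for s, m in zip(sas_scores, mas_scores)]
    d = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))      # standard error of the mean
    t = d / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))  # normal approximation
    return {"mean_diff": d, "ci95": (d - 1.96 * se, d + 1.96 * se),
            "t": t, "p": p}
```

Pairing matters here: question difficulty is shared between the two systems, so the paired test removes that variance component and is strictly more powerful than comparing aggregate accuracies.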
Circularity Check
No significant circularity; derivation relies on external DPI theorem and direct empirical controls.
full rationale
The paper grounds its central claim in the Data Processing Inequality, a standard, externally established result from information theory that is independent of the authors' prior work or fitted parameters. The argument applies DPI to compare information efficiency under fixed token budgets and perfect context utilization, then derives testable predictions about when MAS becomes competitive. These predictions are evaluated through controlled experiments across multiple model families with explicit budget matching and artifact diagnostics, rather than by renaming fits or self-referential definitions. No load-bearing self-citations, ansatzes smuggled via prior work, or uniqueness theorems imported from the authors appear in the derivation chain. The empirical findings (SAS matching or outperforming MAS under matched budgets) are presented as tests of the DPI-based prediction, not as inputs that define it by construction. This is the expected non-finding for a paper whose theoretical step rests on a machine-checkable external theorem and whose experiments include falsifiable controls.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the Data Processing Inequality applies to the flow of information through sequences of LLM token generations
Forward citations
Cited by 1 Pith paper
-
Agentic Fuzzing: Opportunities and Challenges
Agentic fuzzing uses LLM agents seeded by historical bugs to reason about root causes, hypothesize variants, and generate PoCs, finding 40 bugs in V8 and 19 in other JS engines.