How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent
Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3
The pith
Declarative planning in the agent harness accounts for most performance gains, leaving the LLM with only a residual role.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 54 games, declarative planning carries the heavy lifting (+24.1pp win rate over a belief-only harness, zero LLM calls); symbolic reflection is mechanistically real but calibration-sensitive, with signed board-level effects up to ±0.140 F1 that cancel on aggregate; and LLM-backed revision activates on only 4.3% of turns with a bounded, non-monotonic effect. The contribution is methodological: once harness layers are made externally measurable, the LLM's role can be quantified as residual rather than assumed central.
What carries the argument
Four progressively richer harness layers (posterior belief tracking, declarative planning, symbolic reflection, and LLM-backed revision gate) isolated under a common runtime so that each layer's marginal contribution to win rate can be measured by ablation.
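The ablation bookkeeping this implies can be sketched as follows. The layer names follow the paper; all win rates except the +24.1pp planning delta are illustrative placeholders, not the paper's reported numbers:

```python
# Progressive-ablation sketch: each configuration adds one layer on top
# of the previous one, and a layer's marginal contribution is the
# win-rate delta it produces. Numbers are invented except for the
# +24.1pp planning delta quoted in the abstract.

LAYERS = ["belief", "planning", "reflection", "revision"]

# Hypothetical win rates per progressively richer configuration,
# measured over the same 54-game suite under a common runtime.
win_rate = {
    ("belief",): 0.333,
    ("belief", "planning"): 0.574,                          # +24.1pp
    ("belief", "planning", "reflection"): 0.574,            # cancels on aggregate
    ("belief", "planning", "reflection", "revision"): 0.593,
}

def marginal_contributions(win_rate, layers):
    """Win-rate delta contributed by each layer when added in order."""
    deltas = {}
    for i in range(1, len(layers)):
        prev, curr = tuple(layers[:i]), tuple(layers[:i + 1])
        deltas[layers[i]] = win_rate[curr] - win_rate[prev]
    return deltas

deltas = marginal_contributions(win_rate, LAYERS)
# Pre-specified "heavy lifting": the single largest positive marginal.
heavy_lifter = max(deltas, key=deltas.get)
```

Under these placeholder numbers, `heavy_lifter` is the planning layer, mirroring the paper's designation.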
If this is right
- Declarative planning alone can produce the largest share of an agent's planning competence without any language-model calls.
- Symbolic reflection mechanisms affect individual boards but require calibration so their signed effects do not cancel in aggregate results.
- LLM-backed revision is invoked on only a small fraction of turns and exerts only bounded, non-monotonic influence.
- Performance gaps between agent configurations can be attributed to specific harness components rather than inferred solely from end-to-end scores.
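The cancellation claim in the reflection bullet is easy to see with a toy example (invented numbers, not the paper's data): per-board F1 deltas as large as ±0.140 can average to roughly zero even though every individual board is genuinely affected.

```python
# Toy illustration of signed board-level effects cancelling in aggregate.
# The deltas are invented; only the ±0.140 bound comes from the abstract.
f1_delta = [0.140, -0.140, 0.090, -0.090, 0.020, -0.020]

mean_effect = sum(f1_delta) / len(f1_delta)                      # aggregate ~0
mean_magnitude = sum(abs(d) for d in f1_delta) / len(f1_delta)   # ~0.083
```

The aggregate mean hides the mechanism; only the mean magnitude (or a signed per-board breakdown) reveals that reflection is doing real, calibration-sensitive work.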
Where Pith is reading between the lines
- The layer-isolation technique could be reused on other planning domains to test whether language models add value beyond what symbolic components already supply.
- If unmeasured nonlinear interactions exist among layers, sequential ablation may over- or under-state true marginal contributions.
- Agent builders might first strengthen symbolic planning layers before scaling language-model usage.
Load-bearing premise
The four harness layers can be isolated and ablated without hidden interactions that would change the measured marginal contributions to win rate.
What would settle it
Re-running the full ablation suite on a new set of 54 games and finding that removing the declarative planning layer no longer drops win rate by approximately 24 points would falsify the claim that it carries the heavy lifting.
Original abstract
Agent harnesses -- the stateful programs that wrap a language model and decide what it sees at each step -- are now known to change end-to-end performance on a fixed model by as much as six times. That raises a question asked less often than it should be: how much of an agent's competence does the harness itself already carry, and how much genuinely still needs the LLM? We externalize a planning harness for noisy Collaborative Battleship into four progressively richer layers -- posterior belief tracking, declarative planning, symbolic reflection, and an LLM-backed revision gate -- under a common runtime, taking win rate as the primary metric and F1 as secondary, and pre-specifying heavy lifting as the single largest positive marginal to the primary metric. Across 54 games, declarative planning carries the heavy lifting (+24.1pp win rate over a belief-only harness, zero LLM calls); symbolic reflection is mechanistically real but calibration-sensitive, with signed board-level effects up to ±0.140 F1 that cancel on aggregate; and LLM-backed revision activates on only 4.3% of turns with a bounded, non-monotonic effect. The contribution is methodological: once harness layers are made externally measurable, the LLM's role can be quantified as residual rather than assumed central.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper externalizes a planning harness for noisy Collaborative Battleship into four progressively richer layers (posterior belief tracking, declarative planning, symbolic reflection, LLM-backed revision gate) under a common runtime. It pre-specifies win rate as the primary metric and 'heavy lifting' as the single largest positive marginal gain to that metric. Across 54 games, it reports that declarative planning supplies the heavy lifting (+24.1pp win rate over a belief-only harness with zero LLM calls), that symbolic reflection produces signed board-level F1 effects up to ±0.140 that cancel in aggregate, and that the LLM revision gate activates on only 4.3% of turns with bounded non-monotonic impact. The methodological contribution is to render harness layers externally measurable so that the LLM's role can be treated as residual.
Significance. If the ablation results hold, the work supplies a concrete, replicable method for partitioning agent performance between harness and LLM, with direct implications for agent architecture and evaluation. The pre-specification of the primary metric and the zero-LLM baseline are strengths that make the +24.1pp claim falsifiable and comparable across future studies.
major comments (2)
- [Abstract / Results (progressive ablation)] The central claim that declarative planning supplies the isolated +24.1pp marginal rests on the assumption that the four layers can be added sequentially without non-additive interactions. The abstract describes the layers as 'externalized' and 'progressively richer' but does not report a crossed factorial design, alternative addition orders, or an explicit test for belief-planning interaction terms. Without such evidence, the reported marginal cannot be unambiguously attributed to declarative planning alone rather than to synergies with the belief representation used in the belief-only condition.
- [Abstract / Experimental results] The manuscript reports concrete win-rate deltas and an F1 secondary metric but supplies no error bars, confidence intervals, or per-game variance for the 54-game sample. This omission prevents verification that the +24.1pp difference is statistically distinguishable from zero and undermines the pre-specified 'heavy lifting' designation.
minor comments (2)
- [Abstract] The abstract contains typographical artifacts ('pla nning', 'reflec tion', 'ac tivates') that should be corrected for readability.
- [Results] The description of the LLM revision gate's activation rate (4.3%) and its 'bounded, non-monotonic effect' would benefit from a brief table or figure showing the distribution of activation contexts and the signed win-rate deltas conditional on activation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly where feasible.
Point-by-point responses
Referee: [Abstract / Results (progressive ablation)] The central claim that declarative planning supplies the isolated +24.1pp marginal rests on the assumption that the four layers can be added sequentially without non-additive interactions. The abstract describes the layers as 'externalized' and 'progressively richer' but does not report a crossed factorial design, alternative addition orders, or an explicit test for belief-planning interaction terms. Without such evidence, the reported marginal cannot be unambiguously attributed to declarative planning alone rather than to synergies with the belief representation used in the belief-only condition.
Authors: We acknowledge that a crossed factorial design would offer stronger protection against undetected interactions. Our progressive ablation follows the conventional approach for isolating marginal contributions in component-wise studies and directly implements the pre-specified definition of 'heavy lifting' as the single largest positive marginal on the primary metric. The belief-only baseline contains zero LLM calls by construction, and the large observed gain from adding declarative planning is consistent with its role as the dominant driver. We have added an explicit limitations paragraph in the revised Discussion section stating the additivity assumption, reporting that no evidence of strong non-additivity was observed in the sequential results, and recommending factorial follow-up experiments for future work. revision: partial
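The factorial follow-up the response recommends amounts to estimating an interaction term directly. A minimal 2x2 sketch over the belief and planning layers, with invented win rates for illustration:

```python
# Hypothetical 2x2 factorial check for a belief-planning interaction,
# of the kind the referee requests. All win rates are invented.

# win_rate[(belief_on, planning_on)] over a common game suite
win_rate = {
    (0, 0): 0.10,
    (1, 0): 0.333,
    (0, 1): 0.20,
    (1, 1): 0.574,
}

# Main effects: each layer's delta averaged over both states of the other.
belief_main = ((win_rate[1, 0] - win_rate[0, 0])
               + (win_rate[1, 1] - win_rate[0, 1])) / 2
planning_main = ((win_rate[0, 1] - win_rate[0, 0])
                 + (win_rate[1, 1] - win_rate[1, 0])) / 2

# Interaction: how much planning's delta depends on belief being present.
interaction = ((win_rate[1, 1] - win_rate[1, 0])
               - (win_rate[0, 1] - win_rate[0, 0]))
```

A nonzero `interaction` would mean the +24.1pp planning marginal is partly a synergy with the belief representation, which is exactly what the sequential design cannot rule out.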
Referee: [Abstract / Experimental results] The manuscript reports concrete win-rate deltas and an F1 secondary metric but supplies no error bars, confidence intervals, or per-game variance for the 54-game sample. This omission prevents verification that the +24.1pp difference is statistically distinguishable from zero and undermines the pre-specified 'heavy lifting' designation.
Authors: We agree that uncertainty quantification is required to support the statistical claims. The revised manuscript now includes bootstrap 95% confidence intervals for all win-rate and F1 deltas, together with the per-game standard deviation. The +24.1pp difference has a confidence interval excluding zero, preserving its designation as heavy lifting. These statistics appear in the abstract, results tables, and methods. revision: yes
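The bootstrap procedure the authors describe can be sketched as below. The per-game outcome vectors are invented to match the reported rates (18/54 vs. 31/54 wins, i.e. a +24.1pp delta), not the actual game logs.

```python
# Percentile-bootstrap 95% CI for a win-rate difference between two
# 54-game conditions. Outcome vectors are illustrative, not real data.
import random

random.seed(0)
belief_only = [1] * 18 + [0] * 36     # 33.3% win rate (illustrative)
with_planning = [1] * 31 + [0] * 23   # 57.4% win rate (illustrative)

def bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    deltas = []
    for _ in range(n_boot):
        ra = random.choices(a, k=len(a))   # resample with replacement
        rb = random.choices(b, k=len(b))
        deltas.append(sum(rb) / len(rb) - sum(ra) / len(ra))
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(belief_only, with_planning)
# An interval excluding zero supports the "heavy lifting" designation.
```

With these illustrative samples the interval sits well above zero, consistent with the revision's claim.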
Circularity Check
No circularity: empirical win-rate measurements are direct experimental outputs
Full rationale
The paper performs an empirical ablation study on four harness layers for a Collaborative Battleship agent, reporting win rates and F1 scores across 54 games. 'Heavy lifting' is explicitly pre-specified as the single largest positive marginal win-rate gain; the +24.1pp figure is the observed difference between belief-only and belief+declarative-planning conditions. No equations, fitted parameters, predictions, or first-principles derivations exist in the described chain. No self-citations are invoked to justify uniqueness or load-bearing premises. The measurements are obtained under a common runtime with externalized layers, making the reported deltas independent experimental results rather than reductions to inputs by construction. The design is self-contained against the stated metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: win rate is the appropriate primary metric for measuring agent competence in this noisy game.
Reference graph
Works this paper leans on
- [1] Anthropic. 2025. Claude Code: An agentic coding tool. https://www.anthropic.com/claude-code
- [2] Anthropic and community contributors. [n.d.]. agentskills/agentskills. GitHub repository. https://github.com/agentskills/agentskills. Specification and documentation for Agent Skills.
- [3] Birgitta Böckeler. 2026. Harness Engineering. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html. martinfowler.com.
- [4] Can Bölük. 2026. I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed. https://blog.can.ac/2026/02/12/the-harness-problem/
- [5] Michael T. Cox, Dana Dannenhauer, Donald Perlis, Stuart C. Shapiro, and Murugesan Sundaram. 2016. MIDCA: A Metacognitive Integrated Dual-Cycle Architecture. In AAAI Workshop on Metacognitive Machine Learning.
- [6] Gabriel Grand et al. 2025. Collaborative Battleship with LLMs as Bayesian Experimental Designers. Preprint.
- [7] Nick Haber, Damian Mrowca, Stephanie Wang, Li Fei-Fei, and Daniel L. K. Yamins. 2018. Learning to Play with Intrinsically-Motivated, Self-Aware Agents. In Advances in Neural Information Processing Systems (NeurIPS).
- [8] Shengran Hu, Cong Lu, and Jeff Clune. 2025. Automated Design of Agentic Systems. In International Conference on Learning Representations (ICLR).
- [9] KRAFTON AI and Ludo Robotics. 2026. Terminus-KIRA: Boosting Frontier Model Performance on Terminal-Bench with Minimal Harness. https://github.com/krafton-ai/kira
- [10] John Laird, Allen Newell, and Paul Rosenbloom. 1990. Soar: An Architecture for General Intelligence. Artificial Intelligence 33, 1 (1990), 1–64.
- [11] Bo Liu et al. 2023. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. Preprint.
- [12] Mike A. Merrill et al. 2026. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. arXiv preprint arXiv:2601.11868.
- [13] OpenAI. 2026. Harness engineering: leveraging Codex in an agent-first world. https://openai.com/index/harness-engineering/. OpenAI Blog.
- [14] Wojciech Piotrowski et al. 2023. HYDRA: Adaptive Operation in Evolving Open Worlds. Preprint.
- [15] Noah Shinn, Federico Cassano, Emanuel Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS).
- [16] Haotian Tang, David Key, and Kevin Ellis. 2024. WorldCoder: A Model-Based LLM Agent. In Advances in Neural Information Processing Systems (NeurIPS).
- [17] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
- [18] Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song.
- [19]
- [20] Justin Young. 2025. Effective harnesses for long-running agents. https://anthropic.com/engineering/effective-harnesses-for-long-running-agents. Anthropic Engineering Blog.
- [21] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, V. Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and K. Olukotun. 2025. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint arXiv:2510.04618.
discussion (0)