How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent
Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3
The pith
Declarative planning in the agent harness accounts for most performance gains, leaving the LLM with only a residual role.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 54 games, declarative planning carries the heavy lifting (+24.1pp win rate over a belief-only harness, zero LLM calls); symbolic reflection is mechanistically real but calibration-sensitive, with signed board-level effects up to ±0.140 F1 that cancel on aggregate; and LLM-backed revision activates on only 4.3% of turns with a bounded, non-monotonic effect. The contribution is methodological: once harness layers are made externally measurable, the LLM's role can be quantified as residual rather than assumed central.
What carries the argument
Four progressively richer harness layers (posterior belief tracking, declarative planning, symbolic reflection, and LLM-backed revision gate) isolated under a common runtime so that each layer's marginal contribution to win rate can be measured by ablation.
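The ablation bookkeeping this implies can be sketched as follows. The layer names follow the paper; all win rates except the +24.1pp planning delta are illustrative placeholders, not the paper's reported numbers:

```python
# Progressive-ablation sketch: each configuration adds one layer on top
# of the previous one, and a layer's marginal contribution is the
# win-rate delta it produces. Numbers are invented except for the
# +24.1pp planning delta quoted in the abstract.

LAYERS = ["belief", "planning", "reflection", "revision"]

# Hypothetical win rates per progressively richer configuration,
# measured over the same 54-game suite under a common runtime.
win_rate = {
    ("belief",): 0.333,
    ("belief", "planning"): 0.574,                          # +24.1pp
    ("belief", "planning", "reflection"): 0.574,            # cancels on aggregate
    ("belief", "planning", "reflection", "revision"): 0.593,
}

def marginal_contributions(win_rate, layers):
    """Win-rate delta contributed by each layer when added in order."""
    deltas = {}
    for i in range(1, len(layers)):
        prev, curr = tuple(layers[:i]), tuple(layers[:i + 1])
        deltas[layers[i]] = win_rate[curr] - win_rate[prev]
    return deltas

deltas = marginal_contributions(win_rate, LAYERS)
# Pre-specified "heavy lifting": the single largest positive marginal.
heavy_lifter = max(deltas, key=deltas.get)
```

Under these placeholder numbers, `heavy_lifter` is the planning layer, mirroring the paper's designation.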
If this is right
- Declarative planning alone can produce the largest share of an agent's planning competence without any language-model calls.
- Symbolic reflection mechanisms affect individual boards but require calibration so their signed effects do not cancel in aggregate results.
- LLM-backed revision is invoked on only a small fraction of turns and exerts only bounded, non-monotonic influence.
- Performance gaps between agent configurations can be attributed to specific harness components rather than inferred solely from end-to-end scores.
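The cancellation claim in the reflection bullet is easy to see with a toy example (invented numbers, not the paper's data): per-board F1 deltas as large as ±0.140 can average to roughly zero even though every individual board is genuinely affected.

```python
# Toy illustration of signed board-level effects cancelling in aggregate.
# The deltas are invented; only the ±0.140 bound comes from the abstract.
f1_delta = [0.140, -0.140, 0.090, -0.090, 0.020, -0.020]

mean_effect = sum(f1_delta) / len(f1_delta)                      # aggregate ~0
mean_magnitude = sum(abs(d) for d in f1_delta) / len(f1_delta)   # ~0.083
```

The aggregate mean hides the mechanism; only the mean magnitude (or a signed per-board breakdown) reveals that reflection is doing real, calibration-sensitive work.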
Where Pith is reading between the lines
- The layer-isolation technique could be reused on other planning domains to test whether language models add value beyond what symbolic components already supply.
- If unmeasured nonlinear interactions exist among layers, sequential ablation may over- or under-state true marginal contributions.
- Agent builders might first strengthen symbolic planning layers before scaling language-model usage.
Load-bearing premise
The four harness layers can be isolated and ablated without hidden interactions that would change the measured marginal contributions to win rate.
What would settle it
Re-running the full ablation suite on a new set of 54 games and finding that removing the declarative planning layer no longer drops win rate by approximately 24 points would falsify the claim that it carries the heavy lifting.
Original abstract
Agent harnesses -- the stateful programs that wrap a language model and decide what it sees at each step -- are now known to change end-to-end performance on a fixed model by as much as six times. That raises a question asked less often than it should be: how much of an agent's competence does the harness itself already carry, and how much genuinely still needs the LLM? We externalize a planning harness for noisy Collaborative Battleship into four progressively richer layers -- posterior belief tracking, declarative planning, symbolic reflection, and an LLM-backed revision gate -- under a common runtime, taking win rate as the primary metric and F1 as secondary, and pre-specifying heavy lifting as the single largest positive marginal to the primary metric. Across 54 games, declarative planning carries the heavy lifting (+24.1pp win rate over a belief-only harness, zero LLM calls); symbolic reflection is mechanistically real but calibration-sensitive, with signed board-level effects up to ±0.140 F1 that cancel on aggregate; and LLM-backed revision activates on only 4.3% of turns with a bounded, non-monotonic effect. The contribution is methodological: once harness layers are made externally measurable, the LLM's role can be quantified as residual rather than assumed central.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper externalizes a planning harness for noisy Collaborative Battleship into four progressively richer layers (posterior belief tracking, declarative planning, symbolic reflection, LLM-backed revision gate) under a common runtime. It pre-specifies win rate as the primary metric and 'heavy lifting' as the single largest positive marginal gain to that metric. Across 54 games, it reports that declarative planning supplies the heavy lifting (+24.1pp win rate over a belief-only harness with zero LLM calls), that symbolic reflection produces signed board-level F1 effects up to ±0.140 that cancel in aggregate, and that the LLM revision gate activates on only 4.3% of turns with bounded non-monotonic impact. The methodological contribution is to render harness layers externally measurable so that the LLM's role can be treated as residual.
Significance. If the ablation results hold, the work supplies a concrete, replicable method for partitioning agent performance between harness and LLM, with direct implications for agent architecture and evaluation. The pre-specification of the primary metric and the zero-LLM baseline are strengths that make the +24.1pp claim falsifiable and comparable across future studies.
major comments (2)
- [Abstract / Results (progressive ablation)] The central claim that declarative planning supplies the isolated +24.1pp marginal rests on the assumption that the four layers can be added sequentially without non-additive interactions. The abstract describes the layers as 'externalized' and 'progressively richer' but does not report a crossed factorial design, alternative addition orders, or an explicit test for belief-planning interaction terms. Without such evidence, the reported marginal cannot be unambiguously attributed to declarative planning alone rather than to synergies with the belief representation used in the belief-only condition.
- [Abstract / Experimental results] The manuscript reports concrete win-rate deltas and an F1 secondary metric but supplies no error bars, confidence intervals, or per-game variance for the 54-game sample. This omission prevents verification that the +24.1pp difference is statistically distinguishable from zero and undermines the pre-specified 'heavy lifting' designation.
minor comments (2)
- [Abstract] The abstract contains typographical artifacts ('pla nning', 'reflec tion', 'ac tivates') that should be corrected for readability.
- [Results] The description of the LLM revision gate's activation rate (4.3%) and its 'bounded, non-monotonic effect' would benefit from a brief table or figure showing the distribution of activation contexts and the signed win-rate deltas conditional on activation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly where feasible.
Point-by-point responses
Referee: [Abstract / Results (progressive ablation)] The central claim that declarative planning supplies the isolated +24.1pp marginal rests on the assumption that the four layers can be added sequentially without non-additive interactions. The abstract describes the layers as 'externalized' and 'progressively richer' but does not report a crossed factorial design, alternative addition orders, or an explicit test for belief-planning interaction terms. Without such evidence, the reported marginal cannot be unambiguously attributed to declarative planning alone rather than to synergies with the belief representation used in the belief-only condition.
Authors: We acknowledge that a crossed factorial design would offer stronger protection against undetected interactions. Our progressive ablation follows the conventional approach for isolating marginal contributions in component-wise studies and directly implements the pre-specified definition of 'heavy lifting' as the single largest positive marginal on the primary metric. The belief-only baseline contains zero LLM calls by construction, and the large observed gain from adding declarative planning is consistent with its role as the dominant driver. We have added an explicit limitations paragraph in the revised Discussion section stating the additivity assumption, reporting that no evidence of strong non-additivity was observed in the sequential results, and recommending factorial follow-up experiments for future work. revision: partial
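The factorial follow-up the response recommends amounts to estimating an interaction term directly. A minimal 2x2 sketch over the belief and planning layers, with invented win rates for illustration:

```python
# Hypothetical 2x2 factorial check for a belief-planning interaction,
# of the kind the referee requests. All win rates are invented.

# win_rate[(belief_on, planning_on)] over a common game suite
win_rate = {
    (0, 0): 0.10,
    (1, 0): 0.333,
    (0, 1): 0.20,
    (1, 1): 0.574,
}

# Main effects: each layer's delta averaged over both states of the other.
belief_main = ((win_rate[1, 0] - win_rate[0, 0])
               + (win_rate[1, 1] - win_rate[0, 1])) / 2
planning_main = ((win_rate[0, 1] - win_rate[0, 0])
                 + (win_rate[1, 1] - win_rate[1, 0])) / 2

# Interaction: how much planning's delta depends on belief being present.
interaction = ((win_rate[1, 1] - win_rate[1, 0])
               - (win_rate[0, 1] - win_rate[0, 0]))
```

A nonzero `interaction` would mean the +24.1pp planning marginal is partly a synergy with the belief representation, which is exactly what the sequential design cannot rule out.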
Referee: [Abstract / Experimental results] The manuscript reports concrete win-rate deltas and an F1 secondary metric but supplies no error bars, confidence intervals, or per-game variance for the 54-game sample. This omission prevents verification that the +24.1pp difference is statistically distinguishable from zero and undermines the pre-specified 'heavy lifting' designation.
Authors: We agree that uncertainty quantification is required to support the statistical claims. The revised manuscript now includes bootstrap 95% confidence intervals for all win-rate and F1 deltas, together with the per-game standard deviation. The +24.1pp difference has a confidence interval excluding zero, preserving its designation as heavy lifting. These statistics appear in the abstract, results tables, and methods. revision: yes
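The bootstrap procedure the authors describe can be sketched as below. The per-game outcome vectors are invented to match the reported rates (18/54 vs. 31/54 wins, i.e. a +24.1pp delta), not the actual game logs.

```python
# Percentile-bootstrap 95% CI for a win-rate difference between two
# 54-game conditions. Outcome vectors are illustrative, not real data.
import random

random.seed(0)
belief_only = [1] * 18 + [0] * 36     # 33.3% win rate (illustrative)
with_planning = [1] * 31 + [0] * 23   # 57.4% win rate (illustrative)

def bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    deltas = []
    for _ in range(n_boot):
        ra = random.choices(a, k=len(a))   # resample with replacement
        rb = random.choices(b, k=len(b))
        deltas.append(sum(rb) / len(rb) - sum(ra) / len(ra))
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(belief_only, with_planning)
# An interval excluding zero supports the "heavy lifting" designation.
```

With these illustrative samples the interval sits well above zero, consistent with the revision's claim.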
Circularity Check
No circularity: empirical win-rate measurements are direct experimental outputs
Full rationale
The paper performs an empirical ablation study on four harness layers for a Collaborative Battleship agent, reporting win rates and F1 scores across 54 games. 'Heavy lifting' is explicitly pre-specified as the single largest positive marginal win-rate gain; the +24.1pp figure is the observed difference between belief-only and belief+declarative-planning conditions. No equations, fitted parameters, predictions, or first-principles derivations exist in the described chain. No self-citations are invoked to justify uniqueness or load-bearing premises. The measurements are obtained under a common runtime with externalized layers, making the reported deltas independent experimental results rather than reductions to inputs by construction. The design is self-contained against the stated metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: win rate is the appropriate primary metric for measuring agent competence in this noisy game.
Reference graph
Works this paper leans on
- [1] Anthropic. 2025. Claude Code: An agentic coding tool. https://www.anthropic.com/claude-code
- [2] Anthropic and community contributors. [n.d.]. agentskills/agentskills. GitHub repository. https://github.com/agentskills/agentskills. Specification and documentation for Agent Skills.
- [3] Birgitta Böckeler. 2026. Harness Engineering. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html. martinfowler.com.
- [4] Can Bölük. 2026. I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed. https://blog.can.ac/2026/02/12/the-harness-problem/
- [5] Michael T. Cox, Dana Dannenhauer, Donald Perlis, Stuart C. Shapiro, and Murugesan Sundaram. 2016. MIDCA: A Metacognitive Integrated Dual-Cycle Architecture. In AAAI Workshop on Metacognitive Machine Learning.
- [6] Gabriel Grand et al. 2025. Collaborative Battleship with LLMs as Bayesian Experimental Designers. Preprint.
- [7] Nick Haber, Damian Mrowca, Stephanie Wang, Li Fei-Fei, and Daniel L. K. Yamins. 2018. Learning to Play with Intrinsically-Motivated, Self-Aware Agents. In Advances in Neural Information Processing Systems (NeurIPS).
- [8] Shengran Hu, Cong Lu, and Jeff Clune. 2025. Automated Design of Agentic Systems. In International Conference on Learning Representations (ICLR).
- [9] KRAFTON AI and Ludo Robotics. 2026. Terminus-KIRA: Boosting Frontier Model Performance on Terminal-Bench with Minimal Harness. https://github.com/krafton-ai/kira
- [10] John Laird, Allen Newell, and Paul Rosenbloom. 1990. Soar: An Architecture for General Intelligence. Artificial Intelligence 33, 1 (1990), 1–64.
- [11] Bo Liu et al. 2023. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. Preprint.
- [12] Mike A. Merrill et al. 2026. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. arXiv preprint arXiv:2601.11868.
- [13] OpenAI. 2026. Harness engineering: leveraging Codex in an agent-first world. https://openai.com/index/harness-engineering/. OpenAI Blog.
- [14] Wojciech Piotrowski et al. 2023. HYDRA: Adaptive Operation in Evolving Open Worlds. Preprint.
- [15] Noah Shinn, Federico Cassano, Emanuel Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS).
- [16] Haotian Tang, David Key, and Kevin Ellis. 2024. WorldCoder: A Model-Based LLM Agent. In Advances in Neural Information Processing Systems (NeurIPS).
- [17] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
- [18] Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song.
- [19]
- [20] Justin Young. 2025. Effective harnesses for long-running agents. https://anthropic.com/engineering/effective-harnesses-for-long-running-agents. Anthropic Engineering Blog.
- [21] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, V. Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and K. Olukotun. 2025. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint arXiv:2510.04618.
discussion (0)