HARBOR: Automated Harness Optimization
Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3
The pith
Treating language-model agent harness design as a machine-learning optimization problem allows automated search to outperform manual tuning for large flag spaces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Harness design is a first-class machine-learning problem. Automated configuration search using constrained noisy Bayesian optimization dominates manual stacking once the flag space exceeds a handful of bits. We provide the HARBOR reference solver based on a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions, and validate it in an end-to-end run against manual tuning on a coding-agent task suite.
What carries the argument
The HARBOR solver, which performs constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space using a block-additive SAAS surrogate, multi-fidelity acquisition, and trust regions, with cold-start-corrected rewards and posterior chance-constrained safety checks.
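As a concrete, deliberately simplified illustration of what "automated configuration search over a flag space" means mechanically, the sketch below runs a plain Gaussian-process surrogate with an upper-confidence-bound acquisition over four binary flags. All names and the reward function are hypothetical stand-ins; HARBOR's block-additive SAAS surrogate, multi-fidelity acquisition, and TuRBO trust regions are not reproduced here.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def harness_reward(flags):
    # Hypothetical noisy stand-in for running the task suite under a flag vector.
    true = 0.5 * flags[0] + 0.3 * flags[1] * flags[2] - 0.2 * flags[3]
    return true + rng.normal(0.0, 0.05)

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel over flag vectors.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

# Enumerable toy space: 4 binary flags -> 16 configurations.
candidates = np.array(list(product([0, 1], repeat=4)), dtype=float)
X = candidates[rng.choice(len(candidates), size=3, replace=False)]
y = np.array([harness_reward(x) for x in X])

for _ in range(10):
    K_inv = np.linalg.inv(rbf(X, X) + 0.05 ** 2 * np.eye(len(X)))
    K_s = rbf(candidates, X)
    mu = K_s @ K_inv @ y
    var = np.clip(1.0 - np.einsum("ij,jk,ik->i", K_s, K_inv, K_s), 1e-9, None)
    # Upper-confidence-bound acquisition: evaluate where reward could be high.
    x_next = candidates[np.argmax(mu + 2.0 * np.sqrt(var))]
    X = np.vstack([X, x_next])
    y = np.append(y, harness_reward(x_next))

best = X[np.argmax(y)]  # best flag configuration observed so far
```

At real harness scale the space is not enumerable and flags are mixed-type and cost-heterogeneous, which is precisely the regime the paper's surrogate and acquisition choices target.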
If this is right
- The automated method will locate higher-reward harness configurations than manual tuning when the number of flags grows beyond a small set.
- The same optimization framework applies to any agent harness with a bounded flag space and reproducible task suite.
- Cost-aware acquisition and safety checks keep the search efficient and prevent unsafe states during optimization.
- Cold-start corrections improve reward estimation for configurations that have not been fully evaluated.
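The cold-start correction is only named, not specified, in this summary. One plausible sketch (an assumption, not the paper's definition) estimates the warm-up deficit from early episodes, where caches and memory are still cold, and reports the steady-state reward as the configuration's corrected score:

```python
import numpy as np

def cold_start_corrected_reward(episode_rewards, warmup_episodes=2):
    # Hedged sketch: treat the first `warmup_episodes` evaluations as
    # cold-start-depressed, estimate the deficit as the gap between the
    # steady-state and warm-up means, and report the steady-state mean.
    r = np.asarray(episode_rewards, dtype=float)
    warm, steady = r[:warmup_episodes], r[warmup_episodes:]
    deficit = steady.mean() - warm.mean()
    return steady.mean(), deficit

corrected, deficit = cold_start_corrected_reward([0.2, 0.4, 0.7, 0.8, 0.75])
```

Without such a correction, a naive average over all episodes would systematically penalize configurations whose caching features only pay off once warm.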
Where Pith is reading between the lines
- Agent developers could integrate this optimization into training loops to continuously refine harnesses alongside model updates.
- Similar automated approaches might improve configuration of other complex systems like reinforcement learning environments or data processing pipelines.
- If successful, it reduces reliance on expert prompt engineering and harness crafting, potentially democratizing high-performance agent design.
Load-bearing premise
The configuration space remains bounded and the task suite is fully reproducible, allowing the optimization procedure to locate high-reward configurations without prohibitive cost or unsafe states.
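Cost heterogeneity is what makes a cost-aware acquisition necessary: a full task-suite run and a cheap low-fidelity probe should not be scored on expected improvement alone. A common recipe (assumed here for illustration, not quoted from the paper) divides expected improvement by evaluation cost:

```python
from math import erf, exp, pi, sqrt

def norm_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    # Closed-form EI for a Gaussian posterior N(mu, sigma^2) over the reward.
    z = (mu - best) / sigma
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)

def cost_aware_score(mu, sigma, best, cost):
    # EI per unit evaluation cost: cheap probes win unless an expensive
    # full-suite run promises a much larger improvement.
    return expected_improvement(mu, sigma, best) / cost

# Two hypothetical candidates against a best observed reward of 0.6:
cheap = cost_aware_score(mu=0.62, sigma=0.10, best=0.6, cost=1.0)    # low-fidelity probe
costly = cost_aware_score(mu=0.70, sigma=0.05, best=0.6, cost=20.0)  # full task suite
# Here the cheap probe scores higher per unit cost despite its lower mean.
```

This is the sense in which cost-aware acquisition keeps the search budget from being consumed by a handful of expensive full-fidelity evaluations.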
What would settle it
A direct comparison on the same production coding agent and task suite, in which the best manual configuration after equivalent evaluation effort matches or exceeds the HARBOR result, would falsify the claim that automated search dominates.
Original abstract
Long-horizon language-model agents are dominated, in lines of code and in operational complexity, not by their underlying model but by the harness that wraps it: context compaction, tool caching, semantic memory, trajectory reuse, speculative tool prediction, and the glue that binds the model to a sandboxed execution environment. We argue that harness design is a first-class machine-learning problem and that automated configuration search dominates manual stacking once the flag space exceeds a handful of bits. We defend this claim in two steps. First, we formalize automated harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space with cold-start-corrected rewards and a posterior chance-constrained safety check, and give a reference solver, HARBOR (Harness Axis-aligned Regularized Bayesian Optimization Routine), built from a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions. Second, we instantiate the problem in a flag-gated harness over a production coding agent and report a controlled four-round manual-tuning case study against a fixed task suite and an end-to-end HARBOR run. The formulation itself is task-class agnostic: the configuration space, reward correction, acquisition, and safety check apply to any agent harness with a bounded flag space and a reproducible task suite.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that harness design for long-horizon language-model agents is a first-class machine-learning problem best solved by automated configuration search. It formalizes harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous space with cold-start-corrected rewards and a posterior chance-constrained safety check, introduces the HARBOR solver (block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, TuRBO trust regions), and supports the claim via a controlled case study contrasting four-round manual tuning against an end-to-end HARBOR run on a production coding-agent harness with a fixed task suite. The formulation is presented as task-class agnostic for any bounded flag space and reproducible task suite.
Significance. If the case study were augmented with quantitative metrics showing clear outperformance in flag spaces beyond a small number of bits, along with reproducibility details, the work could meaningfully shift agent development practices toward systematic optimization rather than manual stacking. The task-agnostic formalization, reference solver components, and emphasis on safety and cost heterogeneity are positive contributions that could be adopted if empirically grounded.
Major comments (2)
- [case study] Case study description (and abstract): the central claim that automated search dominates manual stacking once the flag space exceeds a handful of bits cannot be assessed because the manuscript supplies neither the exact cardinality nor bit-width of the flag space in the production coding-agent harness, nor evidence that the four-round manual baseline was driven to convergence or compared against alternative manual strategies.
- [abstract and case study] Abstract and case study: the soundness of the empirical claim is undermined by the absence of any quantitative results, error bars, reward values, convergence data, or detailed validation metrics comparing HARBOR to the manual baseline, leaving the performance advantage unmeasurable.
Minor comments (1)
- [formalization] The description of the posterior chance-constrained safety check would benefit from an explicit equation or pseudocode to clarify its integration with the acquisition function.
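As an illustration of the kind of pseudocode the referee requests, here is one plausible shape for a posterior chance-constrained check (an assumption about the construction, not the paper's definition): accept a candidate only if the Gaussian posterior over its safety metric places at least 1 − δ probability on the safe side of the threshold.

```python
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def passes_safety_check(mu_c, sigma_c, threshold=0.0, delta=0.05):
    # Hedged sketch: require P(c(x) <= threshold) >= 1 - delta under the
    # GP posterior c(x) ~ N(mu_c, sigma_c^2) for the safety metric c.
    prob_safe = norm_cdf((threshold - mu_c) / sigma_c)
    return prob_safe >= 1.0 - delta

ok = passes_safety_check(mu_c=-0.3, sigma_c=0.1)     # posterior mass well below threshold
risky = passes_safety_check(mu_c=-0.1, sigma_c=0.1)  # too much mass above threshold
```

In a full solver this predicate would filter candidates before (or inside) the acquisition maximization, which is the integration point the referee asks the authors to make explicit.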
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the empirical details and reproducibility of the case study.
Point-by-point responses
- Referee: [case study] Case study description (and abstract): the central claim that automated search dominates manual stacking once the flag space exceeds a handful of bits cannot be assessed because the manuscript supplies neither the exact cardinality nor bit-width of the flag space in the production coding-agent harness, nor evidence that the four-round manual baseline was driven to convergence or compared against alternative manual strategies.
  Authors: We agree that the exact cardinality and bit-width of the flag space must be stated explicitly for the central claim to be assessable. The revised manuscript will report the precise configuration space of the production coding-agent harness, including the number of flags, their types and domains, and the resulting search-space cardinality. We will also expand the case-study description to detail the manual-tuning protocol, including the specific steps taken across the four rounds, observed performance trends that indicate diminishing returns, and a brief discussion of why alternative manual strategies were not exhaustively benchmarked. Revision: yes.
- Referee: [abstract and case study] Abstract and case study: the soundness of the empirical claim is undermined by the absence of any quantitative results, error bars, reward values, convergence data, or detailed validation metrics comparing HARBOR to the manual baseline, leaving the performance advantage unmeasurable.
  Authors: We concur that quantitative metrics are required to substantiate the performance advantage. The revised version will augment both the abstract and the case-study section with the missing numerical results, including per-round reward values, standard-error bars where multiple runs are available, convergence curves, and explicit validation metrics that directly compare the HARBOR run against the manual baseline on the fixed task suite. Revision: yes.
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper formalizes harness optimization as constrained noisy Bayesian optimization with a block-additive SAAS surrogate and multi-fidelity acquisition, then reports an empirical case study contrasting four-round manual tuning against an end-to-end HARBOR run on a fixed task suite. No equations, predictions, or first-principles results reduce by construction to fitted inputs or self-citations; the configuration space, reward correction, and safety check are defined independently of the specific case-study outcomes. The dominance claim is defended by the case study rather than by re-deriving the same quantities from the optimization itself. The formulation is explicitly task-agnostic and does not rely on load-bearing self-citations or ansatzes smuggled from prior author work. This is the normal, non-circular outcome for a methods-plus-case-study paper.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the harness configuration space is mixed-variable, cost-heterogeneous, and bounded, with a reproducible task suite.
Forward citations
Cited by 1 Pith paper
- Agentic-imodels: Evolving agentic interpretability tools via autoresearch. Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
Reference graph
Works this paper leans on
- [1] Mikek, B., Vashchilenko, D., Lu, B., and Xu, P. Agentic Code Optimization via Compiler-LLM Cooperation. arXiv preprint arXiv:2604.04238, 2026.
- [2] Kang, M., Chen, W.-N., Han, D., Inan, H. A., Wutschitz, L., Chen, Y., Sim, R., and Rajmohan, S. ACON: Optimizing Context Compression for Long-horizon LLM Agents. arXiv preprint arXiv:2510.00615, 2025.
- [3] Gupta, A. 2025 Was Agents. 2026 Is Agent Harnesses. Medium, 2026.
- [4] Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. NeurIPS, 2020.
- [5] Datadog Engineering. Closing the Verification Loop: Observability-Driven Harnesses for Building with Agents (BitsEvolve). Datadog Engineering Blog, 2026.
- [6] Su, J., Lan, Q., Xia, Y., Sun, L., Tian, W., Shi, T., Song, X., He, L., and Jingsong, Y. Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows. arXiv preprint arXiv:2509.11079, 2025.
- [7] Liu, J., Zhao, X., Shang, X., and Shen, Z. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems. arXiv preprint arXiv:2604.14228, 2026.
- [8] Atlan Engineering. RAGAS, TruLens, DeepEval: LLM Evaluation Frameworks Compared. Atlan Engineering Blog, 2026.
- [9] Eriksson, D., Pearce, M., Gardner, J. R., Turner, R. D., and Poloczek, M. Scalable Global Optimization via Local Bayesian Optimization. NeurIPS, 2019.
- [10] Eriksson, D. and Jankowiak, M. High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces. UAI, 2021.
- [11] Daulton, S., Wan, X., Eriksson, D., Balandat, M., Osborne, M. A., and Bakshy, E. Bayesian Optimization over Discrete and Mixed Spaces via Probabilistic Reparameterization. NeurIPS, 2022.
- [12] Garrido-Merchán, E. C. and Hernández-Lobato, D. Dealing with Categorical and Integer-Valued Variables in Bayesian Optimization with Gaussian Processes. Neurocomputing, 380:20–35, 2020.
- [13] Wu, J., Toscano-Palmerin, S., Frazier, P. I., and Wilson, A. G. Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning. UAI, 2020.
- [14] Duvenaud, D., Nickisch, H., and Rasmussen, C. E. Additive Gaussian Processes. NIPS, 2011.
- [15] Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google Vizier: A Service for Black-Box Optimization. KDD, 2017.
- [16] EleutherAI. LM Evaluation Harness. GitHub, EleutherAI/lm-evaluation-harness, 2024.
- [17] Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. ICML, 2018.
- [18] Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Rühle, V., Lakshmanan, L. V. S., and Awadallah, A. H. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. ICLR, 2024.
- [19] Lopopolo, R. Harness Engineering: Leveraging Codex in an Agent-First World. OpenAI Research Blog, 2026.
- [20] Shakir, A., Aarsen, T., and Lee, S. Binary and Scalar Embedding Quantization for Significantly Faster and Cheaper Retrieval. Hugging Face Blog, 2024.
- [21] Jimenez, C., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024.
- [22] Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed Prompting: A Modular Approach for Solving Complex Tasks. ICLR, 2023.
- [23] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., and Farhadi, A. Matryoshka Representation Learning. NeurIPS, 2022.
- [24] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR, 18(185):1–52, 2018.
- [25] LMCache Team. Context Engineering Reuse Patterns: Under the Hood of Claude Code. LMCache Engineering Blog, 2025.
- [26] Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., and Finn, C. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052, 2026.
- [27] OpenTelemetry Specification Working Group. Semantic Conventions for Generative AI Systems (v1.37). OpenTelemetry Specification, 2026.
- [28] Hu, Z., Pan, Z., Kaur, P., Murthy, V., Yu, Z., Guan, Y., Wang, Z., Swanson, S., and Ding, Y. Pancake: Hierarchical Memory System for Multi-Agent LLM Serving. arXiv preprint arXiv:2602.21477, 2026.
- [29] Sui, Y., Zhao, H., Ma, R., He, Z., Wang, H., Li, J., and Yang, Y. Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution. arXiv preprint arXiv:2603.18897, 2026.
- [30] Ram, S., Zhang, S., Gong, J., and Roth, D. On the Optimality Gap of Warm-Started Hyperparameter Optimization. Transactions on Machine Learning Research (TMLR), 2022.
- [31] Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR, 2024.
- [32] Liu, X., Atalar, B., Dai, X., Zuo, J., Wang, S., Lui, J. C. S., Chen, W., and Joe-Wong, C. Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation. Proceedings of INFOCOM, 2026.
- [33] Schreiter, J., Nguyen-Tuong, D., and Toussaint, M. Safe Risk-Averse Bayesian Optimization for Controller Tuning. arXiv preprint arXiv:2306.13479, 2023.
- [34] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS, 2023.
- [35]
- [36] Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS, 2012.
- [37]
- [38] Laude Institute and Stanford University. Terminal-Bench 2.0: A Benchmark for Agents in Terminal Environments. Open-source benchmark; https://www.tbench.ai/, 2026.
- [39] Krafton AI Research. How We Reached 74.8% on Terminal-Bench with Terminus-KIRA: Harness Fixes That Matter. Engineering Blog, 2026.
- [40] Wang, J. and Sengupta, B. From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python. arXiv preprint arXiv:2604.11518, 2026.
- [41] Watershed Engineering. A Practical Framework for LLM System Evaluations for Multi-Step Processes. Watershed Engineering Blog, 2026.
- [42] Wilson, E. B. Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158):209–212, 1927.
- [43] Zandieh, A., Daliri, M., Hadian, M., and Mirrokni, V. TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate. arXiv preprint arXiv:2504.19874, 2025.
© 2026 JP Morgan Chase & Co. All rights reserved