HARBOR: Automated Harness Optimization
Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3
The pith
Treating language-model agent harness design as a machine-learning optimization problem allows automated search to outperform manual tuning for large flag spaces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Harness design is a first-class machine-learning problem. Automated configuration search using constrained noisy Bayesian optimization dominates manual stacking once the flag space exceeds a handful of bits. We provide the HARBOR reference solver based on a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions, and validate it in an end-to-end run against manual tuning on a coding-agent task suite.
What carries the argument
The HARBOR solver, which performs constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space using a block-additive SAAS surrogate, multi-fidelity acquisition, and trust regions, with cold-start-corrected rewards and posterior chance-constrained safety checks.
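As a concrete, deliberately simplified illustration of what "automated configuration search over a flag space" means mechanically, the sketch below runs a plain Gaussian-process surrogate with an upper-confidence-bound acquisition over four binary flags. All names and the reward function are hypothetical stand-ins; HARBOR's block-additive SAAS surrogate, multi-fidelity acquisition, and TuRBO trust regions are not reproduced here.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def harness_reward(flags):
    # Hypothetical noisy stand-in for running the task suite under a flag vector.
    true = 0.5 * flags[0] + 0.3 * flags[1] * flags[2] - 0.2 * flags[3]
    return true + rng.normal(0.0, 0.05)

def rbf(A, B, ls=1.0):
    # Squared-exponential kernel over flag vectors.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

# Enumerable toy space: 4 binary flags -> 16 configurations.
candidates = np.array(list(product([0, 1], repeat=4)), dtype=float)
X = candidates[rng.choice(len(candidates), size=3, replace=False)]
y = np.array([harness_reward(x) for x in X])

for _ in range(10):
    K_inv = np.linalg.inv(rbf(X, X) + 0.05 ** 2 * np.eye(len(X)))
    K_s = rbf(candidates, X)
    mu = K_s @ K_inv @ y
    var = np.clip(1.0 - np.einsum("ij,jk,ik->i", K_s, K_inv, K_s), 1e-9, None)
    # Upper-confidence-bound acquisition: evaluate where reward could be high.
    x_next = candidates[np.argmax(mu + 2.0 * np.sqrt(var))]
    X = np.vstack([X, x_next])
    y = np.append(y, harness_reward(x_next))

best = X[np.argmax(y)]  # best flag configuration observed so far
```

At real harness scale the space is not enumerable and flags are mixed-type and cost-heterogeneous, which is precisely the regime the paper's surrogate and acquisition choices target.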
If this is right
- The automated method will locate higher-reward harness configurations than manual tuning when the number of flags grows beyond a small set.
- The same optimization framework applies to any agent harness with a bounded flag space and reproducible task suite.
- Cost-aware acquisition and safety checks keep the search efficient and prevent unsafe states during optimization.
- Cold-start corrections improve reward estimation for configurations that have not been fully evaluated.
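The cold-start correction is only named, not specified, in this summary. One plausible sketch (an assumption, not the paper's definition) estimates the warm-up deficit from early episodes, where caches and memory are still cold, and reports the steady-state reward as the configuration's corrected score:

```python
import numpy as np

def cold_start_corrected_reward(episode_rewards, warmup_episodes=2):
    # Hedged sketch: treat the first `warmup_episodes` evaluations as
    # cold-start-depressed, estimate the deficit as the gap between the
    # steady-state and warm-up means, and report the steady-state mean.
    r = np.asarray(episode_rewards, dtype=float)
    warm, steady = r[:warmup_episodes], r[warmup_episodes:]
    deficit = steady.mean() - warm.mean()
    return steady.mean(), deficit

corrected, deficit = cold_start_corrected_reward([0.2, 0.4, 0.7, 0.8, 0.75])
```

Without such a correction, a naive average over all episodes would systematically penalize configurations whose caching features only pay off once warm.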
Where Pith is reading between the lines
- Agent developers could integrate this optimization into training loops to continuously refine harnesses alongside model updates.
- Similar automated approaches might improve configuration of other complex systems like reinforcement learning environments or data processing pipelines.
- If successful, it reduces reliance on expert prompt engineering and harness crafting, potentially democratizing high-performance agent design.
Load-bearing premise
The configuration space remains bounded and the task suite is fully reproducible, allowing the optimization procedure to locate high-reward configurations without prohibitive cost or unsafe states.
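Cost heterogeneity is what makes a cost-aware acquisition necessary: a full task-suite run and a cheap low-fidelity probe should not be scored on expected improvement alone. A common recipe (assumed here for illustration, not quoted from the paper) divides expected improvement by evaluation cost:

```python
from math import erf, exp, pi, sqrt

def norm_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    # Closed-form EI for a Gaussian posterior N(mu, sigma^2) over the reward.
    z = (mu - best) / sigma
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)

def cost_aware_score(mu, sigma, best, cost):
    # EI per unit evaluation cost: cheap probes win unless an expensive
    # full-suite run promises a much larger improvement.
    return expected_improvement(mu, sigma, best) / cost

# Two hypothetical candidates against a best observed reward of 0.6:
cheap = cost_aware_score(mu=0.62, sigma=0.10, best=0.6, cost=1.0)    # low-fidelity probe
costly = cost_aware_score(mu=0.70, sigma=0.05, best=0.6, cost=20.0)  # full task suite
# Here the cheap probe scores higher per unit cost despite its lower mean.
```

This is the sense in which cost-aware acquisition keeps the search budget from being consumed by a handful of expensive full-fidelity evaluations.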
What would settle it
A direct comparison on the same production coding agent and task suite, in which the best manual configuration after equivalent evaluation effort matches or exceeds the HARBOR result, would falsify the claim that automated search dominates.
Original abstract
Long-horizon language-model agents are dominated, in lines of code and in operational complexity, not by their underlying model but by the harness that wraps it: context compaction, tool caching, semantic memory, trajectory reuse, speculative tool prediction, and the glue that binds the model to a sandboxed execution environment. We argue that harness design is a first-class machine-learning problem and that automated configuration search dominates manual stacking once the flag space exceeds a handful of bits. We defend this claim in two steps. First, we formalize automated harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space with cold-start-corrected rewards and a posterior chance-constrained safety check, and give a reference solver, HARBOR (Harness Axis-aligned Regularized Bayesian Optimization Routine), built from a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions. Second, we instantiate the problem in a flag-gated harness over a production coding agent and report a controlled four-round manual-tuning case study against a fixed task suite and an end-to-end HARBOR run. The formulation itself is task-class agnostic: the configuration space, reward correction, acquisition, and safety check apply to any agent harness with a bounded flag space and a reproducible task suite.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that harness design for long-horizon language-model agents is a first-class machine-learning problem best solved by automated configuration search. It formalizes harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous space with cold-start-corrected rewards and a posterior chance-constrained safety check, introduces the HARBOR solver (block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, TuRBO trust regions), and supports the claim via a controlled case study contrasting four-round manual tuning against an end-to-end HARBOR run on a production coding-agent harness with a fixed task suite. The formulation is presented as task-class agnostic for any bounded flag space and reproducible task suite.
Significance. If the case study were augmented with quantitative metrics showing clear outperformance in flag spaces beyond a small number of bits, along with reproducibility details, the work could meaningfully shift agent development practices toward systematic optimization rather than manual stacking. The task-agnostic formalization, reference solver components, and emphasis on safety and cost heterogeneity are positive contributions that could be adopted if empirically grounded.
Major comments (2)
- [case study] Case study description (and abstract): the central claim that automated search dominates manual stacking once the flag space exceeds a handful of bits cannot be assessed because the manuscript supplies neither the exact cardinality nor bit-width of the flag space in the production coding-agent harness, nor evidence that the four-round manual baseline was driven to convergence or compared against alternative manual strategies.
- [abstract and case study] Abstract and case study: the soundness of the empirical claim is undermined by the absence of any quantitative results, error bars, reward values, convergence data, or detailed validation metrics comparing HARBOR to the manual baseline, leaving the performance advantage unmeasurable.
Minor comments (1)
- [formalization] The description of the posterior chance-constrained safety check would benefit from an explicit equation or pseudocode to clarify its integration with the acquisition function.
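As an illustration of the kind of pseudocode the referee requests, here is one plausible shape for a posterior chance-constrained check (an assumption about the construction, not the paper's definition): accept a candidate only if the Gaussian posterior over its safety metric places at least 1 − δ probability on the safe side of the threshold.

```python
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def passes_safety_check(mu_c, sigma_c, threshold=0.0, delta=0.05):
    # Hedged sketch: require P(c(x) <= threshold) >= 1 - delta under the
    # GP posterior c(x) ~ N(mu_c, sigma_c^2) for the safety metric c.
    prob_safe = norm_cdf((threshold - mu_c) / sigma_c)
    return prob_safe >= 1.0 - delta

ok = passes_safety_check(mu_c=-0.3, sigma_c=0.1)     # posterior mass well below threshold
risky = passes_safety_check(mu_c=-0.1, sigma_c=0.1)  # too much mass above threshold
```

In a full solver this predicate would filter candidates before (or inside) the acquisition maximization, which is the integration point the referee asks the authors to make explicit.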
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the empirical details and reproducibility of the case study.
Point-by-point responses
- Referee: [case study] Case study description (and abstract): the central claim that automated search dominates manual stacking once the flag space exceeds a handful of bits cannot be assessed because the manuscript supplies neither the exact cardinality nor bit-width of the flag space in the production coding-agent harness, nor evidence that the four-round manual baseline was driven to convergence or compared against alternative manual strategies.
  Authors: We agree that the exact cardinality and bit-width of the flag space must be stated explicitly for the central claim to be assessable. The revised manuscript will report the precise configuration space of the production coding-agent harness, including the number of flags, their types and domains, and the resulting search-space cardinality. We will also expand the case-study description to detail the manual-tuning protocol, including the specific steps taken across the four rounds, observed performance trends that indicate diminishing returns, and a brief discussion of why alternative manual strategies were not exhaustively benchmarked. Revision: yes.
- Referee: [abstract and case study] Abstract and case study: the soundness of the empirical claim is undermined by the absence of any quantitative results, error bars, reward values, convergence data, or detailed validation metrics comparing HARBOR to the manual baseline, leaving the performance advantage unmeasurable.
  Authors: We concur that quantitative metrics are required to substantiate the performance advantage. The revised version will augment both the abstract and the case-study section with the missing numerical results, including per-round reward values, standard-error bars where multiple runs are available, convergence curves, and explicit validation metrics that directly compare the HARBOR run against the manual baseline on the fixed task suite. Revision: yes.
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper formalizes harness optimization as constrained noisy Bayesian optimization with a block-additive SAAS surrogate and multi-fidelity acquisition, then reports an empirical case study contrasting four-round manual tuning against an end-to-end HARBOR run on a fixed task suite. No equations, predictions, or first-principles results reduce by construction to fitted inputs or self-citations; the configuration space, reward correction, and safety check are defined independently of the specific case-study outcomes. The dominance claim is defended by the case study rather than by re-deriving the same quantities from the optimization itself. The formulation is explicitly task-agnostic and does not rely on load-bearing self-citations or ansatzes smuggled from prior author work. This is the normal, non-circular outcome for a methods-plus-case-study paper.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the harness configuration space is mixed-variable, cost-heterogeneous, and bounded, with a reproducible task suite.
Forward citations
Cited by 1 Pith paper
- Agentic-imodels: Evolving agentic interpretability tools via autoresearch. Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
Reference graph
Works this paper leans on
- [1] Mikek, B., Vashchilenko, D., Lu, B., and Xu, P. Agentic Code Optimization via Compiler-LLM Cooperation. arXiv preprint arXiv:2604.04238, 2026.
- [2] Kang, M., Chen, W.-N., Han, D., Inan, H. A., Wutschitz, L., Chen, Y., Sim, R., and Rajmohan, S. ACON: Optimizing Context Compression for Long-horizon LLM Agents. arXiv preprint arXiv:2510.00615, 2025.
- [3] Gupta, A. 2025 Was Agents. 2026 Is Agent Harnesses. Medium, 2026.
- [4] Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. NeurIPS, 2020.
- [5] Datadog Engineering. Closing the Verification Loop: Observability-Driven Harnesses for Building with Agents (BitsEvolve). Datadog Engineering Blog, 2026.
- [6] Su, J., Lan, Q., Xia, Y., Sun, L., Tian, W., Shi, T., Song, X., He, L., and Jingsong, Y. Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows. arXiv preprint arXiv:2509.11079, 2025.
- [7] Liu, J., Zhao, X., Shang, X., and Shen, Z. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems. arXiv preprint arXiv:2604.14228, 2026.
- [8] Atlan Engineering. RAGAS, TruLens, DeepEval: LLM Evaluation Frameworks Compared. Atlan Engineering Blog, 2026.
- [9] Eriksson, D., Pearce, M., Gardner, J. R., Turner, R. D., and Poloczek, M. Scalable Global Optimization via Local Bayesian Optimization. NeurIPS, 2019.
- [10] Eriksson, D. and Jankowiak, M. High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces. UAI, 2021.
- [11] Daulton, S., Wan, X., Eriksson, D., Balandat, M., Osborne, M. A., and Bakshy, E. Bayesian Optimization over Discrete and Mixed Spaces via Probabilistic Reparameterization. NeurIPS, 2022.
- [12] Garrido-Merchán, E. C. and Hernández-Lobato, D. Dealing with Categorical and Integer-Valued Variables in Bayesian Optimization with Gaussian Processes. Neurocomputing, 380:20–35, 2020.
- [13] Wu, J., Toscano-Palmerin, S., Frazier, P. I., and Wilson, A. G. Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning. UAI, 2020.
- [14] Duvenaud, D., Nickisch, H., and Rasmussen, C. E. Additive Gaussian Processes. NIPS, 2011.
- [15] Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google Vizier: A Service for Black-Box Optimization. KDD, 2017.
- [16] EleutherAI. LM Evaluation Harness. GitHub, EleutherAI/lm-evaluation-harness, 2024.
- [17] Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. ICML, 2018.
- [18] Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Rühle, V., Lakshmanan, L. V. S., and Awadallah, A. H. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. ICLR, 2024.
- [19] Lopopolo, R. Harness Engineering: Leveraging Codex in an Agent-First World. OpenAI Research Blog, 2026.
- [20] Shakir, A., Aarsen, T., and Lee, S. Binary and Scalar Embedding Quantization for Significantly Faster and Cheaper Retrieval. Hugging Face Blog, 2024.
- [21] Jimenez, C., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024.
- [22] Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed Prompting: A Modular Approach for Solving Complex Tasks. ICLR, 2023.
- [23] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., and Farhadi, A. Matryoshka Representation Learning. NeurIPS, 2022.
- [24] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR, 18(185):1–52, 2018.
- [25] LMCache Team. Context Engineering Reuse Patterns: Under the Hood of Claude Code. LMCache Engineering Blog, 2025.
- [26] Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., and Finn, C. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052, 2026.
- [27] OpenTelemetry Specification Working Group. Semantic Conventions for Generative AI Systems (v1.37). OpenTelemetry Specification, 2026.
- [28] Hu, Z., Pan, Z., Kaur, P., Murthy, V., Yu, Z., Guan, Y., Wang, Z., Swanson, S., and Ding, Y. Pancake: Hierarchical Memory System for Multi-Agent LLM Serving. arXiv preprint arXiv:2602.21477, 2026.
- [29] Sui, Y., Zhao, H., Ma, R., He, Z., Wang, H., Li, J., and Yang, Y. Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution. arXiv preprint arXiv:2603.18897, 2026.
- [30] Ram, S., Zhang, S., Gong, J., and Roth, D. On the Optimality Gap of Warm-Started Hyperparameter Optimization. Transactions on Machine Learning Research (TMLR), 2022.
- [31] Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR, 2024.
- [32] Liu, X., Atalar, B., Dai, X., Zuo, J., Wang, S., Lui, J. C. S., Chen, W., and Joe-Wong, C. Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation. Proceedings of INFOCOM, 2026.
- [33] Schreiter, J., Nguyen-Tuong, D., and Toussaint, M. Safe Risk-Averse Bayesian Optimization for Controller Tuning. arXiv preprint arXiv:2306.13479, 2023.
- [34] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS, 2023.
- [35]
- [36] Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS, 2012.
- [37]
- [38] Laude Institute and Stanford University. Terminal-Bench 2.0: A Benchmark for Agents in Terminal Environments. Open-source benchmark; https://www.tbench.ai/, 2026.
- [39] Krafton AI Research. How We Reached 74.8% on Terminal-Bench with Terminus-KIRA: Harness Fixes That Matter. Engineering Blog, 2026.
- [40] Wang, J. and Sengupta, B. From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python. arXiv preprint arXiv:2604.11518, 2026.
- [41] Watershed Engineering. A Practical Framework for LLM System Evaluations for Multi-Step Processes. Watershed Engineering Blog, 2026.
- [42] Wilson, E. B. Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158):209–212, 1927.
- [43] Zandieh, A., Daliri, M., Hadian, M., and Mirrokni, V. TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate. arXiv preprint arXiv:2504.19874, 2025.
© 2026 JP Morgan Chase & Co. All rights reserved