Recognition: no theorem link
VeRO: An Evaluation Harness for Agents to Optimize Agents
Pith reviewed 2026-05-15 18:59 UTC · model grok-4.3
The pith
VERO supplies a reproducible harness with versioned snapshots and structured traces to compare how agents optimize other agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VERO provides a reproducible evaluation harness featuring versioned agent snapshots, budget-controlled evaluation, and structured execution traces, together with a benchmark suite of target agents and tasks that include reference evaluation procedures. Using this harness, the authors conduct an empirical study that compares different optimizer configurations across tasks and identifies which modifications reliably improve the performance of the target agents being optimized.
What carries the argument
The VERO harness, which supplies versioned agent snapshots, budget-controlled evaluation runs, and structured traces of both reasoning and execution outcomes to support consistent comparisons of agent optimizers.
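To make the trace-and-snapshot machinery concrete, here is a minimal sketch of the kind of records such a harness could keep, written in Python. The names (AgentSnapshot, TraceEvent, snapshot_id) and their fields are illustrative assumptions, not VERO's actual schema.

```python
# Minimal sketch of harness-side record keeping. All names and fields are
# illustrative assumptions, not VERO's actual schema.
from dataclasses import dataclass, field
from typing import Any
import hashlib
import json
import time


@dataclass
class AgentSnapshot:
    """Immutable, content-addressed version of the target agent's code."""
    files: dict[str, str]              # path -> source text
    parent_id: str | None = None       # previous snapshot in the edit history

    @property
    def snapshot_id(self) -> str:
        payload = json.dumps(self.files, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class TraceEvent:
    """One structured record: a stochastic LLM step, a deterministic execution
    step, or an evaluation result, always tied to the snapshot that produced it."""
    kind: str                          # "llm_completion" | "code_execution" | "evaluation"
    snapshot_id: str                   # which agent version this event belongs to
    payload: dict[str, Any]            # prompt/completion text, stdout, score, ...
    tokens_used: int = 0               # charged against the evaluation budget
    timestamp: float = field(default_factory=time.time)
```

Content-addressing the snapshots makes it cheap to tell whether an optimizer's edit actually changed the target agent, and tagging every event with a snapshot id is what lets later analysis attribute execution outcomes to specific edits.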
If this is right
- Different optimizer modifications can be tested systematically for their effect on target agent performance across tasks.
- Structured traces make it possible to examine why particular edits succeed or fail during optimization cycles.
- The benchmark suite supplies standardized reference procedures that future work can use for direct comparison.
- Budget limits allow measurement of both performance gains and the computational cost of different optimizer approaches.
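Expanding on the last point, a minimal sketch of what budget-controlled evaluation could look like, assuming a token plus wall-clock budget. EvalBudget, BudgetExceeded, and charge are hypothetical names, not VERO's interface.

```python
# Illustrative budget accounting for one evaluation run.
# All names here are hypothetical, not VERO's interface.


class BudgetExceeded(Exception):
    """Raised when an optimizer run exhausts its evaluation budget."""


class EvalBudget:
    def __init__(self, max_tokens: int, max_wall_seconds: float):
        self.max_tokens = max_tokens
        self.max_wall_seconds = max_wall_seconds
        self.tokens_spent = 0
        self.seconds_spent = 0.0

    def charge(self, tokens: int, seconds: float) -> None:
        """Record the cost of one step and fail fast if either limit is crossed."""
        self.tokens_spent += tokens
        self.seconds_spent += seconds
        if self.tokens_spent > self.max_tokens or self.seconds_spent > self.max_wall_seconds:
            raise BudgetExceeded(
                f"spent {self.tokens_spent} tokens / {self.seconds_spent:.0f}s"
            )
```

Reporting tokens_spent alongside the final score is what allows a study to compare not only how much an optimizer improves the target agent but how much compute it burns doing so.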
Where Pith is reading between the lines
- The same versioning and trace approach could be adapted to create harnesses for optimizing non-coding agents in other domains.
- Results from VERO-style studies could guide the design of meta-level agents that perform self-optimization over repeated iterations.
- Consistent findings across tasks may point to general editing strategies that work regardless of the specific target agent.
Load-bearing premise
That structured capture of intermediate reasoning and downstream execution outcomes, together with budget-controlled evaluation, is necessary and sufficient to produce reliable comparisons of agent optimizers.
What would settle it
An experiment in which optimizer rankings and improvement claims derived from final performance scores alone match or contradict the rankings obtained when the same comparisons are run with VERO's versioning, traces, and budget controls.
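One way to operationalize that settling experiment is a rank-correlation check between the two protocols. The sketch below assumes each protocol yields one aggregate score per optimizer; the optimizer names and scores are placeholders, not results from the paper.

```python
# Sketch: do the two evaluation protocols rank optimizers the same way?
# The scores are placeholders; in practice they would come from the harness.
from scipy.stats import spearmanr

optimizers = ["opt_a", "opt_b", "opt_c", "opt_d"]

# Aggregate target-agent performance per optimizer under each protocol.
final_score_only = {"opt_a": 0.61, "opt_b": 0.58, "opt_c": 0.70, "opt_d": 0.55}
with_vero_controls = {"opt_a": 0.63, "opt_b": 0.52, "opt_c": 0.71, "opt_d": 0.57}

rho, p_value = spearmanr(
    [final_score_only[o] for o in optimizers],
    [with_vero_controls[o] for o in optimizers],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# High rho: final scores alone would have ranked the optimizers the same way.
# Low or negative rho: the versioning, trace, and budget controls change the conclusions.
```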
Original abstract
An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VERO (Versioning, Rewards, and Observations), an evaluation harness for agent optimization—the iterative improvement of a target agent by a coding agent through edit-execute-evaluate cycles. VERO supplies versioned agent snapshots, budget-controlled evaluation, structured execution traces capturing both intermediate reasoning and downstream outcomes, and a benchmark suite of target agents and tasks with reference procedures. Using the harness, the authors perform an empirical study that compares optimizer configurations across tasks and identifies which modifications reliably improve target-agent performance; the harness is released to support further research.
Significance. If the empirical findings prove robust, the work supplies a missing standardized framework for a task that differs from conventional software engineering because of the interleaving of deterministic code with stochastic LLM completions. A reproducible harness with explicit versioning, budget controls, and structured traces could enable systematic, comparable experiments on agent optimizers, thereby accelerating progress on an emerging capability for coding agents.
major comments (2)
- [Empirical study (abstract and §4)] The central empirical claim—that certain modifications 'reliably improve target agent performance'—is load-bearing for the paper's contribution yet rests on an analysis whose statistical foundation is not described. No trial counts, per-configuration variance, error bars, exclusion criteria, or hypothesis tests are reported, despite the acknowledged stochasticity of LLM completions inside the target agents. This directly affects the validity of the 'reliably' qualifier.
- [§3 (VERO harness definition)] The harness description (versioned snapshots, budget-controlled evaluation, structured traces) is presented at a high level without concrete implementation details or pseudocode for the core loop that interleaves deterministic code with stochastic LLM steps. Without these, it is impossible to verify that the harness actually enforces the reproducibility and budget controls asserted in the abstract.
minor comments (2)
- [Introduction] The distinction drawn in the introduction between agent optimization and conventional software engineering would be clearer if illustrated by a short, concrete example of an edit-execute-evaluate cycle that produces an intermediate reasoning trace.
- [Conclusion / artifact statement] The manuscript states that VERO is released but does not include a permanent artifact link or DOI; this should be added for the camera-ready version.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: [Empirical study (abstract and §4)] The central empirical claim—that certain modifications 'reliably improve target agent performance'—is load-bearing for the paper's contribution yet rests on an analysis whose statistical foundation is not described. No trial counts, per-configuration variance, error bars, exclusion criteria, or hypothesis tests are reported, despite the acknowledged stochasticity of LLM completions inside the target agents. This directly affects the validity of the 'reliably' qualifier.
Authors: We agree that the statistical foundation of the empirical study in §4 requires explicit description to substantiate the claims. In the revised manuscript we will report the exact number of independent trials run per optimizer configuration, include per-configuration variance and error bars on all performance metrics, specify exclusion criteria for outlier runs, and detail any hypothesis tests used to evaluate whether observed improvements are reliable given the stochasticity of LLM completions. Revision: yes. (An illustrative sketch of this kind of per-configuration reporting appears after these responses.)
- Referee: [§3 (VERO harness definition)] The harness description (versioned snapshots, budget-controlled evaluation, structured traces) is presented at a high level without concrete implementation details or pseudocode for the core loop that interleaves deterministic code with stochastic LLM steps. Without these, it is impossible to verify that the harness actually enforces the reproducibility and budget controls asserted in the abstract.
Authors: We acknowledge that §3 currently provides only a high-level description. In the revised manuscript we will add concrete implementation details and pseudocode for the core edit-execute-evaluate loop, explicitly documenting how versioned snapshots are created and restored, how evaluation budgets are enforced, how structured traces record both intermediate LLM reasoning and final execution outcomes, and the mechanisms that maintain reproducibility despite stochastic LLM steps. Revision: yes.
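As context for the first response above, a hedged sketch of the kind of per-configuration reporting it promises: means with bootstrap confidence intervals over independent trials. The configuration names and scores are placeholders, not data from the paper.

```python
# Illustrative per-configuration reporting: mean and 95% bootstrap CI over repeated trials.
# The score arrays are placeholders, not data from the paper.
import numpy as np

rng = np.random.default_rng(0)

trials = {
    "baseline_optimizer": np.array([0.52, 0.55, 0.49, 0.57, 0.51]),
    "modified_optimizer": np.array([0.58, 0.61, 0.56, 0.60, 0.63]),
}


def bootstrap_ci(scores: np.ndarray, n_resamples: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for the mean of a small trial sample."""
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])


for name, scores in trials.items():
    lo, hi = bootstrap_ci(scores)
    print(f"{name}: mean={scores.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f}), n={scores.size}")
```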
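And for the second response, a rough sketch of what the promised core-loop pseudocode might cover: an edit-execute-evaluate loop with snapshot versioning and budget enforcement. Every name here is a hypothetical stand-in, not VERO's API.

```python
# Hedged sketch of an edit-execute-evaluate loop with versioned snapshots and
# budget enforcement. All names are hypothetical stand-ins, not VERO's API.


class BudgetExceeded(Exception):
    """Hypothetical signal that the evaluation budget is exhausted."""


def optimize(initial_snapshot, tasks, budget, propose_edit, evaluate, max_iterations=10):
    """Greedy loop: keep an edit only if it beats the best versioned snapshot so far.

    propose_edit(snapshot, budget) -> new snapshot (stochastic: the optimizer LLM writes the edit)
    evaluate(snapshot, tasks, budget) -> float score (deterministic harness around a stochastic agent)
    """
    best = initial_snapshot
    best_score = evaluate(best, tasks, budget)          # baseline for the unmodified target agent

    for _ in range(max_iterations):
        candidate = propose_edit(best, budget)          # edit step
        try:
            score = evaluate(candidate, tasks, budget)  # execute + evaluate step
        except BudgetExceeded:
            break                                       # stop cleanly once the budget is spent
        if score > best_score:
            best, best_score = candidate, score         # version the improvement as the new best

    return best, best_score
```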
Circularity Check
No circularity: harness introduction and empirical demonstration are self-contained
Full rationale
The paper presents VERO as a new evaluation harness with versioning, rewards, observations, budget control, and structured traces, then uses it for an empirical comparison of optimizer configurations on target agents and tasks. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claim is the release of the harness plus a demonstration study; neither reduces by construction to its own inputs, self-citations, or renamed known results. The work is an engineering contribution whose validity rests on external reproducibility of the released code and benchmarks rather than any internal derivation loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Agent optimization differs fundamentally from conventional software engineering because the target interleaves deterministic code with stochastic LLM completions.
invented entities (1)
- VERO harness: no independent evidence