pith. machine review for the scientific record.

arxiv: 2602.22480 · v3 · submitted 2026-02-25 · 💻 cs.AI · cs.CL · cs.LG

Recognition: no theorem link

VERO: An Evaluation Harness for Agents to Optimize Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:59 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords agent optimization · evaluation harness · coding agents · LLM agents · benchmarking · versioning · empirical study

The pith

VERO supplies a reproducible harness with versioned snapshots and structured traces to compare how agents optimize other agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VERO as an evaluation harness built to handle the specific demands of agent optimization, where one coding agent iteratively edits and improves a target agent through cycles of code changes and LLM-generated completions. Because these cycles mix deterministic code with stochastic outputs, VERO records versioned agent snapshots, enforces evaluation budgets, and captures both intermediate reasoning steps and final execution results. Paired with a benchmark suite of target agents and reference tasks, the harness supports an empirical comparison of optimizer configurations to identify which modifications produce consistent gains in target performance. A reader would care because without this kind of controlled capture, progress on building agents that improve agents remains difficult to measure or reproduce across experiments.
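
To make that loop concrete, here is a minimal, self-contained sketch of an edit-execute-evaluate cycle with versioned snapshots and a hard evaluation budget. Every name in it (Agent, propose_edit, evaluate) is a stand-in invented for illustration, not VERO's actual API, and the scoring is a toy stochastic function standing in for LLM completions.

```python
import copy
import random

class Agent:
    def __init__(self, prompt="solve the task"):
        self.prompt = prompt  # the editable part of the target agent

def evaluate(agent, tasks, rng):
    # Toy stochastic evaluator, standing in for LLM-driven task execution.
    return sum(rng.random() < 0.5 + 0.01 * len(agent.prompt) for _ in tasks) / len(tasks)

def propose_edit(agent, rng):
    # Toy optimizer step, standing in for a coding agent's stochastic edit.
    edited = copy.deepcopy(agent)
    edited.prompt += " step by step"
    return edited

def optimize(target, tasks, budget_evals=20, seed=0):
    rng = random.Random(seed)            # seeded, so a rerun reproduces the run
    snapshots = [copy.deepcopy(target)]  # versioned agent snapshots
    traces = []                          # structured record of every cycle
    best, evals_used = evaluate(target, tasks, rng), 1
    while evals_used < budget_evals:     # hard evaluation budget
        candidate = propose_edit(snapshots[-1], rng)
        score = evaluate(candidate, tasks, rng)
        evals_used += 1
        traces.append({"version": len(snapshots), "score": score})
        snapshots.append(candidate)      # keep every version, not only the best
        best = max(best, score)
    return best, snapshots, traces

best, snapshots, traces = optimize(Agent(), tasks=range(10))
print(f"best score {best:.2f} across {len(snapshots)} snapshots")
```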

Core claim

VERO provides a reproducible evaluation harness featuring versioned agent snapshots, budget-controlled evaluation, and structured execution traces, together with a benchmark suite of target agents and tasks that include reference evaluation procedures. Using this harness, the authors conduct an empirical study that compares different optimizer configurations across tasks and identifies which modifications reliably improve the performance of the target agents being optimized.

What carries the argument

The VERO harness, which supplies versioned agent snapshots, budget-controlled evaluation runs, and structured traces of both reasoning and execution outcomes to support consistent comparisons of agent optimizers.
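
One plausible shape for such a trace, sketched as one record per edit-execute-evaluate step so that reasoning and execution outcomes stay paired. The field names are assumptions made for illustration, not VERO's published schema.

```python
# Hypothetical trace schema: every identifier below is illustrative.
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    snapshot_id: str          # which versioned agent snapshot was evaluated
    optimizer_reasoning: str  # intermediate LLM reasoning behind the edit
    diff: str                 # the concrete code change that was applied
    task_id: str              # which benchmark task was run
    stdout: str               # raw execution output from the target agent
    score: float              # result of the reference evaluation procedure
    tokens_used: int          # counted against the evaluation budget

@dataclass
class Trajectory:
    steps: list[TraceStep] = field(default_factory=list)

    def best_snapshot(self) -> str:
        # Because every step is recorded, "which edit helped" becomes a
        # query over the trace rather than a rerun of the experiment.
        return max(self.steps, key=lambda s: s.score).snapshot_id
```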

If this is right

  • Different optimizer modifications can be tested systematically for their effect on target agent performance across tasks.
  • Structured traces make it possible to examine why particular edits succeed or fail during optimization cycles.
  • The benchmark suite supplies standardized reference procedures that future work can use for direct comparison.
  • Budget limits allow measurement of both performance gains and the computational cost of different optimizer approaches (see the sketch after this list).
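
A toy illustration of that last bullet, reusing the hypothetical TraceStep records sketched above: under budget metering, the performance gain and the compute spent to obtain it are read off the same log.

```python
def lift_and_cost(steps):
    """Score lift and budget consumed for one optimization trajectory.

    `steps` is a list of the illustrative TraceStep records defined earlier,
    in evaluation order; nothing here is VERO's actual reporting code.
    """
    baseline = steps[0].score                   # score of the initial snapshot
    best = max(s.score for s in steps)          # best score reached under budget
    tokens = sum(s.tokens_used for s in steps)  # total metered cost
    return best - baseline, tokens
```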

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same versioning and trace approach could be adapted to create harnesses for optimizing non-coding agents in other domains.
  • Results from VERO-style studies could guide the design of meta-level agents that perform self-optimization over repeated iterations.
  • Consistent findings across tasks may point to general editing strategies that work regardless of the specific target agent.

Load-bearing premise

That structured capture of intermediate reasoning and downstream execution outcomes together with budget-controlled evaluation is necessary and sufficient to produce reliable comparisons of agent optimizers.

What would settle it

An experiment in which optimizer rankings and improvement claims derived from final performance scores alone match or contradict the rankings obtained when the same comparisons are run with VERO's versioning, traces, and budget controls.
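
Expressed as a computation, that test is a rank-agreement check between two ways of scoring the same optimizers. A minimal sketch using Kendall's tau; the scores below are made-up inputs, not results from the paper.

```python
# Hypothetical optimizer scores under the two comparison protocols.
from scipy.stats import kendalltau

final_score_only = {"opt_A": 0.61, "opt_B": 0.58, "opt_C": 0.70}
with_full_harness = {"opt_A": 0.64, "opt_B": 0.52, "opt_C": 0.71}

opts = sorted(final_score_only)
tau, p = kendalltau([final_score_only[o] for o in opts],
                    [with_full_harness[o] for o in opts])
# tau near 1.0 would mean the cheap protocol and the instrumented
# protocol rank optimizers the same way.
print(f"rank agreement: tau={tau:.2f} (p={p:.2f})")
```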

Figures

Figures reproduced from arXiv: 2602.22480 by Apaar Shanker, Sam Denton, Varun Ursekar, Veronica Chatrath, and Yuan (Emily) Xue.

Figure 1
Figure 1: VERO system architecture. Top (orange): example optimization trajectory. Bottom (green): system components. VERO enforces versioning, reproducible execution, and controlled feedback, enabling systematic comparison of optimizers for agent optimization. view at source ↗
Figure 2
Figure 2: Case study results highlighting optimization outcomes across optimizer instruction types for FACTS… view at source ↗
Figure 3
Figure 3: Accuracy ratio (iteration accuracy/baseline) across instruction types, aggregated over all three evaluation… view at source ↗
Figure 4
Figure 4: We plot the probability of changes made between successive evaluations in a trajectory falling into one of… view at source ↗
Figure 5
Figure 5: Optimizer performance across tasks. Each grouped bar shows the average lift (best score minus initial… view at source ↗
Figure 6
Figure 6: We plot the probability of changes made between successive evaluations in a trajectory falling into one… view at source ↗
Figure 7
Figure 7: Mean normalized phase at which the optimal commit is discovered across trajectories. We average over… view at source ↗
Figure 8
Figure 8: Top sub-types for 3 types of modifications optimizers make to targets. view at source ↗
Figure 9
Figure 9: Entropy of change types as a function of optimization phase for different scaffold-variant-model triples. view at source ↗
Figure 10
Figure 10: UMAP-projected text embeddings of Git diffs of commits made in each optimization trajectory with respect to an empty commit. Initial and final commits show natural clusters. view at source ↗
Figure 11
Figure 11: An example trajectory of changes made by V… view at source ↗
Figure 12
Figure 12: An example trajectory of changes made by V… view at source ↗
Figure 13
Figure 13: An example trajectory of changes made by V… view at source ↗
Figure 14
Figure 14: An example trajectory of changes made by V… view at source ↗
read the original abstract

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VERO (Versioning, Rewards, and Observations), an evaluation harness for agent optimization—the iterative improvement of a target agent by a coding agent through edit-execute-evaluate cycles. VERO supplies versioned agent snapshots, budget-controlled evaluation, structured execution traces capturing both intermediate reasoning and downstream outcomes, and a benchmark suite of target agents and tasks with reference procedures. Using the harness, the authors perform an empirical study that compares optimizer configurations across tasks and identifies which modifications reliably improve target-agent performance; the harness is released to support further research.

Significance. If the empirical findings prove robust, the work supplies a missing standardized framework for a task that differs from conventional software engineering because of the interleaving of deterministic code with stochastic LLM completions. A reproducible harness with explicit versioning, budget controls, and structured traces could enable systematic, comparable experiments on agent optimizers, thereby accelerating progress on an emerging capability for coding agents.

major comments (2)
  1. [Empirical study (abstract and §4)] The central empirical claim—that certain modifications 'reliably improve target agent performance'—is load-bearing for the paper's contribution yet rests on an analysis whose statistical foundation is not described. No trial counts, per-configuration variance, error bars, exclusion criteria, or hypothesis tests are reported, despite the acknowledged stochasticity of LLM completions inside the target agents. This directly affects the validity of the 'reliably' qualifier.
  2. [§3 (VERO harness definition)] The harness description (versioned snapshots, budget-controlled evaluation, structured traces) is presented at a high level without concrete implementation details or pseudocode for the core loop that interleaves deterministic code with stochastic LLM steps. Without these, it is impossible to verify that the harness actually enforces the reproducibility and budget controls asserted in the abstract.
minor comments (2)
  1. [Introduction] The distinction drawn in the introduction between agent optimization and conventional software engineering would be clearer if illustrated by a short, concrete example of an edit-execute-evaluate cycle that produces an intermediate reasoning trace.
  2. [Conclusion / artifact statement] The manuscript states that VERO is released but does not include a permanent artifact link or DOI; this should be added for the camera-ready version.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Empirical study (abstract and §4)] The central empirical claim—that certain modifications 'reliably improve target agent performance'—is load-bearing for the paper's contribution yet rests on an analysis whose statistical foundation is not described. No trial counts, per-configuration variance, error bars, exclusion criteria, or hypothesis tests are reported, despite the acknowledged stochasticity of LLM completions inside the target agents. This directly affects the validity of the 'reliably' qualifier.

    Authors: We agree that the statistical foundation of the empirical study in §4 requires explicit description to substantiate the claims. In the revised manuscript we will report the exact number of independent trials run per optimizer configuration, include per-configuration variance and error bars on all performance metrics, specify exclusion criteria for outlier runs, and detail any hypothesis tests used to evaluate whether observed improvements are reliable given the stochasticity of LLM completions. revision: yes
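
For concreteness, the promised error bars could take a form like the following: per-configuration means with bootstrap confidence intervals over independent trials. The trial scores here are invented placeholders; the paper's actual trial counts and tests are what the revision must supply.

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Mean and (1 - alpha) bootstrap CI over independent trial scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(scores) / len(scores), (lo, hi)

trials = [0.62, 0.58, 0.66, 0.61, 0.64]  # hypothetical scores, one configuration
mean, (lo, hi) = bootstrap_ci(trials)
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```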

  2. Referee: [§3 (VERO harness definition)] The harness description (versioned snapshots, budget-controlled evaluation, structured traces) is presented at a high level without concrete implementation details or pseudocode for the core loop that interleaves deterministic code with stochastic LLM steps. Without these, it is impossible to verify that the harness actually enforces the reproducibility and budget controls asserted in the abstract.

    Authors: We acknowledge that §3 currently provides only a high-level description. In the revised manuscript we will add concrete implementation details and pseudocode for the core edit-execute-evaluate loop, explicitly documenting how versioned snapshots are created and restored, how evaluation budgets are enforced, how structured traces record both intermediate LLM reasoning and final execution outcomes, and the mechanisms that maintain reproducibility despite stochastic LLM steps. revision: yes
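
One plausible realization of those mechanisms, offered as a guess at the mechanics rather than VERO's implementation: snapshots as Git commits in the target agent's repository, and the budget as a hard counter that gates every metered call.

```python
import subprocess

def snapshot(repo_dir, message):
    """Commit the current state of the target agent and return its version id."""
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", message], check=True)
    head = subprocess.run(["git", "-C", repo_dir, "rev-parse", "HEAD"],
                          check=True, capture_output=True, text=True)
    return head.stdout.strip()

def restore(repo_dir, commit):
    """Restore the working tree to a previously recorded snapshot."""
    subprocess.run(["git", "-C", repo_dir, "checkout", commit, "--", "."], check=True)

class Budget:
    """Hard cap on metered calls; exhaustion stops the optimization loop."""
    def __init__(self, max_calls):
        self.max_calls, self.used = max_calls, 0
    def charge(self):
        if self.used >= self.max_calls:
            raise RuntimeError("evaluation budget exhausted")
        self.used += 1
```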

Circularity Check

0 steps flagged

No circularity: harness introduction and empirical demonstration are self-contained

full rationale

The paper presents VERO as a new evaluation harness with versioning, rewards, observations, budget control, and structured traces, then uses it for an empirical comparison of optimizer configurations on target agents and tasks. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claim is the release of the harness plus a demonstration study; neither reduces by construction to its own inputs, self-citations, or renamed known results. The work is an engineering contribution whose validity rests on external reproducibility of the released code and benchmarks rather than any internal derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the domain assumption that agent optimization requires joint capture of reasoning traces and execution outcomes; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Agent optimization differs fundamentally from conventional software engineering because the target interleaves deterministic code with stochastic LLM completions.
    Explicitly stated in the abstract as the motivation for the new harness.
invented entities (1)
  • VERO harness · no independent evidence
    purpose: Provide versioned snapshots, budget control, and structured traces for agent-optimization evaluation.
    Newly defined and released in the paper; no independent evidence outside this work is supplied.

pith-pipeline@v0.9.0 · 5465 in / 1228 out tokens · 37563 ms · 2026-05-15T18:59:33.979445+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 10 internal anchors

  1. [1]

    Claude 3.7 Sonnet and Claude Code

    Anthropic. Claude 3.7 Sonnet and Claude Code. URL https://www.anthropic.com/news/claude-3-7-sonnet

  2. [2]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, et al. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

  3. [3]

    Building Effective Enterprise Agents

    Boston Consulting Group AI Platforms Group. Building effective enterprise agents. Technical report, Boston Consulting Group, November 2025. URL https://www.bcg.com/assets/2025/building-effective-enterprise-agents.pdf

  4. [4]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, A. Madry, and L. Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6s5uXNWGIh

  5. [5]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, et al. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374

  6. [6]

    The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

    A. Cheng, A. Jacovi, A. Globerson, B. Golan, C. Kwong, et al. The FACTS leaderboard: A comprehensive benchmark for large language model factuality, 2025. URL https://arxiv.org/abs/2512.10791

  7. [7]

    Trace is the New Autodiff — Unlocking Efficient Optimization of Computational Workflows

    C.-A. Cheng, A. Nie, and A. Swaminathan. Trace is the new autodiff — unlocking efficient optimization of computational workflows. In Automated Reinforcement Learning: Exploring Meta-Learning, AutoML, and LLMs. URL https://openreview.net/forum?id=Lg4AnAFsMd

  9. [9]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

    D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota...

  11. [11]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, et al. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021

  12. [12]

    Automated Design of Agentic Systems

    S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=t9U3LW7JVX

  13. [13]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, et al. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL

  14. [14]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, et al. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  15. [15]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, et al. DSPy: Compiling declarative language model calls into self-improving pipelines. In The Twelfth International Conference on Learning Representations, 2024

  16. [16]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URL https://arxiv.org/abs/2601.11868

  17. [17]

    GAIA: A Benchmark for General AI Assistants

    G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. GAIA: A benchmark for general AI assistants, 2023. URL https://arxiv.org/abs/2311.12983

  19. [19]

    AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery

    A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery, 2025. URL https://arxiv.org/abs/2506.13131

  20. [20]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, et al. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  21. [21]

    A Self-Improving Coding Agent

    M. Robeyns, M. Szummer, and L. Aitchison. A self-improving coding agent, 2025. URL https://arxiv.org/abs/2504.15228

  22. [22]

    Mathematical Discoveries from Program Search with Large Language Models

    B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, et al. Mathematical discoveries from program search with large language models. Nature, 2023. doi: 10.1038/s41586-023-06924-6

  23. [23]

    A Survey on Large Language Model Based Autonomous Agents

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  24. [24]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, et al. OpenHands: An open platform for AI software developers as generalist agents. URL https://arxiv.org/abs/2407.16741

  25. [25]

    Measuring Short-Form Factuality in Large Language Models

    J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, et al. Measuring short-form factuality in large language models, 2024. URL https://arxiv.org/abs/2411.04368

  26. [26]

    LLM Powered Autonomous Agents

    L. Weng. LLM powered autonomous agents. lilianweng.github.io, 2023. URL https://lilianweng.github.io/posts/2023-06-23-agent/

  27. [27]

    Large Language Models as Optimizers

    C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Bb4VGOWELI

  28. [28]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, et al. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.15793

  29. [29]

    The Adoption and Usage of AI Agents: Early Evidence from Perplexity

    J. Yang, N. Yonack, K. Zyskowski, D. Yarats, J. Ho, and J. Ma. The adoption and usage of AI agents: Early evidence from Perplexity. Technical Report Working Paper 26-040, Harvard Business School, 2025. URL https://www.hbs.edu/ris/Publication%20Files/26-040_ac431922-9f75-4f7d-b6dc-b67bb1c02c50.pdf

  30. [30]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, et al. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  31. [31]

    τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    S. Yao, N. Shinn, P. Razavi, and K. R. Narasimhan. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=roNSXZpUDN

  32. [32]

    LLM-AutoDiff: Auto-Differentiate Any LLM Workflow

    L. Yin and Z. Wang. LLM-AutoDiff: Auto-differentiate any LLM workflow, 2025. URL https://arxiv.org/abs/2501.16673

  33. [33]

    Optimizing Generative AI by Backpropagating Language Model Feedback

    M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, et al. Optimizing generative AI by backpropagating language model feedback. Nature, 639:609–616, 2025. doi: 10.1038/s41586-025-08661-4

  34. [34]

    Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

    E. Zelikman, E. Lorch, L. Mackey, and A. T. Kalai. Self-taught optimizer (STOP): Recursively self-improving code generation. In Conference on Language Modeling (CoLM), 2024. URL https://arxiv.org/abs/2310.02304

  35. [35]

    AFlow: Automating Agentic Workflow Generation

    J. Zhang, J. Xiang, Z. Yu, F. Teng, X.-H. Chen, et al. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=z5uVAKwmjf

  36. [36]

    Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

    J. Zhang, S. Hu, C. Lu, R. Lange, and J. Clune. Darwin Gödel machine: Open-ended evolution of self-improving agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=pUpzQZTvGY