VeRO: A Harness for Agents to Optimize Agents

Apaar Shanker; Samuel Marc Denton; Varun Ursekar; Veronica Chatrath; Yuan Xue

arXiv: 2602.22480 · v4 · pith:IRRAAFXSnew · submitted 2026-02-25 · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-15 18:59 UTC · model grok-4.3

classification: cs.AI · cs.CL · cs.LG

keywords: agent optimization · evaluation harness · coding agents · LLM agents · benchmarking · versioning · empirical study

The pith

VeRO supplies a reproducible harness with versioned snapshots and structured traces to compare how agents optimize other agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VeRO as an evaluation harness built for the specific demands of agent optimization, where one coding agent iteratively edits and improves a target agent through cycles of code changes and LLM-generated completions. Because these cycles mix deterministic code with stochastic outputs, VeRO records versioned agent snapshots, enforces evaluation budgets, and captures both intermediate reasoning steps and final execution results. Paired with a benchmark suite of target agents and reference tasks, the harness supports an empirical comparison of optimizer configurations to identify which modifications produce consistent gains in target performance. A reader would care because, without this kind of controlled capture, progress on building agents that improve agents remains difficult to measure or reproduce across experiments.

Core claim

VeRO provides a reproducible evaluation harness featuring versioned agent snapshots, budget-controlled evaluation, and structured execution traces, together with a benchmark suite of target agents and tasks that include reference evaluation procedures. Using this harness, the authors conduct an empirical study that compares different optimizer configurations across tasks and identifies which modifications reliably improve the performance of the target agents being optimized.

What carries the argument

The VeRO harness supplies versioned agent snapshots, budget-controlled evaluation runs, and structured traces of both reasoning and execution outcomes, which together support consistent comparisons of agent optimizers.
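To make the "versioned snapshots plus structured traces" idea concrete, here is a minimal sketch in Python. It is not taken from the VeRO codebase: `SnapshotStore`, `EvalTrace`, `TraceStep`, and `snapshot_id` are hypothetical names, and an in-memory dictionary stands in for whatever version-control backend the real harness uses.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass, field


def snapshot_id(files: dict[str, str]) -> str:
    """Content hash of an agent's source files; identical code yields an identical id."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()[:12]


@dataclass
class TraceStep:
    kind: str     # "llm" for a stochastic completion, "code" for a deterministic step
    content: str


@dataclass
class EvalTrace:
    snapshot: str                  # which agent version produced this trace
    task_id: str
    steps: list[TraceStep] = field(default_factory=list)
    score: float | None = None     # downstream outcome, filled in after execution


class SnapshotStore:
    """Hypothetical in-memory stand-in for a version-control backend."""

    def __init__(self) -> None:
        self._versions: dict[str, dict[str, str]] = {}

    def commit(self, files: dict[str, str]) -> str:
        sid = snapshot_id(files)
        self._versions[sid] = dict(files)  # copy so later edits cannot mutate history
        return sid

    def restore(self, sid: str) -> dict[str, str]:
        return dict(self._versions[sid])


store = SnapshotStore()
v0 = store.commit({"agent.py": "PROMPT = 'Solve the task.'"})
v1 = store.commit({"agent.py": "PROMPT = 'Solve the task step by step.'"})

# One trace ties intermediate steps and the final outcome to a specific version.
trace = EvalTrace(snapshot=v1, task_id="task-001")
trace.steps.append(TraceStep("llm", "I should break the problem into parts."))
trace.steps.append(TraceStep("code", "ran tool: calculator"))
trace.score = 1.0

print(json.dumps(asdict(trace), indent=2))
```

Keying every trace by a content-derived snapshot id is one way to guarantee that a score can always be traced back to the exact code that produced it; the actual harness may use a different mechanism.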

If this is right

Different optimizer modifications can be tested systematically for their effect on target agent performance across tasks.
Structured traces make it possible to examine why particular edits succeed or fail during optimization cycles.
The benchmark suite supplies standardized reference procedures that future work can use for direct comparison.
Budget limits allow measurement of both performance gains and the computational cost of different optimizer approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same versioning and trace approach could be adapted to create harnesses for optimizing non-coding agents in other domains.
Results from VeRO-style studies could guide the design of meta-level agents that perform self-optimization over repeated iterations.
Consistent findings across tasks may point to general editing strategies that work regardless of the specific target agent.

Load-bearing premise

That structured capture of intermediate reasoning and downstream execution outcomes together with budget-controlled evaluation is necessary and sufficient to produce reliable comparisons of agent optimizers.

What would settle it

An experiment in which optimizer rankings and improvement claims derived from final performance scores alone match or contradict the rankings obtained when the same comparisons are run with VeRO's versioning, traces, and budget controls.
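One way to quantify whether two such rankings "match or contradict" is a rank correlation. The sketch below computes Kendall's tau-a in plain Python; the optimizer names and both rankings are invented for illustration and do not come from the paper.

```python
from itertools import combinations


def kendall_tau(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Kendall's tau-a between two rankings of the same items (assumes no ties)."""
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same optimizers")
    items = sorted(rank_a)
    if len(items) < 2:
        raise ValueError("need at least two optimizers to compare rankings")
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        direction = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical rankings (1 = best) of four optimizers under the two protocols.
score_only = {"opt_a": 1, "opt_b": 2, "opt_c": 3, "opt_d": 4}
with_controls = {"opt_a": 2, "opt_b": 1, "opt_c": 3, "opt_d": 4}

tau = kendall_tau(score_only, with_controls)
print(f"Kendall tau between the two rankings: {tau:.3f}")
```

A tau near 1 would suggest the extra controls do not change the ordering of optimizers; a value near 0 or below would suggest that rankings from final scores alone are not trustworthy.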

Figures

Figures reproduced from arXiv: 2602.22480 by Apaar Shanker, Samuel Marc Denton, Varun Ursekar, Veronica Chatrath, Yuan Xue.

**Figure 1.** VeRO system architecture. Top (orange): example optimization trajectory. Bottom (green): system components. VeRO enforces versioning, reproducible execution, and controlled feedback, enabling systematic comparison of optimizers for agent optimization.

**Figure 2.** Case study results highlighting optimization outcomes across optimizer instruction types for FACTS, …

**Figure 3.** Accuracy ratio (iteration accuracy/baseline) across instruction types, aggregated over all three evaluation …

**Figure 4.** We plot the probability of changes made between successive evaluations in a trajectory falling into one of …

**Figure 5.** Optimizer performance across tasks. Each grouped bar shows the average lift (best score minus initial …

**Figure 6.** We plot the probability of changes made between successive evaluations in a trajectory falling into one …

**Figure 7.** Mean normalized phase at which the optimal commit is discovered across trajectories. We average over …

**Figure 8.** Top sub-types for 3 types of modifications optimizers make to targets.

**Figure 9.** Entropy of change types as a function of optimization phase for different scaffold-variant-model triples.

**Figure 10.** UMAP-projected text embeddings of Git diffs of commits made in each optimization trajectory with respect to an empty commit. Initial and final commits show natural clusters.

**Figure 11.** An example trajectory of changes made by V…

**Figure 12.** An example trajectory of changes made by V…

**Figure 13.** An example trajectory of changes made by V…

**Figure 14.** An example trajectory of changes made by V…
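Several captions above name summary quantities: the accuracy ratio (Figure 3), lift (Figure 5), and the entropy of change types (Figure 9). The following sketch shows how such quantities could be computed. The captions are truncated, so the exact definitions used in the paper are not visible here; the formulas below are assumptions based on the visible wording, and the example scores and change-type labels are invented.

```python
import math
from collections import Counter


def accuracy_ratio(iteration_accuracy: float, baseline_accuracy: float) -> float:
    """Iteration accuracy divided by baseline accuracy (Figure 3 wording)."""
    if baseline_accuracy <= 0:
        raise ValueError("baseline accuracy must be positive")
    return iteration_accuracy / baseline_accuracy


def lift(scores: list[float]) -> float:
    """Best score in a trajectory minus its initial score (assumed reading of Figure 5)."""
    if not scores:
        raise ValueError("trajectory must contain at least one score")
    return max(scores) - scores[0]


def change_type_entropy(change_types: list[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of change types."""
    counts = Counter(change_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Invented trajectory: accuracy at the baseline and after three edits.
trajectory = [0.40, 0.45, 0.50, 0.48]
# Invented labels for the kind of change made at each commit.
changes = ["prompt", "prompt", "tool", "workflow"]

print("accuracy ratio at best iteration:", accuracy_ratio(max(trajectory), trajectory[0]))
print("lift:", lift(trajectory))
print("entropy of change types (bits):", change_type_entropy(changes))
```

Higher entropy here means an optimizer spreads its edits across more kinds of change; lower entropy means it keeps making the same kind of edit.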

Original abstract

An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing and evaluating its code. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Harness optimization differs from conventional software engineering: agent harnesses interleave deterministic code with stochastic LLM completions, requiring structured capture of both intermediate execution traces and downstream outcomes. To address these challenges, we introduce (1) VeRO (Versioning, Rewards, and Observations), an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces of target harnesses, and (2) VeRO-Bench, a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizers across tasks and analyzing which modifications reliably improve target agent harnesses. We release VeRO to support research on agent optimization as a core capability for coding agents. Code is available at https://github.com/scaleapi/vero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VeRO provides a practical harness for agent optimization research with good design for traces and budgets, though its empirical claims on reliable improvements lack sufficient statistical support.

VeRO is a solid first harness for evaluating how well agents can optimize other agents, but the reported empirical results don't yet back up the 'reliably improve' language with enough statistical detail.

The paper's main contribution is introducing this evaluation harness called VeRO, which stands for Versioning, Rewards, and Observations. It includes versioned snapshots of agents, ways to control evaluation budgets, and structured traces that capture both the reasoning steps and the actual execution results. This setup is tailored to agent optimization, where one agent edits and improves another through cycles of code changes and tests. That specific focus and the joint capture of reasoning plus outcomes is what sets it apart from general coding agent benchmarks. They also provide a benchmark suite with target agents and tasks, plus reference procedures. Using this, they run comparisons of different optimizer configurations and look at which changes help the target agent perform better. Releasing the whole thing openly is helpful because it gives others a starting point to build on or compare against.

The design choices make sense for the problem. Agent optimization isn't like standard software engineering because of the stochastic elements from LLMs. Logging the intermediate reasoning helps debug why certain edits succeed or fail, and budget control keeps things practical. If the harness is easy to use and reproducible, that alone adds value.

The weaker part is the empirical study. They claim to identify modifications that reliably improve performance, but without details on how many runs they did per configuration, what the variance was, or any statistical tests, it's difficult to accept the reliability part at face value. Stochastic outputs mean results can vary a lot, so small samples or single runs could lead to misleading conclusions. The stress test note points this out, and it seems valid based on what's described.

Overall, this paper is aimed at the community working on autonomous coding agents and self-improvement loops. A reader interested in building better evaluation tools or studying agent optimizers would get practical value from the released harness and benchmarks. It deserves a serious referee because the core idea of a dedicated harness is sound and the release could accelerate work in this area, even if the current experiments need more robustness. I would recommend sending it for peer review, with feedback focused on expanding the experimental analysis to include proper variance reporting and tests.
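As an illustration of the kind of variance reporting the note asks for, the sketch below computes a percentile-bootstrap confidence interval for the mean lift of one optimizer configuration across repeated trials. The lift values are invented, and the choice of a bootstrap (rather than any test the authors may use) is an assumption made for this example.

```python
import random
import statistics


def bootstrap_mean_ci(
    samples: list[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over trials."""
    if not samples:
        raise ValueError("need at least one trial")
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return statistics.fmean(samples), lower, upper


# Invented per-trial lifts (best score minus initial score) for one configuration.
lifts = [0.04, 0.07, -0.01, 0.05, 0.03, 0.06, 0.00, 0.05]

mean, lower, upper = bootstrap_mean_ci(lifts)
print(f"mean lift = {mean:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

Under this reading, an improvement might be called "reliable" only when the whole interval lies above zero; with a handful of trials the interval is wide, which is exactly the concern raised above.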

Referee Report

2 major / 2 minor

Summary. The paper introduces VeRO (Versioning, Rewards, and Observations), an evaluation harness for agent optimization: the iterative improvement of a target agent by a coding agent through edit-execute-evaluate cycles. VeRO supplies versioned agent snapshots, budget-controlled evaluation, structured execution traces capturing both intermediate reasoning and downstream outcomes, and a benchmark suite of target agents and tasks with reference procedures. Using the harness, the authors perform an empirical study that compares optimizer configurations across tasks and identifies which modifications reliably improve target-agent performance; the harness is released to support further research.

Significance. If the empirical findings prove robust, the work supplies a missing standardized framework for a task that differs from conventional software engineering because of the interleaving of deterministic code with stochastic LLM completions. A reproducible harness with explicit versioning, budget controls, and structured traces could enable systematic, comparable experiments on agent optimizers, thereby accelerating progress on an emerging capability for coding agents.

major comments (2)

[Empirical study (abstract and §4)] The central empirical claim—that certain modifications 'reliably improve target agent performance'—is load-bearing for the paper's contribution yet rests on an analysis whose statistical foundation is not described. No trial counts, per-configuration variance, error bars, exclusion criteria, or hypothesis tests are reported, despite the acknowledged stochasticity of LLM completions inside the target agents. This directly affects the validity of the 'reliably' qualifier.
[§3 (VeRO harness definition)] The harness description (versioned snapshots, budget-controlled evaluation, structured traces) is presented at a high level without concrete implementation details or pseudocode for the core loop that interleaves deterministic code with stochastic LLM steps. Without these, it is impossible to verify that the harness actually enforces the reproducibility and budget controls asserted in the abstract.

minor comments (2)

[Introduction] The distinction drawn in the introduction between agent optimization and conventional software engineering would be clearer if illustrated by a short, concrete example of an edit-execute-evaluate cycle that produces an intermediate reasoning trace.
[Conclusion / artifact statement] The manuscript states that VeRO is released but does not include a permanent artifact link or DOI; this should be added for the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Referee: [Empirical study (abstract and §4)] The central empirical claim—that certain modifications 'reliably improve target agent performance'—is load-bearing for the paper's contribution yet rests on an analysis whose statistical foundation is not described. No trial counts, per-configuration variance, error bars, exclusion criteria, or hypothesis tests are reported, despite the acknowledged stochasticity of LLM completions inside the target agents. This directly affects the validity of the 'reliably' qualifier.

Authors: We agree that the statistical foundation of the empirical study in §4 requires explicit description to substantiate the claims. In the revised manuscript we will report the exact number of independent trials run per optimizer configuration, include per-configuration variance and error bars on all performance metrics, specify exclusion criteria for outlier runs, and detail any hypothesis tests used to evaluate whether observed improvements are reliable given the stochasticity of LLM completions.

revision: yes

Referee: [§3 (VeRO harness definition)] The harness description (versioned snapshots, budget-controlled evaluation, structured traces) is presented at a high level without concrete implementation details or pseudocode for the core loop that interleaves deterministic code with stochastic LLM steps. Without these, it is impossible to verify that the harness actually enforces the reproducibility and budget controls asserted in the abstract.

Authors: We acknowledge that §3 currently provides only a high-level description. In the revised manuscript we will add concrete implementation details and pseudocode for the core edit-execute-evaluate loop, explicitly documenting how versioned snapshots are created and restored, how evaluation budgets are enforced, how structured traces record both intermediate LLM reasoning and final execution outcomes, and the mechanisms that maintain reproducibility despite stochastic LLM steps.

revision: yes
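For readers who want a feel for what a budget-controlled edit-execute-evaluate loop could look like, here is a toy sketch. It is not the authors' pseudocode: `BudgetedEvaluator`, `optimize`, and `run_once` are hypothetical names, the "agent" is reduced to a single quality number, and seeded Gaussian noise stands in for stochastic LLM completions (real LLM calls are generally not reproducible this simply).

```python
import random
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when an optimizer asks for more evaluations than it was granted."""


@dataclass
class BudgetedEvaluator:
    max_evals: int
    seed: int = 0
    used: int = 0
    history: list = field(default_factory=list)  # (version, score) pairs

    def evaluate(self, version: int, quality: float) -> float:
        if self.used >= self.max_evals:
            raise BudgetExceeded(f"budget of {self.max_evals} evaluations spent")
        self.used += 1
        # Noise is seeded per (seed, version), so re-running a version replays it.
        rng = random.Random(self.seed * 1_000_003 + version)
        score = quality + rng.gauss(0.0, 0.02)
        self.history.append((version, score))
        return score


def optimize(evaluator, propose_edit, initial_quality: float):
    """Greedy loop: propose an edit, evaluate it, keep it only if the score improves."""
    version, quality = 0, initial_quality
    best_version, best_score = 0, evaluator.evaluate(0, quality)
    while True:
        candidate_quality = propose_edit(quality)   # "edit" step
        version += 1
        try:
            score = evaluator.evaluate(version, candidate_quality)  # "execute + evaluate"
        except BudgetExceeded:
            break                                   # the harness, not the optimizer, stops the run
        if score > best_score:
            best_version, best_score, quality = version, score, candidate_quality
    return best_version, best_score


def run_once(budget: int, seed: int):
    edit_rng = random.Random(seed)
    evaluator = BudgetedEvaluator(max_evals=budget, seed=seed)
    result = optimize(evaluator, lambda q: q + edit_rng.uniform(-0.05, 0.05), 0.50)
    return result, evaluator


result, evaluator = run_once(budget=5, seed=7)
print("best (version, score):", result, "| evaluations used:", evaluator.used)
```

The design point this sketch tries to capture is that the budget lives in the evaluator rather than in the optimizer, so every optimizer under comparison gets the same number of evaluations regardless of how it is written.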

Circularity Check

0 steps flagged

No circularity: harness introduction and empirical demonstration are self-contained

The paper presents VeRO as a new evaluation harness with versioning, rewards, observations, budget control, and structured traces, then uses it for an empirical comparison of optimizer configurations on target agents and tasks. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claim is the release of the harness plus a demonstration study; neither reduces by construction to its own inputs, self-citations, or renamed known results. The work is an engineering contribution whose validity rests on external reproducibility of the released code and benchmarks rather than any internal derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the domain assumption that agent optimization requires joint capture of reasoning traces and execution outcomes; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: Agent optimization differs fundamentally from conventional software engineering because the target interleaves deterministic code with stochastic LLM completions.
Explicitly stated in the abstract as the motivation for the new harness.

invented entities (1)

VeRO harness (no independent evidence)
purpose: Provide versioned snapshots, budget control, and structured traces for agent-optimization evaluation.
Newly defined and released in the paper; no independent evidence outside this work is supplied.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking
cs.AI 2026-05 unverdicted novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

Insights Generator is a multi-agent system that produces evidence-backed natural-language reports characterizing systematic patterns across populations of LLM agent execution traces.

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

Insights Generator is a multi-agent system that generates evidence-backed natural-language insights characterizing systematic patterns across corpora of LLM agent execution traces.