pith. machine review for the scientific record.

arxiv: 2602.22480 · v3 · submitted 2026-02-25 · 💻 cs.AI · cs.CL · cs.LG

Recognition: no theorem link

VERO: An Evaluation Harness for Agents to Optimize Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:59 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords agent optimization · evaluation harness · coding agents · LLM agents · benchmarking · versioning · empirical study

The pith

VERO supplies a reproducible harness with versioned snapshots and structured traces to compare how agents optimize other agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VERO as an evaluation harness built to handle the specific demands of agent optimization, where one coding agent iteratively edits and improves a target agent through cycles of code changes and LLM-generated completions. Because these cycles mix deterministic code with stochastic outputs, VERO records versioned agent snapshots, enforces evaluation budgets, and captures both intermediate reasoning steps and final execution results. Paired with a benchmark suite of target agents and reference tasks, the harness supports an empirical comparison of optimizer configurations to identify which modifications produce consistent gains in target performance. A reader would care because without this kind of controlled capture, progress on building agents that improve agents remains difficult to measure or reproduce across experiments.
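
To make that loop concrete, here is a minimal, self-contained sketch of an edit-execute-evaluate cycle with versioned snapshots and a hard evaluation budget. Every name in it (Agent, propose_edit, evaluate) is a stand-in invented for illustration, not VERO's actual API, and the scoring is a toy stochastic function standing in for LLM completions.

```python
import copy
import random

class Agent:
    def __init__(self, prompt="solve the task"):
        self.prompt = prompt  # the editable part of the target agent

def evaluate(agent, tasks, rng):
    # Toy stochastic evaluator, standing in for LLM-driven task execution.
    return sum(rng.random() < 0.5 + 0.01 * len(agent.prompt) for _ in tasks) / len(tasks)

def propose_edit(agent, rng):
    # Toy optimizer step, standing in for a coding agent's stochastic edit.
    edited = copy.deepcopy(agent)
    edited.prompt += " step by step"
    return edited

def optimize(target, tasks, budget_evals=20, seed=0):
    rng = random.Random(seed)            # seeded, so a rerun reproduces the run
    snapshots = [copy.deepcopy(target)]  # versioned agent snapshots
    traces = []                          # structured record of every cycle
    best, evals_used = evaluate(target, tasks, rng), 1
    while evals_used < budget_evals:     # hard evaluation budget
        candidate = propose_edit(snapshots[-1], rng)
        score = evaluate(candidate, tasks, rng)
        evals_used += 1
        traces.append({"version": len(snapshots), "score": score})
        snapshots.append(candidate)      # keep every version, not only the best
        best = max(best, score)
    return best, snapshots, traces

best, snapshots, traces = optimize(Agent(), tasks=range(10))
print(f"best score {best:.2f} across {len(snapshots)} snapshots")
```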

Core claim

VERO provides a reproducible evaluation harness featuring versioned agent snapshots, budget-controlled evaluation, and structured execution traces, together with a benchmark suite of target agents and tasks that include reference evaluation procedures. Using this harness, the authors conduct an empirical study that compares different optimizer configurations across tasks and identifies which modifications reliably improve the performance of the target agents being optimized.

What carries the argument

The VERO harness, which supplies versioned agent snapshots, budget-controlled evaluation runs, and structured traces of both reasoning and execution outcomes to support consistent comparisons of agent optimizers.
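
One plausible shape for such a trace, sketched as one record per edit-execute-evaluate step so that reasoning and execution outcomes stay paired. The field names are assumptions made for illustration, not VERO's published schema.

```python
# Hypothetical trace schema: every identifier below is illustrative.
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    snapshot_id: str          # which versioned agent snapshot was evaluated
    optimizer_reasoning: str  # intermediate LLM reasoning behind the edit
    diff: str                 # the concrete code change that was applied
    task_id: str              # which benchmark task was run
    stdout: str               # raw execution output from the target agent
    score: float              # result of the reference evaluation procedure
    tokens_used: int          # counted against the evaluation budget

@dataclass
class Trajectory:
    steps: list[TraceStep] = field(default_factory=list)

    def best_snapshot(self) -> str:
        # Because every step is recorded, "which edit helped" becomes a
        # query over the trace rather than a rerun of the experiment.
        return max(self.steps, key=lambda s: s.score).snapshot_id
```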

If this is right

  • Different optimizer modifications can be tested systematically for their effect on target agent performance across tasks.
  • Structured traces make it possible to examine why particular edits succeed or fail during optimization cycles.
  • The benchmark suite supplies standardized reference procedures that future work can use for direct comparison.
  • Budget limits allow measurement of both performance gains and the computational cost of different optimizer approaches (see the sketch after this list).
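
A toy illustration of that last bullet, reusing the hypothetical TraceStep records sketched above: under budget metering, the performance gain and the compute spent to obtain it are read off the same log.

```python
def lift_and_cost(steps):
    """Score lift and budget consumed for one optimization trajectory.

    `steps` is a list of the illustrative TraceStep records defined earlier,
    in evaluation order; nothing here is VERO's actual reporting code.
    """
    baseline = steps[0].score                   # score of the initial snapshot
    best = max(s.score for s in steps)          # best score reached under budget
    tokens = sum(s.tokens_used for s in steps)  # total metered cost
    return best - baseline, tokens
```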

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same versioning and trace approach could be adapted to create harnesses for optimizing non-coding agents in other domains.
  • Results from VERO-style studies could guide the design of meta-level agents that perform self-optimization over repeated iterations.
  • Consistent findings across tasks may point to general editing strategies that work regardless of the specific target agent.

Load-bearing premise

That structured capture of intermediate reasoning and downstream execution outcomes together with budget-controlled evaluation is necessary and sufficient to produce reliable comparisons of agent optimizers.

What would settle it

An experiment in which optimizer rankings and improvement claims derived from final performance scores alone match or contradict the rankings obtained when the same comparisons are run with VERO's versioning, traces, and budget controls.
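
Expressed as a computation, that test is a rank-agreement check between two ways of scoring the same optimizers. A minimal sketch using Kendall's tau; the scores below are made-up inputs, not results from the paper.

```python
# Hypothetical optimizer scores under the two comparison protocols.
from scipy.stats import kendalltau

final_score_only = {"opt_A": 0.61, "opt_B": 0.58, "opt_C": 0.70}
with_full_harness = {"opt_A": 0.64, "opt_B": 0.52, "opt_C": 0.71}

opts = sorted(final_score_only)
tau, p = kendalltau([final_score_only[o] for o in opts],
                    [with_full_harness[o] for o in opts])
# tau near 1.0 would mean the cheap protocol and the instrumented
# protocol rank optimizers the same way.
print(f"rank agreement: tau={tau:.2f} (p={p:.2f})")
```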

Figures

Figures reproduced from arXiv: 2602.22480 by Apaar Shanker, Sam Denton, Varun Ursekar, Veronica Chatrath, and Yuan (Emily) Xue.

Figure 1
Figure 1: VERO system architecture. Top (orange): example optimization trajectory. Bottom (green): system components. VERO enforces versioning, reproducible execution, and controlled feedback, enabling systematic comparison of optimizers for agent optimization. view at source ↗
Figure 2
Figure 2: Case study results highlighting optimization outcomes across optimizer instruction types for FACTS… view at source ↗
Figure 3
Figure 3: Accuracy ratio (iteration accuracy/baseline) across instruction types, aggregated over all three evaluation… view at source ↗
Figure 4
Figure 4: We plot the probability of changes made between successive evaluations in a trajectory falling into one of… view at source ↗
Figure 5
Figure 5: Optimizer performance across tasks. Each grouped bar shows the average lift (best score minus initial… view at source ↗
Figure 6
Figure 6: We plot the probability of changes made between successive evaluations in a trajectory falling into one… view at source ↗
Figure 7
Figure 7: Mean normalized phase at which the optimal commit is discovered across trajectories. We average over… view at source ↗
Figure 8
Figure 8: Top sub-types for 3 types of modifications optimizers make to targets. view at source ↗
Figure 9
Figure 9: Entropy of change types as a function of optimization phase for different scaffold-variant-model triples. view at source ↗
Figure 10
Figure 10: UMAP-projected text embeddings of Git diffs of commits made in each optimization trajectory with respect to an empty commit. Initial and final commits show natural clusters. view at source ↗
Figure 11
Figure 11: An example trajectory of changes made by V… view at source ↗
Figure 12
Figure 12: An example trajectory of changes made by V… view at source ↗
Figure 13
Figure 13: An example trajectory of changes made by V… view at source ↗
Figure 14
Figure 14: An example trajectory of changes made by V… view at source ↗
read the original abstract

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VERO (Versioning, Rewards, and Observations), an evaluation harness for agent optimization—the iterative improvement of a target agent by a coding agent through edit-execute-evaluate cycles. VERO supplies versioned agent snapshots, budget-controlled evaluation, structured execution traces capturing both intermediate reasoning and downstream outcomes, and a benchmark suite of target agents and tasks with reference procedures. Using the harness, the authors perform an empirical study that compares optimizer configurations across tasks and identifies which modifications reliably improve target-agent performance; the harness is released to support further research.

Significance. If the empirical findings prove robust, the work supplies a missing standardized framework for a task that differs from conventional software engineering because of the interleaving of deterministic code with stochastic LLM completions. A reproducible harness with explicit versioning, budget controls, and structured traces could enable systematic, comparable experiments on agent optimizers, thereby accelerating progress on an emerging capability for coding agents.

major comments (2)
  1. [Empirical study (abstract and §4)] The central empirical claim—that certain modifications 'reliably improve target agent performance'—is load-bearing for the paper's contribution yet rests on an analysis whose statistical foundation is not described. No trial counts, per-configuration variance, error bars, exclusion criteria, or hypothesis tests are reported, despite the acknowledged stochasticity of LLM completions inside the target agents. This directly affects the validity of the 'reliably' qualifier.
  2. [§3 (VERO harness definition)] The harness description (versioned snapshots, budget-controlled evaluation, structured traces) is presented at a high level without concrete implementation details or pseudocode for the core loop that interleaves deterministic code with stochastic LLM steps. Without these, it is impossible to verify that the harness actually enforces the reproducibility and budget controls asserted in the abstract.
minor comments (2)
  1. [Introduction] The distinction drawn in the introduction between agent optimization and conventional software engineering would be clearer if illustrated by a short, concrete example of an edit-execute-evaluate cycle that produces an intermediate reasoning trace.
  2. [Conclusion / artifact statement] The manuscript states that VERO is released but does not include a permanent artifact link or DOI; this should be added for the camera-ready version.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Empirical study (abstract and §4)] The central empirical claim—that certain modifications 'reliably improve target agent performance'—is load-bearing for the paper's contribution yet rests on an analysis whose statistical foundation is not described. No trial counts, per-configuration variance, error bars, exclusion criteria, or hypothesis tests are reported, despite the acknowledged stochasticity of LLM completions inside the target agents. This directly affects the validity of the 'reliably' qualifier.

    Authors: We agree that the statistical foundation of the empirical study in §4 requires explicit description to substantiate the claims. In the revised manuscript we will report the exact number of independent trials run per optimizer configuration, include per-configuration variance and error bars on all performance metrics, specify exclusion criteria for outlier runs, and detail any hypothesis tests used to evaluate whether observed improvements are reliable given the stochasticity of LLM completions. revision: yes
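
For concreteness, the promised error bars could take a form like the following: per-configuration means with bootstrap confidence intervals over independent trials. The trial scores here are invented placeholders; the paper's actual trial counts and tests are what the revision must supply.

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Mean and (1 - alpha) bootstrap CI over independent trial scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(scores) / len(scores), (lo, hi)

trials = [0.62, 0.58, 0.66, 0.61, 0.64]  # hypothetical scores, one configuration
mean, (lo, hi) = bootstrap_ci(trials)
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```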

  2. Referee: [§3 (VERO harness definition)] The harness description (versioned snapshots, budget-controlled evaluation, structured traces) is presented at a high level without concrete implementation details or pseudocode for the core loop that interleaves deterministic code with stochastic LLM steps. Without these, it is impossible to verify that the harness actually enforces the reproducibility and budget controls asserted in the abstract.

    Authors: We acknowledge that §3 currently provides only a high-level description. In the revised manuscript we will add concrete implementation details and pseudocode for the core edit-execute-evaluate loop, explicitly documenting how versioned snapshots are created and restored, how evaluation budgets are enforced, how structured traces record both intermediate LLM reasoning and final execution outcomes, and the mechanisms that maintain reproducibility despite stochastic LLM steps. revision: yes
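
One plausible realization of those mechanisms, offered as a guess at the mechanics rather than VERO's implementation: snapshots as Git commits in the target agent's repository, and the budget as a hard counter that gates every metered call.

```python
import subprocess

def snapshot(repo_dir, message):
    """Commit the current state of the target agent and return its version id."""
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", message], check=True)
    head = subprocess.run(["git", "-C", repo_dir, "rev-parse", "HEAD"],
                          check=True, capture_output=True, text=True)
    return head.stdout.strip()

def restore(repo_dir, commit):
    """Restore the working tree to a previously recorded snapshot."""
    subprocess.run(["git", "-C", repo_dir, "checkout", commit, "--", "."], check=True)

class Budget:
    """Hard cap on metered calls; exhaustion stops the optimization loop."""
    def __init__(self, max_calls):
        self.max_calls, self.used = max_calls, 0
    def charge(self):
        if self.used >= self.max_calls:
            raise RuntimeError("evaluation budget exhausted")
        self.used += 1
```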

Circularity Check

0 steps flagged

No circularity: harness introduction and empirical demonstration are self-contained

full rationale

The paper presents VERO as a new evaluation harness with versioning, rewards, observations, budget control, and structured traces, then uses it for an empirical comparison of optimizer configurations on target agents and tasks. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claim is the release of the harness plus a demonstration study; neither reduces by construction to its own inputs, self-citations, or renamed known results. The work is an engineering contribution whose validity rests on external reproducibility of the released code and benchmarks rather than any internal derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the domain assumption that agent optimization requires joint capture of reasoning traces and execution outcomes; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Agent optimization differs fundamentally from conventional software engineering because the target interleaves deterministic code with stochastic LLM completions.
    Explicitly stated in the abstract as the motivation for the new harness.
invented entities (1)
  • VERO harness · no independent evidence
    purpose: Provide versioned snapshots, budget control, and structured traces for agent-optimization evaluation.
    Newly defined and released in the paper; no independent evidence outside this work is supplied.

pith-pipeline@v0.9.0 · 5465 in / 1228 out tokens · 37563 ms · 2026-05-15T18:59:33.979445+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 10 internal anchors

  1. [1]

    Claude 3.7 Sonnet and Claude Code

    Anthropic. Claude 3.7 Sonnet and Claude Code. URL https://www.anthropic.com/news/claude-3-7-sonnet

  2. [2]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, et al. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

  3. [3]

    Building Effective Enterprise Agents

    Boston Consulting Group AI Platforms Group. Building effective enterprise agents. Technical report, Boston Consulting Group, November 2025. URL https://www.bcg.com/assets/2025/building-effective-enterprise-agents.pdf

  4. [4]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, A. Madry, and L. Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6s5uXNWGIh

  5. [5]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, et al. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374

  6. [6]

    The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

    A. Cheng, A. Jacovi, A. Globerson, B. Golan, C. Kwong, et al. The FACTS leaderboard: A comprehensive benchmark for large language model factuality, 2025. URL https://arxiv.org/abs/2512.10791

  7. [7]

    Trace is the New Autodiff — Unlocking Efficient Optimization of Computational Workflows

    C.-A. Cheng, A. Nie, and A. Swaminathan. Trace is the new autodiff — unlocking efficient optimization of computational workflows. In Automated Reinforcement Learning: Exploring Meta-Learning, AutoML, and LLMs. URL https://openreview.net/forum?id=Lg4AnAFsMd

  9. [9]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

    D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota...

  11. [11]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, et al. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021

  12. [12]

    Automated Design of Agentic Systems

    S. Hu, C. Lu, and J. Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=t9U3LW7JVX

  13. [13]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, et al. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL

  14. [14]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, et al. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  15. [15]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, et al. DSPy: Compiling declarative language model calls into self-improving pipelines. In The Twelfth International Conference on Learning Representations, 2024

  16. [16]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URL https://arxiv.org/abs/2601.11868

  17. [17]

    GAIA: A Benchmark for General AI Assistants

    G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. GAIA: A benchmark for general AI assistants, 2023. URL https://arxiv.org/abs/2311.12983

  19. [19]

    AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery

    A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery, 2025. URL https://arxiv.org/abs/2506.13131

  20. [20]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, et al. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  21. [21]

    A Self-Improving Coding Agent

    M. Robeyns, M. Szummer, and L. Aitchison. A self-improving coding agent, 2025. URL https://arxiv.org/abs/2504.15228

  22. [22]

    Mathematical Discoveries from Program Search with Large Language Models

    B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, et al. Mathematical discoveries from program search with large language models. Nature, 2023. doi: 10.1038/s41586-023-06924-6

  23. [23]

    A Survey on Large Language Model Based Autonomous Agents

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  24. [24]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, et al. OpenHands: An open platform for AI software developers as generalist agents. URL https://arxiv.org/abs/2407.16741

  25. [25]

    Measuring Short-Form Factuality in Large Language Models

    J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, et al. Measuring short-form factuality in large language models, 2024. URL https://arxiv.org/abs/2411.04368

  26. [26]

    LLM Powered Autonomous Agents

    L. Weng. LLM powered autonomous agents. lilianweng.github.io, 2023. URL https://lilianweng.github.io/posts/2023-06-23-agent/

  27. [27]

    Large Language Models as Optimizers

    C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Bb4VGOWELI

  28. [28]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, et al. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.15793

  29. [29]

    The Adoption and Usage of AI Agents: Early Evidence from Perplexity

    J. Yang, N. Yonack, K. Zyskowski, D. Yarats, J. Ho, and J. Ma. The adoption and usage of AI agents: Early evidence from Perplexity. Technical Report Working Paper 26-040, Harvard Business School, 2025. URL https://www.hbs.edu/ris/Publication%20Files/26-040_ac431922-9f75-4f7d-b6dc-b67bb1c02c50.pdf

  30. [30]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, et al. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  31. [31]

    τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    S. Yao, N. Shinn, P. Razavi, and K. R. Narasimhan. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=roNSXZpUDN

  32. [32]

    LLM-AutoDiff: Auto-Differentiate Any LLM Workflow

    L. Yin and Z. Wang. LLM-AutoDiff: Auto-differentiate any LLM workflow, 2025. URL https://arxiv.org/abs/2501.16673

  33. [33]

    Optimizing Generative AI by Backpropagating Language Model Feedback

    M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, et al. Optimizing generative AI by backpropagating language model feedback. Nature, 639:609–616, 2025. doi: 10.1038/s41586-025-08661-4

  34. [34]

    Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

    E. Zelikman, E. Lorch, L. Mackey, and A. T. Kalai. Self-taught optimizer (STOP): Recursively self-improving code generation. In Conference on Language Modeling (CoLM), 2024. URL https://arxiv.org/abs/2310.02304

  35. [35]

    AFlow: Automating Agentic Workflow Generation

    J. Zhang, J. Xiang, Z. Yu, F. Teng, X.-H. Chen, et al. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=z5uVAKwmjf

  36. [36]

    Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

    J. Zhang, S. Hu, C. Lu, R. Lange, and J. Clune. Darwin Gödel machine: Open-ended evolution of self-improving agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=pUpzQZTvGY