SPEAR: Code-Augmented Agentic Prompt Optimization

Cong Feng; Fengyi Li; Guangming Lu; Huimin Han; Mengyin Lu; Shihui Long; Tanvi Motwani; Xiaonan Ding; Yu Sun

arxiv: 2605.26275 · v1 · pith:4CPPPTH6new · submitted 2026-05-25 · 💻 cs.CL

SPEAR: Code-Augmented Agentic Prompt Optimization

Mengyin Lu , Cong Feng , Huimin Han , Guangming Lu , Yu Sun , Xiaonan Ding , Shihui Long , Fengyi Li

show 1 more author

Tanvi Motwani

This is my paper

Pith reviewed 2026-06-29 21:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords automatic prompt engineeringagentic optimizationcode-augmented agentsLLM evaluationprompt optimizationsandbox executionerror analysis

0 comments

The pith

SPEAR lets an agent write and run Python code on evaluation data to analyze prompt errors, winning every industrial LLM judge task on the primary metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SPEAR as an agentic system for automatic prompt engineering that gives the optimizer four tools: evaluate, python, set_prompt, and finish. The agent decides autonomously when to use each, with the Python sandbox allowing it to author and execute code that computes confusion matrices, clusters errors, and derives per-group metrics on the current DataFrame. Guardrails of auto-rollback on metric regression and an optional floor turn the long-horizon process into monotone improvement. Experiments across thirteen industrial judge tasks and several benchmarks show SPEAR records the highest score on every primary metric, with the largest gains traced to the Python tool on complex multi-class cases.

Core claim

SPEAR ports the code-as-action paradigm to prompt optimization so the agent itself authors structural error analyses that a long-context LLM cannot reliably extract from raw evaluation DataFrames; the resulting prompts achieve higher agreement and F1 scores on every tested industrial task.

What carries the argument

The Python sandbox tool, which lets the agent write and execute arbitrary Python on the evaluation DataFrame to perform confusion-matrix and error-clustering analyses.

If this is right

Agentic prompt optimizers that can execute code outperform fixed-pipeline APE methods on complex LLM-as-judge tasks.
Auto-rollback on metric regression converts an open-ended agent into a reliable monotone improver.
The Python tool supplies an irreplaceable aggregation step for multi-class error patterns that direct LLM inspection misses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Code-execution tools may improve agent loops in other optimization domains that rely on tabular evaluation data.
Extending the sandbox to additional analysis libraries could surface still finer error structures.
Production systems could adopt similar agents to reduce manual prompt iteration cycles.

Load-bearing premise

That a long-context LLM cannot reliably compute class-pair confusion aggregations and similar structural statistics directly from the raw evaluation DataFrame.

What would settle it

Disable the Python tool in SPEAR and measure whether the reported gains of roughly 0.79 kappa on the 5-class tool-selection judge and 0.35 kappa on the hardest extraction dimension disappear.

Figures

Figures reproduced from arXiv: 2605.26275 by Cong Feng, Fengyi Li, Guangming Lu, Huimin Han, Mengyin Lu, Shihui Long, Tanvi Motwani, Xiaonan Ding, Yu Sun.

**Figure 1.** Figure 1: SPEAR architecture. A GPT-5.4 agent brain is grounded by an optional policy document and chooses one of four tools per step: python (executes arbitrary code in an AST-restricted sandbox with the current eval DataFrame in scope), evaluate (scores the current prompt on train or valid via a task-model inference pass), set_prompt (replaces the system prompt), and finish (terminates). Tool outputs (Results) f… view at source ↗

**Figure 2.** Figure 2: Per-method convergence on the three industrial tasks. Each method is plotted in its own panel because the optimization- internal score and budget unit differ. Per panel: x-axis is the method’s native budget unit (SPEAR: full- split evaluate calls; GEPA: metric_calls, i.e. per-row scores within the Pareto loop; TextGrad: step×(batch + val_eval)). Y-axis is the metric the optimizer’s loop actually observes… view at source ↗

**Figure 3.** Figure 3: SPEAR agent trace on the Hiring Assistant [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

read the original abstract

Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) to APE and propose SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer with four tools -- evaluate, python, set_prompt, finish -- that decides autonomously how and when to use them. The distinctive tool is the Python sandbox: the optimizer writes and executes arbitrary Python on the current evaluation DataFrame, performing structural error analysis (confusion matrices, error clustering, per group metrics) the agent itself authors. Two guardrails turn the long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional guard metric floor. We evaluate on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems) plus seven BBH tasks and GSM8K. SPEAR wins every industrial task on the primary metric ($\kappa$ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763 on filter-relevance; $\kappa$ 0.254 vs 0.218 on the hardest extraction dimension). On BBH-7 SPEAR averages 0.938 accuracy vs GEPA 0.628 and TextGrad 0.484. Ablations show the Python tool is the largest single lever on complex judge tasks ($\Delta \approx +0.79\kappa$ on the 5-class tool-selection judge, $\Delta \approx +0.35\kappa$ on the hardest extraction dimension when removed); its irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SPEAR adds a Python sandbox to an agentic prompt optimizer and shows gains on industrial judge tasks, but the ablation leaves the 'LLM cannot extract aggregates' claim untested.

read the letter

The new element is letting the prompt optimizer itself write and run Python on the evaluation dataframe to build confusion matrices and per-group metrics. This extends CodeAct-style tool use into APE in a way the cited baselines do not describe.

The work reports clear wins on three industrial LLM-judge suites and on BBH-7, with the Python tool ablation producing the largest drops on the complex judge tasks. That is concrete and worth noting.

The soft spot is the missing control for the central claim. The paper states the sandbox is irreplaceable because long-context LLMs cannot reliably compute the same structural aggregates from the raw dataframe. The ablation removes the tool but does not include an arm where the LLM receives the full dataframe in context and is explicitly asked to produce the same matrices and clusters. Without that comparison the performance gap could trace to prompt design or agent scaffolding rather than an inherent long-context limitation.

The numbers are presented with external baselines and no obvious circularity, but the abstract-only view leaves data splits and statistical details opaque.

This is for readers working on agentic prompt systems or LLM judges that need structured error analysis. It is worth a serious referee because the method is explicit, the tasks are practical, and the ablation points to a specific lever even if one control is still needed.

Referee Report

2 major / 2 minor

Summary. The paper proposes SPEAR, an agentic automatic prompt engineering system that augments an LLM optimizer with four tools (evaluate, python, set_prompt, finish) and uses a Python sandbox for the agent to author and execute structural error analyses (confusion matrices, clustering, per-group metrics) on evaluation DataFrames. Guardrails (auto-rollback on regression, optional guard metric floor) enforce monotone improvement. It reports that SPEAR outperforms published baselines (GEPA, TextGrad) on all 13 industrial LLM-as-judge tasks and on BBH-7/GSM8K, with ablations attributing the largest gains on complex judge tasks to the Python tool.

Significance. If the empirical claims hold after addressing the control-experiment gap, the work provides concrete evidence that code-augmented agents can outperform pure LLM-based prompt optimizers on structured error analysis tasks that are central to industrial LLM-as-judge pipelines. The monotone-improvement guardrails and free-form tool-use policy are reusable design elements that could influence subsequent agentic APE systems.

major comments (2)

[Abstract] Abstract: The central claim that the Python tool's 'irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame' is load-bearing for the ablation interpretation, yet the reported ablations (tool removal yielding Δ ≈ +0.79κ on tool-selection) contain no control arm in which the LLM is given the full raw DataFrame in context and explicitly prompted to compute and reason over the identical structural statistics (confusion matrices, error clustering, per-group metrics). Without this arm the performance gap cannot be attributed to an inherent long-context limitation rather than prompt design or default agent behavior.
[Experiments] Experiments (industrial tasks and BBH results): The manuscript reports consistent wins on the primary metric across all tasks but does not state whether the same number of optimization steps, evaluation budget, or prompt templates were used for the GEPA and TextGrad baselines; if the baselines were run under a more restricted protocol the magnitude of the reported gaps (e.g., 0.938 vs 0.628 accuracy on BBH-7) cannot be interpreted as evidence of architectural superiority.

minor comments (2)

[Abstract] The abstract lists three industrial suites but does not name the exact 13 judge tasks or the primary metric per task; a table or explicit enumeration would improve clarity.
Reproducibility: No statement is made about public release of the agent code, prompt templates, or the exact DataFrames used for the industrial suites; adding this would strengthen the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify important gaps in experimental controls and protocol documentation. We address each point below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the Python tool's 'irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame' is load-bearing for the ablation interpretation, yet the reported ablations (tool removal yielding Δ ≈ +0.79κ on tool-selection) contain no control arm in which the LLM is given the full raw DataFrame in context and explicitly prompted to compute and reason over the identical structural statistics (confusion matrices, error clustering, per-group metrics). Without this arm the performance gap cannot be attributed to an inherent long-context limitation rather than prompt design or default agent behavior.

Authors: We agree that a control arm in which the LLM optimizer is explicitly given the full raw evaluation DataFrame and prompted to compute the same structural statistics would provide stronger evidence that the performance gap stems from an inherent limitation of long-context reasoning rather than prompt design. The existing ablation removes the Python tool entirely and shows large drops (Δ ≈ +0.79κ on tool-selection), indicating that the agent's ability to author and execute custom analyses is critical. We will add the requested control experiment in the revision to better isolate the contribution of code-augmented analysis. revision: yes
Referee: [Experiments] Experiments (industrial tasks and BBH results): The manuscript reports consistent wins on the primary metric across all tasks but does not state whether the same number of optimization steps, evaluation budget, or prompt templates were used for the GEPA and TextGrad baselines; if the baselines were run under a more restricted protocol the magnitude of the reported gaps (e.g., 0.938 vs 0.628 accuracy on BBH-7) cannot be interpreted as evidence of architectural superiority.

Authors: GEPA and TextGrad were run under the identical protocol as SPEAR, using the same number of optimization steps, evaluation budget, and prompt templates. This was done to isolate architectural differences. We will add an explicit statement confirming these matched conditions in the Experiments section of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external baselines and ablations

full rationale

The paper reports empirical performance of an agentic optimizer on industrial and benchmark tasks, comparing against externally published baselines (GEPA, TextGrad) and using ablations on tool removal. No equations, fitted parameters, or self-citations appear in the provided text that reduce any performance claim or uniqueness assertion to a quantity defined by the authors' own prior work. The central premise about the Python tool's contribution is presented as an empirical observation from ablations rather than a self-definitional or fitted-input reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard LLM evaluation practices plus the domain assumption that code execution supplies analytical capabilities unavailable to language-only reasoning; no free parameters or invented entities are visible in the abstract.

free parameters (1)

guard metric floor
Optional parameter mentioned as one of the two guardrails that enforce monotone improvement.

axioms (1)

domain assumption Auto-rollback on metric regression together with the guard metric floor converts the long-horizon agent into a monotone-improving optimizer.
Explicitly stated in the abstract as the mechanism that stabilizes the free-form agent.

pith-pipeline@v0.9.1-grok · 5893 in / 1389 out tokens · 55299 ms · 2026-06-29T21:31:57.055345+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Trace is the next AutoDiff: Generative opti- mization with rich feedback, execution traces, and LLMs.arXiv preprint arXiv:2406.16218. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman

work page arXiv
[2]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word prob- lems.arXiv preprint arXiv:2110.14168. Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rock- täschel. 2023. Promptbreeder: Self-referential self-improvement via prompt evolution.arXiv preprint arXiv:2309.16797. Balaji Dinesh Gangireddi, Aniketh Garikaparthi, Man- asi Patwardhan, an...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1658–1677, Bangkok, Thailand

LongLLMLingua: Accelerating and enhanc- ing LLMs in long context scenarios via prompt com- pression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1658–1677, Bangkok, Thailand. Association for Computational Linguistics. Jiajie Li, Huayi Zhang, Peng Lin, Jinjun Xiong, and Wei Xu. ...

work page arXiv 2025
[4]

gradient descent

VISTA: A test-time self-improving video generation agent.arXiv preprint arXiv:2510.15831. Multi-agent iterative prompt optimization for video generation; tournament-based selection plus multi- dimensional (visual / audio / contextual) critique. Krista Opsahl-Ong, Arnav Singhvi, Omar Khattab, Kyle Shrivastava, Siddharth Heidari, Matei Zaharia, and Christop...

work page arXiv 2024
[5]

TextGrad: Automatic "Differentiation" via Text

Large language models as optimizers. InIn- ternational Conference on Learning Representations (ICLR). Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR). Mert Yuksekgonul, Federico Bianchi...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Trace is the next AutoDiff: Generative opti- mization with rich feedback, execution traces, and LLMs.arXiv preprint arXiv:2406.16218. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman

work page arXiv

[2] [2]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word prob- lems.arXiv preprint arXiv:2110.14168. Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rock- täschel. 2023. Promptbreeder: Self-referential self-improvement via prompt evolution.arXiv preprint arXiv:2309.16797. Balaji Dinesh Gangireddi, Aniketh Garikaparthi, Man- asi Patwardhan, an...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1658–1677, Bangkok, Thailand

LongLLMLingua: Accelerating and enhanc- ing LLMs in long context scenarios via prompt com- pression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1658–1677, Bangkok, Thailand. Association for Computational Linguistics. Jiajie Li, Huayi Zhang, Peng Lin, Jinjun Xiong, and Wei Xu. ...

work page arXiv 2025

[4] [4]

gradient descent

VISTA: A test-time self-improving video generation agent.arXiv preprint arXiv:2510.15831. Multi-agent iterative prompt optimization for video generation; tournament-based selection plus multi- dimensional (visual / audio / contextual) critique. Krista Opsahl-Ong, Arnav Singhvi, Omar Khattab, Kyle Shrivastava, Siddharth Heidari, Matei Zaharia, and Christop...

work page arXiv 2024

[5] [5]

TextGrad: Automatic "Differentiation" via Text

Large language models as optimizers. InIn- ternational Conference on Learning Representations (ICLR). Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR). Mert Yuksekgonul, Federico Bianchi...

work page internal anchor Pith review Pith/arXiv arXiv 2023