pith. machine review for the scientific record.

arxiv: 2605.05724 · v1 · submitted 2026-05-07 · 💻 cs.MA · cs.AI

Recognition: unknown

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:54 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords auto research · specialist agents · lineage feedback · training recipes · autonomous code edits · evaluator outcomes · benchmark improvement · closed empirical loop

The pith

Specialist agents in a closed empirical loop use lineage feedback from evaluator outcomes to autonomously generate effective program-level edits that improve training recipes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an automated research process can refine machine learning training recipes through repeated cycles of code proposals, experiments, and adaptation based on measured results. Specialist agents divide the space of possible changes and maintain a shared record of prior trials, including failures such as crashes or accuracy shortfalls. This shared lineage allows later proposals to build on what worked or failed rather than starting from scratch each time. The process runs end-to-end without humans selecting proposals, repairing code, or overriding scores after launch. Measurable gains appear across three separate tasks measured by independent external evaluators.

Core claim

The central claim is that lineage feedback within a submitted-trial loop enables specialist agents to convert evaluator outcomes—including crashes, budget overruns, size failures, and accuracy misses—into later program-level recipe edits rather than isolated suggestions. Across 1,197 headline-run trials, the loop produces auditable trajectories of proposals, code diffs, and scores while applying and combining known techniques inside each environment. The same loop yields a 0.81% reduction in Parameter Golf validation bpb, a 38.7% rise in NanoChat-D12 CORE, and a 4.59% drop in CIFAR-10 Airbench96 wallclock time, all starting from public recipes and without human intervention during search.

What carries the argument

The submitted-trial loop, in which each trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal, with specialist agents partitioning recipe surfaces and sharing measured lineage across trials.
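
In schematic form, the loop looks roughly like the sketch below. This is an illustrative reconstruction, not the paper's multi_agent_pg code: the Trial, Blackboard, propose, and run names are assumed, and the lineage_on flag marks the history-sharing dimension that the controls and ablations vary.

    # Minimal sketch of the submitted-trial loop, assuming hypothetical Trial,
    # Blackboard, Specialist, and Evaluator interfaces; illustrative only, not
    # the paper's multi_agent_pg implementation.
    from dataclasses import dataclass, field

    @dataclass
    class Trial:
        specialist: str           # which recipe surface (e.g. optimizer, architecture) proposed the edit
        hypothesis: str           # one-sentence rationale submitted with the edit
        code_diff: str            # executable program-level edit to the training recipe
        outcome: str = "pending"  # e.g. "ok", "crash", "budget_overrun", "size_fail", "accuracy_miss"
        score: float | None = None

    @dataclass
    class Blackboard:
        lineage: list = field(default_factory=list)   # shared record of all measured trials

        def best(self):
            scored = [t for t in self.lineage if t.outcome == "ok" and t.score is not None]
            return min(scored, key=lambda t: t.score, default=None)  # assuming lower is better

    def run_loop(specialists, evaluator, board, n_rounds, lineage_on=True):
        for _ in range(n_rounds):
            for spec in specialists:
                # Lineage-on: proposals are conditioned on every prior outcome,
                # including failures. A stateless control would instead propose
                # from the public starting recipe alone.
                history = board.lineage if lineage_on else []
                base = board.best() if lineage_on else None
                trial = spec.propose(history, base=base)
                trial.outcome, trial.score = evaluator.run(trial.code_diff)  # evaluator owns the outcome
                board.lineage.append(trial)  # feedback visible to every later proposal
        return board.best()

The essential property is that every later proposal can read both the current best recipe and the full record of failures, so a crash or an accuracy miss constrains the next edit rather than being discarded.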

If this is right

  • The loop autonomously writes code, submits experiments, absorbs feedback, and improves public starting recipes across distinct environments.
  • The output is a complete auditable trajectory of proposals, diffs, scores, and failure labels rather than a single model checkpoint.
  • The same submitted-trial structure handles architecture-domain changes such as attention-kernel path rewrites within a strict audit of 157 submissions.
  • Each task is evaluated by its own external evaluator and legality checks with no human overrides after setup and launch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on larger-scale models or additional domains to determine whether the lineage mechanism continues to surface useful edits beyond the three reported tasks.
  • If the pattern holds, the method might reduce reliance on manual hyperparameter and recipe tuning in routine machine-learning development.
  • Combining the loop with broader search techniques could extend autonomous refinement from training recipes to other empirical research steps such as data selection or evaluation design.

Load-bearing premise

Specialist agents can reliably interpret diverse failure signals and lineage history to generate non-trivial, effective program-level code edits that produce generalizable improvements rather than overfitting to the specific trial set or evaluator quirks.
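
As a concrete illustration of what interpreting those signals might involve, the sketch below maps evaluator outcomes to feedback that constrains the next proposal. The outcome labels and guidance text are assumed for illustration and are not the paper's taxonomy.

    # Hypothetical mapping from an evaluator-owned outcome to feedback for the
    # next proposal; labels and guidance strings are illustrative, not the paper's.
    def feedback_from_outcome(trial):
        guidance = {
            "crash": "Repair or avoid the failing code path before attempting further changes.",
            "budget_overrun": "Keep the idea but cut compute cost (fewer steps, smaller batch, cheaper kernel).",
            "size_fail": "Size limit exceeded: shrink parameters or revert the size-increasing edit.",
            "accuracy_miss": "Legal but below the accuracy gate: recover quality before optimizing further.",
            "ok": "Valid measurement: consider combining this edit with the current best recipe.",
        }
        return guidance.get(trial.outcome, "Unknown outcome: inspect logs before proposing again.")

Whether signals of this kind are rich enough to steer non-trivial program edits, rather than merely filter proposals, is exactly the premise the paper asks readers to accept.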

What would settle it

Re-running the identical loop on a fourth, previously unseen task or benchmark and checking whether the edited recipes deliver comparable gains or instead plateau or regress relative to the public starting points.
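
Continuing the sketch above, such a check could be scripted as a single additional run; the task object and its fields are placeholders.

    # Hypothetical generalization check on a fourth, previously unseen task,
    # reusing the run_loop/Blackboard sketch above. All fields are placeholders.
    def generalization_check(new_task, n_rounds=100):
        board = Blackboard()
        best = run_loop(new_task.specialists, new_task.evaluator, board, n_rounds)
        _, baseline = new_task.evaluator.run(new_task.public_recipe_diff)
        if best is None:
            return None  # no valid trial: the loop never produced a legal recipe
        # Percent improvement for lower-is-better metrics (bpb, wallclock);
        # flip the sign for higher-is-better metrics such as CORE.
        return (baseline - best.score) / baseline * 100.0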

Figures

Figures reproduced from arXiv: 2605.05724 by Chenyan Xiong, Hao Kang, Jingjie Ning, Ji Zeng, Xiaochuan Li.

Figure 1: Closed-loop auto research trajectory. Submitted trials connect proposals, executable edits, …
Figure 2: Best-so-far score over submitted trial index. Points are valid measured trials only; ineligible …
Figure 3: Parameter Golf controls over the first 200 trials. …
Figure 4: Parameter Golf GLOBAL_RULES, abridged. The full source is in multi_agent_pg/agents/prompts.py. NanoChat-D12 and CIFAR have analogous global rules with task-specific limits (e.g. NC’s 90-minute pretraining cap, CIFAR’s 0.96 accuracy gate).
Figure 5: One of ten Parameter Golf domain preambles. Each preamble follows the same three-part …
Figure 6: The Parameter Golf Optimizer preamble. Compared with Architecture, the scope is narrower (schedules, weight decay, optimizer family) but the same scope/non-scope/edit-radius shape is preserved.
Figure 7: The Meta-Search preamble explicitly frames the role as an analyst who reads existing results before proposing. This is the only role in Parameter Golf that is allowed to lean on the lineage as its primary input rather than as a check on a freshly proposed edit.
Figure 8: Schematic of the user message rendered for each specialist session …
Figure 9: Schematic of the user message under the no-lineage ablation …
Figure 10: Specialist role partitioning and search behavior across environments. The top row shows …
Figure 11: Specialist outcome profiles across the three environments. Each stacked bar is normalized …
Figure 12: Parameter Golf final recipe schematic. The figure summarizes inherited and rewritten …
Figure 13: NanoChat-D12 final recipe schematic. The figure summarizes the attention-path rewrite, …
Figure 14: CIFAR-10 Airbench96 final recipe schematic. The figure summarizes the schedule rewrite, …
Original abstract

We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials. The central empirical finding is that lineage feedback lets agents turn evaluator outcomes, including crashes, budget overruns, size failures, and accuracy-gate misses, into later program-level recipe edits rather than one-shot suggestions. Across 1,197 headline-run trials plus 600 Parameter Golf control trials after one-time setup and launch, humans did not choose proposals, edit recipes, override scores, or repair failed trials during the search. In the three headline runs, the same submitted-trial loop reduces Parameter Golf validation bpb by $0.81\%$, raises NanoChat-D12 CORE by $38.7\%$, and reduces CIFAR-10 Airbench96 wallclock by $4.59\%$, with each task measured by its own external evaluator and legality checks. The trace includes a strict architecture-domain audit of 157 headline-run submissions and program rewrites such as a NanoChat attention-kernel path change. Within this scope the loop autonomously writes code, submits experiments, absorbs feedback, applies and combines known techniques inside each environment, and improves public starting recipes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes an auto-research framework in which specialist agents operate a closed empirical loop: each trial consists of a hypothesis, executable code edit, external-evaluator outcome, and lineage feedback that informs subsequent proposals. The central claim is that sharing measured lineage (including crashes, budget overruns, size failures, and accuracy misses) enables agents to produce non-trivial program-level recipe edits rather than one-shot suggestions. Across 1,197 headline trials plus 600 Parameter Golf controls, the loop is reported to improve three public starting recipes—reducing Parameter Golf validation bpb by 0.81%, raising NanoChat-D12 CORE by 38.7%, and cutting CIFAR-10 Airbench96 wallclock by 4.59%—with no human intervention in proposal selection, editing, or score repair, and with an auditable trace of 157 architecture-domain submissions.

Significance. If the central attribution holds, the work would constitute a concrete demonstration that multi-agent systems can autonomously discover and combine known techniques to improve training recipes on independent external benchmarks. Positive features include the use of external evaluators, the production of fully auditable trajectories of code diffs and failure labels, and the scale of experimentation (nearly 1,800 trials). These elements support reproducibility and transparency claims that are often missing from agentic-AI papers.

major comments (2)
  1. The central empirical claim (Abstract) that lineage feedback enables agents to convert evaluator outcomes into effective program-level edits is not supported by an ablation that isolates history sharing from stateless independent proposals. The manuscript compares results only to the initial public recipes and does not describe a control arm in which the same specialist agents generate proposals without access to prior trial outcomes, failure labels, or shared lineage. Without this control, the reported deltas cannot be unambiguously attributed to lineage feedback rather than to repeated sampling or post-hoc selection of headline runs.
  2. Experimental Results section: the headline percentage improvements (0.81% bpb, 38.7% CORE, 4.59% wallclock) are presented without reported variance, statistical significance tests, or precise baseline definitions. The 600 Parameter Golf control trials are mentioned but their design (how they differ from the headline loop and whether they isolate the lineage variable) is not specified in sufficient detail to allow assessment of robustness.
minor comments (1)
  1. The abstract states that 'humans did not choose proposals, edit recipes, override scores, or repair failed trials' but does not clarify in the main text what the one-time setup and launch procedure entailed or how the specialist-agent partitioning of recipe surfaces was implemented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for clearer isolation of lineage feedback and improved statistical reporting. We address each major comment below and will revise the manuscript to strengthen these aspects while preserving the core claims supported by the existing experimental trace.

Point-by-point responses
  1. Referee: The central empirical claim (Abstract) that lineage feedback enables agents to convert evaluator outcomes into effective program-level edits is not supported by an ablation that isolates history sharing from stateless independent proposals. The manuscript compares results only to the initial public recipes and does not describe a control arm in which the same specialist agents generate proposals without access to prior trial outcomes, failure labels, or shared lineage. Without this control, the reported deltas cannot be unambiguously attributed to lineage feedback rather than to repeated sampling or post-hoc selection of headline runs.

    Authors: We acknowledge that a direct ablation isolating lineage sharing from stateless proposals would provide stronger causal attribution. The 600 Parameter Golf control trials were run with the same specialist agents but without persistent lineage access across trials (i.e., each proposal generated from the initial recipe only, with no failure labels or prior outcomes shared). However, we agree the manuscript does not describe this distinction with sufficient clarity. In revision we will expand the Experimental Results section with a dedicated subsection detailing the control design, including how stateless runs differ in prompt construction and memory. Given the scale (nearly 1800 total trials), we will also include a post-hoc analysis of proposal quality metrics (e.g., edit novelty and success rate) drawn from the existing logs to compare history-informed vs. stateless behavior, rather than re-running the full set. revision: partial

  2. Referee: Experimental Results section: the headline percentage improvements (0.81% bpb, 38.7% CORE, 4.59% wallclock) are presented without reported variance, statistical significance tests, or precise baseline definitions. The 600 Parameter Golf control trials are mentioned but their design (how they differ from the headline loop and whether they isolate the lineage variable) is not specified in sufficient detail to allow assessment of robustness.

    Authors: We will revise the Experimental Results section to report standard deviations and 95% confidence intervals for all headline deltas, computed across the multiple independent runs within each task. Statistical significance will be assessed via paired t-tests or bootstrap resampling against the initial public recipes, with exact p-values provided. Precise baseline definitions will be added, specifying the exact public starting recipes, evaluator versions, and hardware used for the 0.81%, 38.7%, and 4.59% figures. The 600 control trials will be described in detail, including their stateless design and how they serve as a repeated-sampling baseline. These additions will be placed in a new subsection on robustness and statistical analysis. revision: yes
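
For the promised robustness analysis, the bootstrap interval described above could look like the sketch below. It is illustrative only, uses placeholder inputs rather than the paper's data, and assumes a lower-is-better metric.

    # Illustrative bootstrap confidence interval for a headline delta, assuming
    # the logs provide several independent final scores per task.
    import numpy as np

    def bootstrap_delta_ci(baseline, final_scores, n_boot=10_000, seed=0):
        rng = np.random.default_rng(seed)
        scores = np.asarray(final_scores, dtype=float)
        deltas = np.empty(n_boot)
        for i in range(n_boot):
            sample = rng.choice(scores, size=scores.size, replace=True)
            # Percent improvement over the public starting recipe (lower is better);
            # flip the sign for higher-is-better metrics such as CORE.
            deltas[i] = (baseline - sample.mean()) / baseline * 100.0
        lo, hi = np.percentile(deltas, [2.5, 97.5])
        return lo, hi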

Circularity Check

0 steps flagged

No circularity: empirical measurements on external benchmarks

full rationale

The paper reports direct experimental outcomes from 1,197 trials plus controls, using specialist agents in a closed loop with external evaluators on fixed benchmarks (Parameter Golf bpb, NanoChat CORE, CIFAR-10 wallclock). No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The central claim rests on measured deltas against public starting recipes and independent legality/accuracy checks, with no reduction of results to internal definitions or prior author work by construction. This is a standard non-circular empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on assumptions about reliable external measurement and agent ability to utilize feedback; no free parameters or invented entities are explicitly introduced in the provided text.

axioms (2)
  • domain assumption: External evaluators supply reliable, unbiased outcomes including crashes, overruns, and performance metrics that can be directly used for recipe edits.
    The entire loop depends on these outcomes driving subsequent proposals without human repair.
  • domain assumption: Specialist agents can partition recipe surfaces and combine techniques using shared lineage to produce non-trivial improvements.
    This is required for the claim that feedback produces program-level edits rather than one-shot suggestions.

pith-pipeline@v0.9.0 · 5579 in / 1566 out tokens · 65818 ms · 2026-05-08T03:54:52.913122+00:00 · methodology

