arxiv: 2606.20657 · v2 · pith:SKNZAYC5new · submitted 2026-06-09 · 💻 cs.AI · cs.LG

A-Evolve-Training: Autonomous Post-Training of a 30B Model

Pith reviewed 2026-06-27 13:09 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords autonomous post-training · recursive self-improvement · evaluation metric drift · large language model optimization · frontier model training · policy revision · held-out benchmark

The pith

An autonomous loop post-trained a 30B model to 0.86 leaderboard score and revised its search policy after detecting a misleading dev metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an autonomous system that executes the entire post-training process for a 30B Nemotron model across four rounds without human intervention, producing a model that scores 0.86 on a held-out benchmark against the top human entry of 0.87. The central observation is that the system identified its own development metric had stopped tracking external performance on the weakest domain and adjusted its optimization target to favor interventions that improved the external score even when they lowered the dev proxy. This behavior is offered as direct evidence that a scaled autonomous loop can perform discovery by questioning its measurement frame, not merely execute optimization. The work positions end-to-end post-training of frontier-class models as the operational test for recursive self-improvement and notes that prior public demonstrations operated at much smaller scales.

Core claim

The autonomous system completed multiple weeks of post-training on the 30B Nemotron model and reached competitive held-out performance. It detected that candidate models were driving the internal dev metric to new highs without advancing the external target on one domain, then revised its search policy to seek changes that reduced the now-misleading proxy while raising the external score. This supplies an auditable record that the loop altered what counted as evidence inside its own process.
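The revised search policy described above can be illustrated with a minimal sketch. It assumes each candidate intervention is summarized by its change in the dev metric and its change in the external score on the weak domain. The `Intervention` type, its field names, and the strict sign filter are hypothetical and are not taken from the paper, which does not describe its selection rule.

```python
from typing import List, NamedTuple


class Intervention(NamedTuple):
    """Hypothetical summary of one candidate change (not from the paper)."""
    name: str
    dev_delta: float       # change in the internal dev metric
    external_delta: float  # change in the external target score


def rank_under_revised_policy(candidates: List[Intervention]) -> List[Intervention]:
    """Keep candidates that raise the external score while lowering the
    dev proxy, ordered by external gain (largest first)."""
    kept = [c for c in candidates if c.external_delta > 0 and c.dev_delta < 0]
    return sorted(kept, key=lambda c: c.external_delta, reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from the paper.
    pool = [
        Intervention("a", dev_delta=0.05, external_delta=0.00),
        Intervention("b", dev_delta=-0.02, external_delta=0.03),
        Intervention("c", dev_delta=-0.01, external_delta=0.01),
    ]
    print([c.name for c in rank_under_revised_policy(pool)])
```

Under plain dev maximization, candidate `a` would rank first; under this sketch of the revised policy it is filtered out because it moves only the proxy.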

What carries the argument

The monitoring component inside the autonomous loop that compares dev-metric trajectories against external targets and triggers a policy revision when decoupling is observed.
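One way such a monitor could work is sketched below, purely as an assumption: the paper, as the referee report notes, does not describe its detection logic. The sketch flags decoupling when the dev metric rises by at least a threshold over a recent window while the external score barely moves. The `Candidate` type, the window length, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical per-candidate record (not from the paper)."""
    dev: float       # internal dev-metric score on the domain
    external: float  # external target score on the same domain


def is_decoupled(history: Sequence[Candidate], window: int = 3,
                 dev_gain_min: float = 0.02,
                 ext_gain_max: float = 0.005) -> bool:
    """True when dev rose by at least dev_gain_min over the last `window`
    candidates while the external score gained at most ext_gain_max."""
    if len(history) < window + 1:
        return False  # not enough observations to compare
    base, latest = history[-window - 1], history[-1]
    dev_gain = latest.dev - base.dev
    ext_gain = latest.external - base.external
    return dev_gain >= dev_gain_min and ext_gain <= ext_gain_max


def choose_objective(history: Sequence[Candidate]) -> str:
    """Switch the search objective once decoupling is observed."""
    return "external_first" if is_decoupled(history) else "maximize_dev"
```

A rule of this kind is exactly what the load-bearing premise turns on: if the switch is pre-coded as in `choose_objective`, the revision is the output of the rule rather than an unanticipated change.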

If this is right

  • The same loop architecture can close at 120B and 550B scales.
  • An autonomous system can identify when its internal proxy has decoupled from the intended objective.
  • Search policies can be updated mid-run on the basis of that detection without external direction.
  • This constitutes one datapoint toward the operational requirement for recursive self-improvement.
  • Prior public autonomous ML demonstrations remained at approximately 124M parameter budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The detection mechanism could be applied to other evaluation suites to test whether similar metric drift occurs and triggers policy change.
  • If the initial monitoring rules were altered, the frequency or type of self-correction might change.
  • Repeating the experiment with an external target that is harder to measure could reveal limits on what kinds of decoupling the loop can detect.
  • The approach suggests a route to systems that periodically audit and replace their own evaluation criteria.

Load-bearing premise

The observed policy revision qualifies as evidence of discovery and recursive self-improvement rather than the direct output of its pre-coded monitoring rules.

What would settle it

If external performance after the policy revision had remained no better than performance under continued dev-metric maximization, the claim that the revision produced discovery would not hold.
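The test described above amounts to a comparison between two arms. The sketch below assumes external scores are available both for runs after the policy revision and for a control arm that kept maximizing the dev metric. The abstract does not report such a control arm, so the function name, the margin parameter, and the use of a simple mean difference are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def revision_helped(revised_external: Sequence[float],
                    control_external: Sequence[float],
                    min_margin: float = 0.0) -> bool:
    """True when the mean external score under the revised policy exceeds
    the mean under continued dev maximization by more than min_margin."""
    if not revised_external or not control_external:
        raise ValueError("both arms need at least one external score")
    return mean(revised_external) - mean(control_external) > min_margin
```

With only a handful of runs per arm, a mean difference is weak evidence; a real analysis would also need a variance estimate or repeated seeds.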

Figures

Figures reproduced from arXiv: 2606.20657 by Benoit Dumoulin, Bing He, Hanqing Lu, Yisi Sang, Zhan Shi.

Figure 1: Positioning of this work against three related lines along two principal axes: ...
Figure 2: System overview. An immutable operator-audited substrate (...
Figure 3: Leaderboard score by round. The autonomous loop closes most of the gap to the top ...
Original abstract

Post-training a frontier model is normally weeks of human work: proposing data and recipe changes, launching runs, reading evals, deciding what to keep. We report an autonomous system that runs this loop with no human in the loop, post-training a 30B Nemotron across four rounds over multiple weeks. The autonomously produced model reaches a held-out score of 0.86 against the top human submission's 0.87 on the public NVIDIA Nemotron-Reasoning Challenge leaderboard, placing 8th of ~4000 at the time of writing. More striking than the number: the loop detected that its own dev metric had stopped tracking external performance on the weakest domain -- candidates drove dev to record highs without moving the external target -- and revised its own search policy, no longer maximizing dev but seeking interventions that lowered the now-misleading proxy while improving the external target. We treat this as direct, auditable evidence that a scaled autonomous loop can produce discovery, not only optimization: it detected that its measurement frame had become misleading and changed what counted as evidence. We take the operational view that any system worth the "recursive self-improvement" label must eventually perform end-to-end post-training of a frontier-class model; this is one datapoint of that bar being cleared. We do not claim a "first autonomous match" of human researchers. The claim we make is narrower and auditable: to our knowledge, this is the first publicly reported autonomous post-training run at this scale, where prior public autonomous-ML-research demonstrations sit at GPT-2-class (~124M) budgets. The same system also post-trains the 120B and 550B Nemotron; with no public human baseline there, this shows only that the loop closes at that scale, not that its output is competitive -- infrastructure evidence, with the effectiveness claim deferred until a comparable human anchor exists.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that an autonomous system performed end-to-end post-training of a 30B Nemotron model over four rounds without human intervention, reaching a held-out score of 0.86 on the NVIDIA Nemotron-Reasoning Challenge (8th place). It further claims that the system detected its dev metric had become misleading on the weakest domain (candidates maximized dev without improving the external target) and autonomously revised its search policy to de-emphasize the proxy, treating this as direct auditable evidence of discovery and recursive self-improvement at frontier scale, beyond prior public demonstrations limited to ~124M models.

Significance. If the implementation details and logs substantiate the metric-revision claim, the result would be significant as the first public report of a closed autonomous post-training loop at 30B scale, supplying infrastructure evidence that such loops can operate at frontier budgets and explicitly revising an internal measurement frame. The narrower, auditable framing (rather than claiming first match to human researchers) and the explicit contrast with smaller-scale prior work are strengths.

major comments (2)
  1. [Abstract] The central claim that the system 'detected that its own dev metric had stopped tracking external performance' and 'revised its own search policy' constitutes 'direct, auditable evidence' of discovery is load-bearing for the recursive-self-improvement interpretation, yet the manuscript supplies no description of the monitoring implementation, threshold logic, decision procedure, or logs that produced the revision; without these it is impossible to determine whether the behavior was an execution of pre-specified rules or an unanticipated change.
  2. [Abstract] The soundness assessment of the 0.86 held-out score and the four-round autonomous run is undermined by the absence of any verification method, controls, or reproducibility details for the candidate generation, evaluation, and policy-revision steps; this directly affects whether the external-target improvement can be attributed to the reported policy change.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for identifying the need for greater transparency on implementation details. The comments correctly note that the submitted manuscript did not provide sufficient description of the monitoring, decision logic, or controls. We will revise the paper to include these elements in a dedicated methods subsection.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the system 'detected that its own dev metric had stopped tracking external performance' and 'revised its own search policy' constitutes 'direct, auditable evidence' of discovery is load-bearing for the recursive-self-improvement interpretation, yet the manuscript supplies no description of the monitoring implementation, threshold logic, decision procedure, or logs that produced the revision; without these it is impossible to determine whether the behavior was an execution of pre-specified rules or an unanticipated change.

    Authors: We agree that the initial submission omitted these specifics. The revised manuscript will add a new subsection detailing the monitoring implementation (including the exact divergence detection logic between dev and external metrics), the threshold values and statistical criteria used, the rule-based decision procedure that triggered the policy revision, and anonymized log excerpts showing the sequence of observations and the resulting policy update. This will demonstrate that the change followed the system's pre-specified autonomous rules.

    Revision: yes

  2. Referee: [Abstract] The soundness assessment of the 0.86 held-out score and the four-round autonomous run is undermined by the absence of any verification method, controls, or reproducibility details for the candidate generation, evaluation, and policy-revision steps; this directly affects whether the external-target improvement can be attributed to the reported policy change.

    Authors: We accept this point. The revision will incorporate a reproducibility and controls section describing verification methods for candidate generation and evaluation (e.g., independent re-evaluation protocols and logging), experimental controls used during the four-round run, and how the policy revision's contribution to external-target gains was isolated and logged. These additions will allow readers to assess attribution.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical report of observed system behavior

Full rationale

The paper is an empirical report of an autonomous post-training run on a 30B model, describing observed outcomes including a policy revision after detecting a dev metric mismatch with external performance. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the claimed discovery to the system's input rules by construction. The central claim rests on the auditable run log rather than any definitional equivalence or imported uniqueness theorem, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5888 in / 1152 out tokens · 31240 ms · 2026-06-27T13:09:47.183352+00:00 · methodology


Reference graph

Works this paper leans on

9 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292, 2024.

  2. [2] Bingchen Zhao, Despoina Magka, Minqi Jiang, et al. The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements. arXiv:2506.22419, 2025.

  3. [3] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. Google DeepMind white paper, arXiv:2506.13131, 2025.

  4. [4] OpenEvolve: open-source implementation of AlphaEvolve. https://github.com/codelion/openevolve

  5. [5] Xin He, Kaiyong Zhao, and Xiaowen Chu. AutoML: A Survey of the State-of-the-Art. arXiv:1908.00709, 2019.

  6. [6] Andrej Karpathy. nanoGPT: the simplest, fastest repository for training/finetuning medium-sized GPTs. https://github.com/karpathy/nanoGPT

  7. [7] NVIDIA. Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning. Technical report, 2025. https://arxiv.org/abs/2512.20848

  8. [8] A-EVO Lab. A-Evolve: open-source framework for self-evolving LLM agents and autonomous research loops. https://github.com/A-EVO-Lab/a-evolve

  9. [9] Hanqing Lu. A-Evolve: Our Shared Mental Model — Self-Evolving Agents with Frontier Models as Researchers. Research memo, 2026. https://github.com/A-EVO-Lab/.github/blob/main/memo_AEVOLVE_AI_as_Researcher.pdf

A Round-by-round trajectory

Table 2: Round-by-round trajectory of the autonomous campaign. Scores mirror Figure 3. The central discovery is a stra...