LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Alex Zhang; Bernie Wang; Boran Han; Cuixiong Hu; George Karypis; Haoyang Fang; Huzefa Rangwala; Jiading Gai; Peng Tang; Shuai Zhang

arXiv: 2606.18388 · v1 · pith:CKTYKVIQ · submitted 2026-06-16 · cs.LG · cs.AI · cs.CL · cs.MA

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

Pith reviewed 2026-06-27 01:18 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL · cs.MA

keywords RL post-training · adaptive training strategies · LLM agents · GRPO · parameter scheduling · multi-stage training · tree search

The pith

RL post-training succeeds when capacity parameters increase monotonically while regularization parameters oscillate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that effective RL post-training follows a recurring pattern where capacity parameters accumulate steadily across stages and regularization parameters oscillate to track shifting dynamics. This pattern matters because fixed schedules lock all parameters into unchanging paths and cannot handle the non-stationary exploration-exploitation tradeoffs that regularization parameters must follow. The authors introduce LLMZero, an LLM-agent system that performs tree search over training trajectories, diagnoses issues at checkpoints, and proposes coordinated multi-parameter changes. On four GRPO tasks the discovered strategies deliver large gains over baselines, and the underlying principle appears to transfer across tasks.
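The monotonic-versus-oscillating distinction can be checked mechanically on any per-stage parameter trajectory. The sketch below is one possible reading of that distinction, not the authors' analysis code: it labels a sequence by counting direction changes between consecutive stages. The function name and the example values are hypothetical.

```python
from typing import List, Sequence


def classify(values: Sequence[float]) -> str:
    """Label a per-stage parameter trajectory.

    Assumption: "oscillating" means at least one change of direction
    between consecutive non-zero stage-to-stage deltas.
    """
    deltas: List[float] = [b - a for a, b in zip(values, values[1:]) if b != a]
    if not deltas:
        return "constant"
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if (d1 > 0) != (d2 > 0))
    if flips > 0:
        return "oscillating"
    return "increasing" if deltas[0] > 0 else "decreasing"


# Hypothetical four-stage schedules, for illustration only.
response_length = [1024, 2048, 2048, 4096]   # capacity-like parameter
kl_coefficient = [0.01, 0.05, 0.001, 0.02]   # regularization-like parameter

print(classify(response_length))
print(classify(kl_coefficient))
```

Under this reading, a fixed schedule such as a linear decay can only ever produce the "increasing" or "decreasing" label, which is the limitation the summary attributes to fixed schedules.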

Core claim

What carries the argument

LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions.
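A minimal sketch of the data structure this description implies, assuming each node holds one stage's full hyperparameter configuration and a link to the parent whose checkpoint it resumes from. The class, field, and parameter names (`TrajectoryNode`, `max_response_length`, `kl_coef`) are illustrative assumptions, not names taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrajectoryNode:
    """One training stage, resumed from the parent's checkpoint (assumed layout)."""
    config: Dict[str, float]
    # Excluded from repr/eq to avoid walking the parent/child cycle.
    parent: Optional["TrajectoryNode"] = field(default=None, repr=False, compare=False)
    children: List["TrajectoryNode"] = field(default_factory=list, repr=False, compare=False)
    val_score: Optional[float] = None

    def add_child(self, config: Dict[str, float]) -> "TrajectoryNode":
        child = TrajectoryNode(config=dict(config), parent=self)
        self.children.append(child)
        return child


def strategy(node: TrajectoryNode) -> List[Dict[str, float]]:
    """Recover the multi-stage schedule ending at `node` by walking to the root."""
    stages: List[Dict[str, float]] = []
    current: Optional[TrajectoryNode] = node
    while current is not None:
        stages.append(current.config)
        current = current.parent
    return list(reversed(stages))


# Hypothetical search: two alternatives from the root, then one deeper stage.
root = TrajectoryNode({"max_response_length": 1024, "kl_coef": 0.01})
a = root.add_child({"max_response_length": 2048, "kl_coef": 0.05})
b = root.add_child({"max_response_length": 2048, "kl_coef": 0.001})  # backtrack: sibling of `a`
c = a.add_child({"max_response_length": 4096, "kl_coef": 0.01})

for stage_index, stage_config in enumerate(strategy(c)):
    print(stage_index, stage_config)
```

Backtracking here simply means adding a second child to an earlier node, so every root-to-node path is a distinct multi-stage strategy.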

If this is right

Across four diverse GRPO tasks, the discovered strategies improve over the base model by 9% to 140% relative.
The strategies outperform grid search by 6% to 15% relative and consistently beat random search and skill-based agents.
The structural principle transfers across tasks, explaining why strategies differ in form yet share similar parameter dynamics.
Fixed schedules cannot express non-stationary tradeoffs and therefore underperform adaptive multi-stage rules.
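Because all headline numbers are relative gains, it helps to keep the arithmetic explicit. The sketch below shows the usual definition, (new - baseline) / baseline; the scores are made-up values chosen only to land on the ends of the reported ranges, and the page does not say which absolute scores underlie them.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction (0.06 means 6%)."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (new - baseline) / baseline


# Hypothetical scores, for illustration only.
print(f"{relative_improvement(0.60, 0.25):.0%}")  # weak base model, large relative gain
print(f"{relative_improvement(0.53, 0.50):.0%}")  # strong baseline, small relative gain
```

A large relative gain over a weak base model can correspond to a modest absolute score, so relative and absolute figures are best read together.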

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers could build schedulers that explicitly separate capacity-building phases from dynamic regularization adjustments.
The monotonic-versus-oscillatory distinction may appear in optimization settings outside RL post-training.
Automating pathology diagnosis could shrink the need for manually engineered search spaces in training pipelines.
Testing the same search process on larger models or different RL algorithms would check whether the pattern generalizes.

Load-bearing premise

The LLM agents' diagnoses of pathologies and their proposed multi-parameter transitions produce genuine performance gains rather than artifacts of the search process or task-specific biases.

What would settle it

On held-out tasks, strategies discovered by LLMZero fail to outperform grid search or the observed parameter trajectories lack the monotonic capacity growth paired with oscillatory regularization.

Figures

Figures reproduced from arXiv: 2606.18388 by Alex Zhang, Bernie Wang, Boran Han, Cuixiong Hu, George Karypis, Haoyang Fang, Huzefa Rangwala, Jiading Gai, Peng Tang, Shuai Zhang, Shuo Yang, Wei Zhu, Xuan Zhu, Zhenyu Pan.

**Figure 1.** Overview of LLMZero. The system builds a tree of training trajectories where each node stores a full hyperparameter configuration and resumes from a parent checkpoint, composing multi-stage adaptive strategies via backtracking. At each iteration, the proposer agent analyzes training dynamics (rewards, KL divergence, validation scores, gradient norms) through both text summaries and visual plots, then propo…

**Figure 2.** Test score at the best-validation run so far vs. …

**Figure 3.** Best adaptive strategies across all four tasks. Green solid: validation score. Blue dashed: test score. Each …

**Figure 4.** Model scaling on SSMR-Bench (average across 4 subtasks). LLMZero consistently outperforms baselines across all sizes. Practitioner config failed (OOM) on 8B; LLMZero autonomously found a working configuration. Per-subtask breakdown in Table 7 (Appendix C.3).

**Figure 5.** Best-so-far validation score vs. search iter…

**Figure 6.** Human-written metric descriptions injected into agent prompts to ground LLM reasoning about training …

**Figure 7.** Human-written hyperparameter descriptions injected into the proposer prompt, organized by functional …

**Figure 8.** Proposer agent prompt template. Placeholders are filled with run-specific data at each search iteration.

**Figure 9.** Early stopper agent prompt template. Invoked every 900 seconds during training.

Original abstract

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLM agents with tree search find better adaptive RL post-training schedules than grid search and surface a monotonic-capacity versus oscillating-regularization pattern that transfers across tasks.

LLMZero uses LLM agents running tree search over training trajectories to discover multi-stage RL post-training strategies, and it reports a recurring pattern: capacity parameters accumulate steadily while regularization parameters oscillate to track shifting dynamics.

The approach is new in letting the LLM diagnose pathologies at checkpoints and propose coordinated parameter changes rather than relying on grid search or random sampling. On four diverse GRPO tasks the discovered strategies improve over the base model by 9-140% relative and over grid search by 6-15% relative, while also beating random search and a skill-based agent. The structural principle is presented as explaining why the strategies differ in form yet share similar dynamics, and it appears to transfer.

The paper does well by turning the observation into actionable design rules that fixed schedules cannot express, because they cannot handle the non-stationary exploration-exploitation tradeoffs that regularization must follow.

The main soft spot is verification. The abstract gives no details on the tree-search implementation, the LLM prompts, or ablations that separate the agent's diagnostic proposals from the benefit of simply exploring more trajectories. Without those, it remains possible that the gains come from broader search volume rather than accurate pathology detection. Relative improvements are reported, but run-to-run variance and statistical tests are not mentioned, so the reliability of the 6-15% edge is hard to judge from what is shown.

This is for researchers working on RL post-training and alignment pipelines who want something more adaptive than hand-tuned or grid-searched schedules. A reader focused on practical training heuristics would get value from the pattern.

It deserves a serious referee. The central empirical pattern is coherent and the outperformance is consistent enough to warrant checking the methods and controls in review.

Referee Report

2 major / 1 minor

Summary. The paper introduces LLMZero, a system in which LLM agents perform tree search over RL post-training trajectories for GRPO tasks. At each checkpoint the agents diagnose pathologies and propose coordinated multi-parameter transitions. From the discovered strategies the authors extract a structural principle: capacity parameters accumulate monotonically across stages while regularization parameters oscillate in response to shifting dynamics. On four diverse GRPO tasks the discovered strategies yield 9–140 % relative gains over the base model and 6–15 % relative gains over grid search, consistently beating random search and a skill-based agent; the same capacity-vs-regularization pattern is observed across tasks.

Significance. If the empirical pattern and transfer claim hold, the work supplies both an automated discovery method for adaptive schedules and an actionable design rule that explains why fixed trajectories are suboptimal for non-stationary exploration–exploitation trade-offs. The consistent outperformance over strong baselines and the cross-task regularity constitute a concrete contribution to RL post-training methodology.

major comments (2)

[Abstract] Abstract: the claim that the structural principle 'transfers across tasks' is load-bearing for the explanatory contribution, yet the abstract provides no explicit cross-task transfer experiment (e.g., applying a schedule discovered on task A to task B without re-running the agent). Without such a test it remains unclear whether the shared parameter dynamics are independently predictive or merely post-hoc observations on the same four runs.
[Abstract] Abstract (results paragraph): the reported 6–15 % gains over grid search and consistent superiority to the skill-based agent are central to the empirical claim, but no information is given on the number of independent runs, standard errors, or statistical tests. In RL post-training, where variance is typically high, these details are required to establish that the observed differences are not attributable to training stochasticity or unequal search budgets.

minor comments (1)

[Abstract] The abstract uses the term 'GRPO tasks' without a brief parenthetical expansion or citation on first use; a short definition would improve accessibility for readers outside the immediate sub-area.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important clarifications needed for the transfer claim and statistical reporting. We address each point below.

Referee: [Abstract] Abstract: the claim that the structural principle 'transfers across tasks' is load-bearing for the explanatory contribution, yet the abstract provides no explicit cross-task transfer experiment (e.g., applying a schedule discovered on task A to task B without re-running the agent). Without such a test it remains unclear whether the shared parameter dynamics are independently predictive or merely post-hoc observations on the same four runs.

Authors: The referee is correct that we did not conduct an explicit transfer experiment in which a schedule discovered on one task is applied zero-shot to another task. The evidence in the manuscript consists of the LLM agent independently discovering the same qualitative pattern (monotonic capacity accumulation, oscillating regularization) when run separately on each of the four tasks. We will revise the abstract to state that the principle 'is observed consistently across tasks' rather than claiming transfer, to accurately reflect the reported results without overstating them.

Revision: partial

Referee: [Abstract] Abstract (results paragraph): the reported 6–15 % gains over grid search and consistent superiority to the skill-based agent are central to the empirical claim, but no information is given on the number of independent runs, standard errors, or statistical tests. In RL post-training, where variance is typically high, these details are required to establish that the observed differences are not attributable to training stochasticity or unequal search budgets.

Authors: We agree that the absence of run counts, standard errors, and statistical tests is a limitation given the known variance in RL post-training. We will add this information to the revised manuscript, reporting results averaged over multiple independent runs with standard errors and appropriate significance tests against the baselines.

Revision: yes
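The reporting the referee asks for can be sketched with the standard library alone. The example below computes a mean with its standard error over independent runs and a Welch t statistic between two methods. The per-seed scores are invented for illustration and the choice of Welch's test is an assumption; the page does not say which test the authors would use.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_se(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (needs at least two runs)."""
    mean = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances.

    Undefined (division by zero) if both samples have zero variance.
    """
    mean_a, se_a = mean_and_se(a)
    mean_b, se_b = mean_and_se(b)
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical test scores from three seeds per method.
agent_runs = [0.62, 0.60, 0.64]
grid_runs = [0.55, 0.57, 0.56]

agent_mean, agent_se = mean_and_se(agent_runs)
grid_mean, grid_se = mean_and_se(grid_runs)
print(f"agent: {agent_mean:.3f} +/- {agent_se:.3f}")
print(f"grid:  {grid_mean:.3f} +/- {grid_se:.3f}")
print(f"Welch t = {welch_t(agent_runs, grid_runs):.2f}")
```

Turning the statistic into a p-value additionally requires the Welch-Satterthwaite degrees of freedom and a t distribution, which is omitted here to keep the sketch dependency-free.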

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports an empirical discovery: LLM agents perform tree search over training trajectories, observe that capacity parameters increase monotonically while regularization parameters oscillate, and note that this pattern transfers across four GRPO tasks with consistent gains over baselines. This observation is presented as emerging from the search outputs rather than being presupposed by the method or by any self-citation. No equations, fitted parameters, or uniqueness theorems are shown to reduce the claimed principle to the search inputs by construction. The central result remains an externally falsifiable empirical pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, axioms, or invented entities are identifiable; the work is empirical and agent-based.

Reference graph

Works this paper leans on

13 extracted references · 3 linked inside Pith

[1] Population based training of neural networks. Preprint, arXiv:1711.09846. Yunjie Ji, Sitong Zhao, Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Yiping Peng, Han Zhao, and Xiangang Li. 2025. How difficulty-aware staged reinforcement learning enhances llms’ reasoning capabilities: A preliminary experimental study. Preprint, arXiv:2504.00829. Zhengyao Jian...

[2] Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations. Preprint, arXiv:2505.21318. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Hyperband: A novel bandit-based approach to hyperparameter optimization. Preprint, arXiv:1603.06560. Tengxiao Liu, Deepak Nathani, Zekun Li, Kevin...

[3] Direct preference optimization: Your language model is secretly a reward model. Preprint, arXiv:2305.18290. Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, and Maksym Andriushchenko. 2026. Posttrainbench: Can llm agents automate llm post-training? arXiv preprint arXiv:2603.08640. John Schulman, Filip Wolski, Praf...

[4] Practical bayesian optimization of machine learning algorithms. Preprint, arXiv:1206.2944. Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, and Feng Zhang. 2025. Fastcurl: Curriculum reinforcement learning with stage-wise context scaling for efficient training r1-like reasoning models. Preprint, arXiv:2503.17287. Fanqi Wan, Weizhou Shen, ...

[5] Mimo: Unlocking the reasoning potential of language model – from pretraining to posttraining. Preprint, arXiv:2505.07608. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Preprint, arXiv:2305.10601. Chujie Zheng, Shix...

[6] ... uses MCTS for ML pipeline configuration; AIDE (Jiang et al., 2025) applies tree search to data science competitions; MLZero (Fang et al., 2025) provides end-to-end automation across modalities; and AlphaEvolve (Novikov et al., 2025) applies evolutionary search to code. LLMZero targets a fundamentally different search space: RL post-training trajectorie...

[7] ... demonstrates that properly scaffolded agents can autonomously compose highly efficient data-selection policies that outperform standard baselines. These data-centric approaches complement our dynamics-aware search by targeting dataset optimization rather than training trajectory search. LLM post-training methods. Current LLM pipelines utilize a variety ...

[8] Compute UCT for all existing non-terminal children $\{c_1, \ldots, c_k\}$ (Eq. 5).

[9] Compute UCT for the virtual new child (Eq. 6) with $Q_{\text{prior}} = f_T(\hat{s}_p)$ and $N_{\text{fair}} = N(p)/(k+1)$.

[10] If $\mathrm{UCT}(\text{new}) > \max_i \mathrm{UCT}(c_i)$ and $k < k_{\max}$: expand (create new child at $p$).

[11] Otherwise: descend into $\arg\max_i \mathrm{UCT}(c_i)$ and repeat. This mechanism naturally adapts breadth vs. depth: when children underperform their parent, the virtual child’s prior wins, triggering exploration of a new transition from the same checkpoint.

E Detailed Per-Run Results. This section reports the full hyperparameter configuration and performance for eve...
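The expand-or-descend rule in items [8] to [11] can be sketched as follows. This excerpt does not reproduce Eq. 5 or Eq. 6, so the code assumes the textbook UCT form, Q + c * sqrt(ln N(p) / N), for both the real children and the virtual child; the exploration constant `c`, the `Child` class, and all numeric values are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Child:
    q: float      # mean value estimate of this child
    visits: int   # N(c), assumed >= 1


def uct(q: float, n_parent: float, n_child: float, c: float = 1.4) -> float:
    """Textbook UCT score; the paper's Eq. 5 and Eq. 6 may differ in detail."""
    return q + c * math.sqrt(math.log(n_parent) / n_child)


def select(children: List[Child], n_parent: int, q_prior: float, k_max: int) -> Union[str, int]:
    """Return "expand" or the index of the child to descend into (n_parent >= 1)."""
    k = len(children)
    if k == 0:
        return "expand"
    n_fair = n_parent / (k + 1)                  # fair visit share for the virtual child
    new_score = uct(q_prior, n_parent, n_fair)   # virtual new child, scored from the parent's prior
    child_scores = [uct(ch.q, n_parent, ch.visits) for ch in children]
    best = max(child_scores)
    if new_score > best and k < k_max:
        return "expand"
    return child_scores.index(best)


# Children that underperform the parent's prior: the virtual child wins.
print(select([Child(0.40, 3), Child(0.42, 3)], n_parent=6, q_prior=0.55, k_max=4))
# A strong, lightly visited child: descend into it instead.
print(select([Child(0.90, 2), Child(0.42, 4)], n_parent=6, q_prior=0.55, k_max=4))
```

Giving the virtual child `n_fair` visits rather than zero keeps its exploration bonus finite, so expansion competes with existing children on comparable terms instead of always winning.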
[12] Current step >= 10 (too early to judge before that)

[13] The validation score trajectory has no realistic chance of exceeding the best validation score seen so far, considering the improvement rate, not just the current value. A run behind the best can still win if its trajectory is steeper; a run ahead can still lose if it is plateauing. **Only the validation score determines STOP/CONTINUE.** All other metric...
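In the paper this decision is made by an LLM agent reading a prompt, so the code below is only a numeric stand-in for the two criteria quoted in items [12] and [13]: wait until step 10, then stop when the validation trajectory, extrapolated at its recent improvement rate, cannot reach the best score seen so far. The linear extrapolation, the window size, and the function name are assumptions, not the authors' rule.

```python
from typing import Sequence


def should_stop(val_scores: Sequence[float], best_so_far: float,
                steps_remaining: int, min_step: int = 10, window: int = 5) -> bool:
    """Heuristic STOP/CONTINUE decision based only on validation scores.

    Assumes one validation score per step and window >= 2.
    """
    if len(val_scores) < min_step:
        return False  # too early to judge
    recent = val_scores[-window:]
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)   # recent improvement rate
    projected = recent[-1] + max(slope, 0.0) * steps_remaining
    return projected <= best_so_far


# A run behind the best but still climbing is allowed to continue.
climbing = [0.30 + 0.02 * i for i in range(10)]
print(should_stop(climbing, best_so_far=0.60, steps_remaining=20))

# A run that has plateaued below the best is stopped.
plateau = [0.50] * 12
print(should_stop(plateau, best_so_far=0.60, steps_remaining=20))
```

Projecting from the slope rather than comparing current values is what lets a lagging but steep run survive, which matches the intent stated in item [13].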
