pith. sign in

arxiv: 2606.11119 · v1 · pith:C3QGKOUXnew · submitted 2026-06-09 · 💻 cs.LG · cs.AI· cs.CL

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Pith reviewed 2026-06-27 13:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords agentic reinforcement learningrollout budget allocationreward contrasttree-structured rolloutspolicy optimizationlarge language modelsReAct-style agentsmulti-turn reasoning
0
0 comments X

The pith

TRACE allocates rollout budgets to intermediate prefixes in tree-structured agentic rollouts to raise reward contrast at fixed sampling cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that rollout-intensive policy optimization for language model agents suffers from weak training signals when rewards are assigned only at the end of a full interaction and when prompts produce little variation in outcomes. It proposes modeling each thought-action-observation turn as a distinct node so that budget can be steered not only to entire prompts but also to specific prefixes inside those prompts. A single learned predictor reads the history up to any such prefix and estimates the chance of eventual success, directing samples toward nodes expected to produce both success and failure outcomes. When this allocation runs inside a fixed total budget, the resulting collection of terminal rewards supplies clearer contrast for updating the policy, which the experiments link to accuracy lifts such as 2.8 points on a multi-hop QA task.

Core claim

TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal.

What carries the argument

Tree Rollout Allocation for Contrastive Exploration (TRACE), the mechanism that extends budget decisions from whole prompts to turn-level prefixes inside tree-structured rollouts by using a predictor to target nodes expected to produce mixed terminal rewards.

If this is right

  • Performance and efficiency gains appear on typical agentic benchmarks at equal sampling cost.
  • Qwen3-14B Multi-Hop QA accuracy rises by 2.8 points over competitive baselines.
  • Outcome-only rewards become more informative because allocation favors prefixes with uncertain futures.
  • Adaptive tree structures form that supply stronger gradients for policy updates within the same total rollout count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prefix-level predictor could be reused across different tasks or models without retraining from scratch, lowering the overhead of the method.
  • Applying the tree allocation idea to non-agentic chain-of-thought training might reduce wasted samples on easy or impossible prompts in ordinary reasoning benchmarks.
  • Longer-horizon agent tasks with dozens of turns would likely benefit most, because the number of candidate prefixes grows and the predictor has more opportunities to prune low-contrast branches.

Load-bearing premise

A shared generalizable predictor can accurately estimate conditional success probability at intermediate prefixes from prefix histories and thereby guide allocation to nodes that produce mixed terminal rewards.

What would settle it

Measure the actual rate of mixed terminal rewards among predictor-chosen prefixes versus randomly chosen prefixes in the same rollout trees; if the predictor-selected prefixes do not show reliably higher variance, the allocation advantage disappears.

Figures

Figures reproduced from arXiv: 2606.11119 by Heming Zou, Kai Yang, Lizhou Cai, Qi Wang, Ru Peng, Saiyong Yang, Weijie Liu, Xiangyang Ji, Xin Xu, Yixiu Mao, Yuhang Jiang, Yun Qu.

Figure 1
Figure 1. Figure 1: TRACE redirects a fixed rollout budget toward contrast-rich roots and prefixes, converting scalable outcome rewards into denser mixed-reward contrast and implicit stepwise preference pairs than uniform allocation. The above viewpoint naturally turns flat rollout collection into tree-structured rollouts. Prompt roots are depth-zero anchors and internal prefixes are non-root turn-level anchors, so root-level… view at source ↗
Figure 2
Figure 2. Figure 2: Contrastive-allocation diagnostics. Using rollout results collected over several steps with Qwen3-8B under the Multi-Hop QA (HotpotQA) tree-sampling setting, panel (a) shows that many prompt-root and prefix anchors have empirical success rate pˆ near 0 or 1, where outcome contrast is scarce. Panel (b) measures each anchor’s pair contrast as pˆh (1 − pˆh ); the x-axis is the fraction of rollout budget assig… view at source ↗
Figure 3
Figure 3. Figure 3: Framework overview of TRACE. A prefix value predictor scores prompt roots and visited prefixes, TRACE solves budgeted root allocation and prompt-local prefix expansion, and the resulting rollout trees provide recursive value targets and root-/prefix-level comparisons for tree-aware policy optimization. and sets Vroot(xi , 0) = Vroot(xi , 1) = 0. This is the predicted probability that m root rollouts for xi… view at source ↗
Figure 4
Figure 4. Figure 4: Test accuracy during training. The six panels cover Mathematical Reasoning (DeepScaler), Multi-Hop QA (HotpotQA), and Function Calling (BFCL v4) with Qwen3-8B (top) and Qwen3-14B (bottom). We compare GRPO, PCL, random TreePO allocation, and TRACE under the same rollout￾budget setting. Higher curves indicate stronger final policies under identical sampling budgets. 0 20 40 60 80 Training Steps 0.2 0.3 0.4 0… view at source ↗
Figure 5
Figure 5. Figure 5: Effective ratio during training. The six panels cover Mathematical Reasoning (DeepScaler), Multi-Hop QA (HotpotQA), and Function Calling (BFCL v4) with Qwen3-8B (top) and Qwen3-14B (bottom). The panels compare the fraction of contrastive samples selected by each allocation strategy. Higher values indicate more non-degenerate reward groups per update. rollout budget toward contrast-rich roots and prefixes r… view at source ↗
Figure 6
Figure 6. Figure 6: Prompt-level prediction quality during training. The six panels cover Mathematical Reasoning (DeepScaler), Multi-Hop QA (HotpotQA), and Function Calling (BFCL v4) with Qwen3-8B (top) and Qwen3-14B (bottom). The panels show Spearman’s rank correlation between predicted prompt difficulty and empirical success rate, with the associated significance shown on the right axis. 0 20 40 60 80 Training Steps 0.0 0.1… view at source ↗
Figure 7
Figure 7. Figure 7: Prefix-level prediction quality during training. The six panels cover Mathematical Reasoning (DeepScaler), Multi-Hop QA (HotpotQA), and Function Calling (BFCL v4) with Qwen3-8B (top) and Qwen3-14B (bottom). The panels show Spearman’s rank correlation between predicted prefix difficulty and empirical continuation success, with the associated significance shown on the right axis. attached to activated anchor… view at source ↗
Figure 8
Figure 8. Figure 8: Llama-3.2-3B Multi-Hop QA average accuracy. We report the four-benchmark average over Hot￾potQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. 0 5 10 15 20 25 30 Training Steps 0.4 0.5 0.6 0.7 Effective Ratio Effective Ratio GRPO TreePO TRACE 0 5 10 15 20 25 30 Training Steps 0.0 0.1 0.2 0.3 0.4 0.5 Prompt Correlation Prompt Correlation Prompt Correlation p-value 0.0 0.2 0.4 0.6 0.8 p-value 0 5 10 15 20 25 30 T… view at source ↗
Figure 9
Figure 9. Figure 9: Llama-3.2-3B allocation and predictor diagnostics on Multi-Hop QA (HotpotQA). The panels follow Figures 5–7, showing effective ratio, prompt-level correlation, and prefix-level correlation. Correlation panels show the predictor diagnostics [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Training time breakdown on Multi-Hop QA (HotpotQA). We report average wall-clock time for TRACE with Qwen3-8B and Qwen3-14B. The current implementation uses an unoptimized predictor scoring path, while predictor parameter updates remain a small fraction of total runtime. 0.0-0.1 0.1-0.2 0.2-0.3 0.3-0.4 0.4-0.5 0.5-0.6 0.6-0.7 0.7-0.8 0.8-0.9 0.9-1.0 Relative anchor position in bare rollout 0 5 10 15 20 K-… view at source ↗
Figure 11
Figure 11. Figure 11: Stage 2 allocation behavior on Function Calling (BFCL v4). We average over the common late training window. The left panel bins anchors by relative position within their bare rollout and weights each anchor by the assigned continuation count Ki,j,t . The right panel reports the share of the total Stage 2 continuation budget assigned to each absolute anchor depth. TreePO allocation by 3.1 points. The gains… view at source ↗
Figure 12
Figure 12. Figure 12: Stage 1 root allocation summary on Function Calling (BFCL v4). Panel (a) shows the fraction of candidate prompts receiving mi = 0 or mi ≥ 2; by construction TRACE uses mi ∈ {0} ∪ {2, 3, . . .}, so mi = 1 has zero mass. Panel (b) shows active-prompt mi mean, median, and two-standard-deviation range over the same late training window. Exact per-integer counts within the active set require logging the full i… view at source ↗
read the original abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes TRACE, a unified rollout budget allocation framework for efficient agentic RLVR. It models ReAct-style multi-turn rollouts as tree-structured graphs with semantically distinct turn-level nodes, extending allocation beyond prompt roots to intermediate prefixes. A shared generalizable predictor estimates conditional success probability from prefix histories to allocate budget preferentially to nodes likely to produce mixed terminal rewards, thereby enriching outcome-only feedback and amplifying the policy-update signal within a fixed sampling budget. Empirically, TRACE is reported to achieve competitive performance and efficiency gains on agentic benchmarks, including a 2.8-point average accuracy improvement for Qwen3-14B on Multi-Hop QA over competitive baselines at equal sampling cost.

Significance. If the predictor reliably identifies mixed-reward prefixes and the resulting tree allocation demonstrably increases reward contrast without introducing systematic bias, the framework could provide a practical, generalizable improvement in sample efficiency for multi-turn agentic RLVR, addressing limitations of prompt-level allocation methods. The explicit tree modeling of turn-level nodes and the predictor-guided allocation represent a coherent extension of existing rollout strategies.

major comments (2)
  1. [Abstract] Abstract: The claim of a 2.8-point accuracy improvement on Qwen3-14B Multi-Hop QA supplies no experimental protocol, baseline definitions, statistical tests, ablation results, or variance estimates, rendering the reported gain unverifiable against the method's contribution.
  2. [Abstract] Abstract: The shared generalizable predictor that estimates P(success | prefix history) to guide allocation to mixed-reward nodes is load-bearing for the claimed contrast improvement; however, no details on training data, architecture, calibration, or measured error rates are provided. Systematic bias in the predictor would collapse the tree allocation to near-random or prompt-level selection, erasing the reported efficiency gains.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'ReAct-style thought-action-observation turn' is introduced without a citation or one-sentence definition, which may reduce accessibility for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues with the abstract's conciseness. We agree that the reported gains and the predictor require clearer context for verifiability and will revise the abstract accordingly while preserving its length constraints. Details supporting the claims appear in the experimental sections of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of a 2.8-point accuracy improvement on Qwen3-14B Multi-Hop QA supplies no experimental protocol, baseline definitions, statistical tests, ablation results, or variance estimates, rendering the reported gain unverifiable against the method's contribution.

    Authors: We acknowledge the abstract's brevity omits these elements. The evaluation protocol (including Qwen3-14B, Multi-Hop QA task, equal-cost sampling, and competitive baselines such as uniform rollout allocation), statistical tests, ablations, and variance estimates are detailed in Section 4 and Appendix C. We will revise the abstract to include a brief statement of the evaluation setting and reference to the main results table for verification. revision: yes

  2. Referee: [Abstract] Abstract: The shared generalizable predictor that estimates P(success | prefix history) to guide allocation to mixed-reward nodes is load-bearing for the claimed contrast improvement; however, no details on training data, architecture, calibration, or measured error rates are provided. Systematic bias in the predictor would collapse the tree allocation to near-random or prompt-level selection, erasing the reported efficiency gains.

    Authors: The abstract summarizes the high-level approach. Predictor specifics (training on prefix histories from warm-up rollouts, shared MLP architecture, calibration procedure, and validation error rates) are provided in Section 3.2, with empirical checks against bias in Section 4.3. We will expand the abstract with one sentence on the predictor to address potential concerns about its reliability and role in the allocation. revision: yes

Circularity Check

0 steps flagged

No circularity: allocation rule and predictor are independent of reported gains

full rationale

The abstract and description present TRACE as an empirical allocation procedure that trains a separate predictor on prefix histories to estimate conditional success probabilities, then uses those estimates to select nodes. No equations, derivations, or self-citations are shown that would make the reported 2.8-point gain or contrast improvement reduce by construction to a fitted parameter or input quantity defined within the method itself. The framework is self-contained against external benchmarks with no load-bearing self-referential steps visible.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence and accuracy of a trainable predictor for prefix success probability and on the modeling assumption that mixed terminal rewards at prefixes enrich outcome-only feedback; both are introduced without independent verification in the abstract.

free parameters (1)
  • predictor parameters
    The shared generalizable predictor must be trained or fitted to estimate conditional probabilities from histories.
axioms (1)
  • domain assumption Allocating samples to prefixes expected to yield mixed terminal rewards enriches the policy-update signal
    This premise underpins the claim that the adaptive tree structure amplifies learning from outcome-only rewards.
invented entities (1)
  • Tree-structured rollouts with turn-level nodes no independent evidence
    purpose: To allow budget allocation to extend from prompt roots to intermediate prefixes
    New modeling choice that treats each ReAct turn as a semantically distinct node.

pith-pipeline@v0.9.1-grok · 5840 in / 1424 out tokens · 25816 ms · 2026-06-27T13:53:46.125105+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

88 extracted references · 23 linked inside Pith

  1. [1]

    arXiv preprint arXiv:2412.16720 , year=

    Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

  2. [2]

    Nature , volume=

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , author=. Nature , volume=. 2025 , publisher=

  3. [3]

    5: Scaling reinforcement learning with llms , author=

    Kimi k1. 5: Scaling reinforcement learning with llms , author=. arXiv preprint arXiv:2501.12599 , year=

  4. [4]

    arXiv preprint arXiv:2505.09388 , year=

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  5. [5]

    arXiv preprint arXiv:2402.03300 , year=

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

  6. [6]

    arXiv preprint arXiv:1707.06347 , year=

    Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

  7. [7]

    Advances in neural information processing systems , volume=

    Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

  8. [8]

    Notion Blog , volume=

    Deepscaler: Surpassing o1-preview with a 1.5 b model by scaling rl , author=. Notion Blog , volume=

  9. [9]

    International Conference on Learning Representations , volume=

    Let's verify step by step , author=. International Conference on Learning Representations , volume=

  10. [10]

    Advances in neural information processing systems , volume=

    Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=

  11. [11]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  12. [12]

    Advances in Neural Information Processing Systems , volume=

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

  13. [13]

    arXiv preprint arXiv:1803.05457 , year=

    Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

  14. [14]

    arXiv preprint arXiv:2311.12022 , year=

    Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

  15. [15]

    Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

    HotpotQA: A dataset for diverse, explainable multi-hop question answering , author=. Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

  16. [16]

    Proceedings of the 28th International Conference on Computational Linguistics , pages=

    Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps , author=. Proceedings of the 28th International Conference on Computational Linguistics , pages=

  17. [17]

    Transactions of the Association for Computational Linguistics , volume=

    MuSiQue: Multihop Questions via Single-hop Question Composition , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

  18. [18]

    Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

    Measuring and narrowing the compositionality gap in language models , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

  19. [19]

    arXiv preprint arXiv:2212.03533 , year=

    Text embeddings by weakly-supervised contrastive pre-training , author=. arXiv preprint arXiv:2212.03533 , year=

  20. [20]

    Forty-second International Conference on Machine Learning , year=

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models , author=. Forty-second International Conference on Machine Learning , year=

  21. [21]

    arXiv preprint arXiv:2510.01135 , year=

    Prompt curriculum learning for efficient llm post-training , author=. arXiv preprint arXiv:2510.01135 , year=

  22. [22]

    Advances in Neural Information Processing Systems , volume=

    Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts , author=. Advances in Neural Information Processing Systems , volume=

  23. [23]

    arXiv preprint arXiv:2509.25849 , year=

    Knapsack rl: Unlocking exploration of llms via optimizing budget allocation , author=. arXiv preprint arXiv:2509.25849 , year=

  24. [24]

    arXiv preprint arXiv:2210.03629 , year=

    React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=

  25. [25]

    Advances in Neural Information Processing Systems , volume=

    Cppo: Accelerating the training of group relative policy optimization-based reasoning models , author=. Advances in Neural Information Processing Systems , volume=

  26. [26]

    Advances in Neural Information Processing Systems , volume=

    Dapo: An open-source llm reinforcement learning system at scale , author=. Advances in Neural Information Processing Systems , volume=

  27. [27]

    Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Online difficulty filtering for reasoning oriented reinforcement learning , author=. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  28. [28]

    arXiv preprint arXiv:2509.21240 , year=

    Tree search for llm agent reinforcement learning , author=. arXiv preprint arXiv:2509.21240 , year=

  29. [29]

    arXiv preprint arXiv:2506.05183 , year=

    Treerpo: Tree relative policy optimization , author=. arXiv preprint arXiv:2506.05183 , year=

  30. [30]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Treerl: Llm reinforcement learning with on-policy tree search , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  31. [31]

    arXiv preprint arXiv:2502.03387 , year=

    LIMO: Less is More for Reasoning , author=. arXiv preprint arXiv:2502.03387 , year=

  32. [32]

    arXiv preprint arXiv:2502.11886 , year=

    Limr: Less is more for rl scaling , author=. arXiv preprint arXiv:2502.11886 , year=

  33. [33]

    arXiv preprint arXiv:2503.10460 , year=

    Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond , author=. arXiv preprint arXiv:2503.10460 , year=

  34. [34]

    arXiv preprint arXiv:2503.24290 , year=

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model , author=. arXiv preprint arXiv:2503.24290 , year=

  35. [35]

    5-math technical report: Toward mathematical expert model via self-improvement , author=

    Qwen2. 5-math technical report: Toward mathematical expert model via self-improvement , author=. arXiv preprint arXiv:2409.12122 , year=

  36. [36]

    arXiv preprint arXiv:2504.05185 , year=

    Concise reasoning via reinforcement learning , author=. arXiv preprint arXiv:2504.05185 , year=

  37. [37]

    arXiv preprint arXiv:2504.20571 , year=

    Reinforcement learning for reasoning in large language models with one training example , author=. arXiv preprint arXiv:2504.20571 , year=

  38. [38]

    2025 , eprint=

    Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? , author=. 2025 , eprint=

  39. [39]

    arXiv preprint arXiv:2505.24864 , year=

    ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models , author=. arXiv preprint arXiv:2505.24864 , year=

  40. [40]

    arXiv preprint arXiv:2502.01456 , year=

    Process reinforcement through implicit rewards , author=. arXiv preprint arXiv:2502.01456 , year=

  41. [41]

    CoRR , year=

    Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning , author=. CoRR , year=

  42. [42]

    arXiv preprint arXiv:2504.14286 , year=

    Srpo: A cross-domain implementation of large-scale reinforcement learning on llm , author=. arXiv preprint arXiv:2504.14286 , year=

  43. [43]

    2025 , eprint=

    CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs , author=. 2025 , eprint=

  44. [44]

    arXiv preprint arXiv:2505.14970 , year=

    Self-Evolving Curriculum for LLM Reasoning , author=. arXiv preprint arXiv:2505.14970 , year=

  45. [45]

    arXiv preprint arXiv:2603.10887 , year=

    Dynamics-predictive sampling for active RL finetuning of large reasoning models , author=. arXiv preprint arXiv:2603.10887 , year=

  46. [46]

    Advances in neural information processing systems , volume=

    Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=

  47. [47]

    arXiv preprint arXiv:2305.08291 , year=

    Large language model guided tree-of-thought , author=. arXiv preprint arXiv:2305.08291 , year=

  48. [48]

    arXiv preprint arXiv:2408.03314 , year=

    Scaling llm test-time compute optimally can be more effective than scaling model parameters , author=. arXiv preprint arXiv:2408.03314 , year=

  49. [49]

    arXiv preprint arXiv:2407.01476 , year=

    Tree search for language model agents , author=. arXiv preprint arXiv:2407.01476 , year=

  50. [50]

    arXiv preprint arXiv:2310.04406 , year=

    Language agent tree search unifies reasoning acting and planning in language models , author=. arXiv preprint arXiv:2310.04406 , year=

  51. [51]

    5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search , author=

    Deepseek-prover-v1. 5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search , author=. International Conference on Learning Representations , volume=

  52. [52]

    arXiv preprint arXiv:2309.17179 , year=

    Alphazero-like tree-search can guide large language model decoding and training , author=. arXiv preprint arXiv:2309.17179 , year=

  53. [53]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    Advancing process verification for large language models via tree-based preference learning , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  54. [54]

    arXiv preprint arXiv:2405.00451 , year=

    Monte carlo tree search boosts reasoning via iterative preference learning , author=. arXiv preprint arXiv:2405.00451 , year=

  55. [55]

    outcome reward: Which is better for agentic rag reinforcement learning , author=

    Process vs. outcome reward: Which is better for agentic rag reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  56. [56]

    arXiv preprint arXiv:2406.18629 , year=

    Step-dpo: Step-wise preference optimization for long-chain reasoning of llms , author=. arXiv preprint arXiv:2406.18629 , year=

  57. [57]

    Advances in Neural Information Processing Systems , volume=

    Iterative tool usage exploration for multimodal agents via step-wise preference tuning , author=. Advances in Neural Information Processing Systems , volume=

  58. [58]

    Advances in Neural Information Processing Systems , volume=

    Rest-mcts*: Llm self-training via process reward guided tree search , author=. Advances in Neural Information Processing Systems , volume=

  59. [59]

    Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment , author=

  60. [60]

    Advances in Neural Information Processing Systems , volume=

    Segment policy optimization: Effective segment-level credit assignment in rl for large language models , author=. Advances in Neural Information Processing Systems , volume=

  61. [61]

    International conference on machine learning , pages=

    Curiosity-driven exploration by self-supervised prediction , author=. International conference on machine learning , pages=. 2017 , organization=

  62. [62]

    Advances in neural information processing systems , volume=

    Hindsight experience replay , author=. Advances in neural information processing systems , volume=

  63. [63]

    Icml , volume=

    Policy invariance under reward transformations: Theory and application to reward shaping , author=. Icml , volume=. 1999 , organization=

  64. [64]

    Advances in Neural Information Processing Systems , volume=

    Rudder: Return decomposition for delayed rewards , author=. Advances in Neural Information Processing Systems , volume=

  65. [65]

    Bmj , volume=

    Spearman's rank correlation coefficient , author=. Bmj , volume=. 2014 , publisher=

  66. [66]

    Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , pages=

    Dense passage retrieval for open-domain question answering , author=. Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , pages=

  67. [67]

    2024 , journal =

    HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

  68. [68]

    2025 , note=

    rLLM: A Framework for Post-Training Language Agents , author=. 2025 , note=

  69. [69]

    Advances in Neural Information Processing Systems , volume=

    Group-in-group policy optimization for llm agent training , author=. Advances in Neural Information Processing Systems , volume=

  70. [70]

    arXiv preprint arXiv:2603.10848 , year=

    V\_ \ 0.5 \ : Generalist Value Model as a Prior for Sparse RL Rollouts , author=. arXiv preprint arXiv:2603.10848 , year=

  71. [71]

    arXiv preprint arXiv:2602.01970 , year=

    Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models , author=. arXiv preprint arXiv:2602.01970 , year=

  72. [72]

    arXiv preprint arXiv:2603.24840 , year=

    Prune as you generate: Online rollout pruning for faster and better rlvr , author=. arXiv preprint arXiv:2603.24840 , year=

  73. [73]

    arXiv preprint arXiv:2602.03048 , year=

    CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs , author=. arXiv preprint arXiv:2602.03048 , year=

  74. [74]

    arXiv preprint arXiv:2511.18902 , year=

    VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL , author=. arXiv preprint arXiv:2511.18902 , year=

  75. [75]

    arXiv preprint arXiv:2602.14338 , year=

    Train less, learn more: Adaptive efficient rollout optimization for group-based reinforcement learning , author=. arXiv preprint arXiv:2602.14338 , year=

  76. [76]

    Proceedings of the national academy of sciences , volume=

    Overcoming catastrophic forgetting in neural networks , author=. Proceedings of the national academy of sciences , volume=. 2017 , publisher=

  77. [77]

    IEEE transactions on pattern analysis and machine intelligence , volume=

    Learning without forgetting , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2017 , publisher=

  78. [78]

    arXiv preprint arXiv:2502.01427 , year=

    Structural features of the fly olfactory circuit mitigate the stability-plasticity dilemma in continual learning , author=. arXiv preprint arXiv:2502.01427 , year=

  79. [79]

    Heming Zou and Yunliang Zang and Wutong Xu and Xiangyang Ji , booktitle=. Fly-. 2026 , url=

  80. [80]

    arXiv preprint arXiv:2603.19145 , year=

    Enhancing pretrained model-based continual representation learning via guided random projection , author=. arXiv preprint arXiv:2603.19145 , year=

Showing first 80 references.