Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

Abhijit Kumar; Mohit Suley; Zoey Wu

arxiv: 2605.27788 · v1 · pith:IWOZDCNHnew · submitted 2026-05-27 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-29 13:47 UTC · model grok-4.3

keywords: reinforcement learning · tool use · credit assignment · large language models · segment-level rewards · competence estimation

The pith

CARL assigns credit to segments at tool-use boundaries so LLMs learn when their parametric knowledge suffices versus when to call tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CARL, which trains a critic on the model's own rollouts to estimate competence at each tool-use boundary. Rollouts are split at natural delimiters such as code fences and context-block transitions, and each segment receives an advantage derived from the single final binary outcome. The resulting updates carry appropriate signs: they discourage erroneous or unnecessary tool calls and reward helpful ones. On five benchmarks spanning arithmetic, multi-hop QA, and financial-table reasoning, the approach raises exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, while cutting tool calls on parametrically solvable questions by 53 percent. Gains are largest for the smaller model, consistent with the claim that explicit competence signals matter more when parametric memory is limited.

Core claim

CARL decomposes each rollout at natural tool-use boundaries and trains a critic that assigns independent advantages to segments from a single binary outcome reward, allowing the policy to learn both when to invoke tools and when to rely on internal parameters without step-level supervision.
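
To make the decomposition concrete, here is a minimal Python sketch of splitting a rollout at code-fence and `<context>` boundaries. The function name `split_rollout`, the regular expression, and the rule that maps boundaries to the invoke / assimilate / commit labels are illustrative assumptions drawn from the Figure 1 caption, not the authors' implementation. In particular, the sketch ignores the step where the rollout loop executes the code and injects stdout.

```python
import re

FENCE = "`" * 3  # code-fence delimiter, built indirectly to keep this listing well-formed

# Assumed boundary patterns: a fenced code block (tool call) or a <context> block.
BOUNDARY = re.compile(
    rf"(?P<invoke>{FENCE}.*?{FENCE})|(?P<assimilate><context>.*?</context>)",
    re.DOTALL,
)


def split_rollout(text):
    """Split a rollout into (segment_type, text) pairs at tool-use boundaries."""
    segments, pos = [], 0
    for match in BOUNDARY.finditer(text):
        # Everything up to and including the delimiter belongs to that segment.
        segments.append((match.lastgroup, text[pos:match.end()]))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        # Whatever follows the last boundary is treated as the final answer.
        segments.append(("commit", tail))
    return segments


rollout = (
    "I should compute this.\n"
    + FENCE + "python\nprint(347 * 28)\n" + FENCE + "\n"
    + "<context>The product is 9716.</context>\n"
    + "The answer is \\boxed{9716}."
)
kinds = [kind for kind, _ in split_rollout(rollout)]
# kinds == ["invoke", "assimilate", "commit"]
```

A rollout with no tool call yields a single commit segment, which is the case the method wants the policy to prefer on parametrically solvable questions.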

What carries the argument

A competence-aware critic that produces segment-level advantages by evaluating each delimited portion of a rollout against the final binary outcome.
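
A rough sketch of how such segment-level advantages could be computed, assuming the advantage of a segment is the change in the critic's value across it, with the binary outcome arriving only at the end. The function name, the zero terminal value, and the default of one step per segment are assumptions made for illustration; the paper's exact estimator may differ.

```python
def segment_advantages(boundary_values, outcome, gamma=1.0, segment_lengths=None):
    """Per-segment TD-style advantages from critic values and one binary outcome.

    boundary_values[k] is the critic's estimate V(s_k) at the start of segment k.
    outcome is the single binary reward (0 or 1) observed at the end of the rollout.
    """
    n = len(boundary_values)
    if segment_lengths is None:
        segment_lengths = [1] * n  # assumption: treat each segment as one step
    advantages = []
    for k in range(n):
        last = k == n - 1
        reward = float(outcome) if last else 0.0  # reward arrives only at the end
        next_value = 0.0 if last else boundary_values[k + 1]  # assumed terminal value of zero
        discount = gamma ** segment_lengths[k]
        advantages.append(reward + discount * next_value - boundary_values[k])
    return advantages


# Hypothetical critic values for a three-segment rollout that ends correctly.
advantages = segment_advantages([0.7, 0.4, 0.8], outcome=1)
# Roughly [-0.3, 0.4, 0.2]: the first segment lowered the critic's estimate,
# the second raised it, and the final answer closed the remaining gap.
```

With `gamma = 1` the advantages telescope, so they sum to the outcome minus V(s0). That is why opposing signs can appear inside one successful trajectory.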

If this is right

The model issues 53 percent fewer tool calls on questions it can answer from parameters alone, while remaining about 10 exact-match points more accurate on those questions.
Exact-match gains reach +8.3 points at 7B and +9.0 points at 3B on the Musique multi-hop QA benchmark.
The learned critic separates parametrically solvable from tool-dependent questions with AUC 0.93 at the 7B scale.
The improvement at 3B is about 1.4 times the improvement at 7B (9.7 versus 6.7 exact-match points), suggesting the method compensates for smaller parametric capacity.
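
For readers who want to check the two derived quantities in this list, the sketch below computes an AUC from critic scores by pairwise comparison and the ratio of the two headline gains. The score lists are made-up placeholders, not data from the paper; only 9.7 and 6.7 come from the abstract.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Placeholder V(s0) values: Tier 2 (parametrically solvable) should score higher.
tier2_scores = [0.9, 0.8, 0.6]
tier1_scores = [0.7, 0.3, 0.2]
separation = auc(tier2_scores, tier1_scores)  # 8 of 9 pairs ordered correctly

# Ratio of the reported gains over the best RL baseline (3B versus 7B).
gain_ratio = 9.7 / 6.7  # about 1.45, reported as 1.4x
```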

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same boundary-based decomposition could be tested on other structured generation tasks that contain clear delimiters, such as function calling or code generation.
Reducing unnecessary tool invocations may lower inference latency and external API costs in deployed systems.
The critic's competence signal might be reused at inference time as an explicit uncertainty estimate to decide tool use without further training.

Load-bearing premise

Splitting rollouts at natural tool-use boundaries supplies independent segments whose credit can be estimated from one overall success signal.

What would settle it

An ablation that replaces the natural tool-use boundaries with random cuts and obtains the same accuracy and tool-use reductions would falsify the necessity of those boundaries.
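
One way such a control could be set up, sketched under the assumption that the random segmentation should keep the same number of segments per rollout so that only the placement of boundaries differs. The helper name `random_cuts` and the token-index representation are hypothetical.

```python
import random


def random_cuts(num_tokens, num_segments, rng):
    """Pick num_segments - 1 distinct interior cut points, returned as (start, end) spans."""
    cuts = sorted(rng.sample(range(1, num_tokens), num_segments - 1))
    edges = [0] + cuts + [num_tokens]
    return list(zip(edges[:-1], edges[1:]))


# A 120-token rollout that the natural segmentation would split into three segments.
spans = random_cuts(num_tokens=120, num_segments=3, rng=random.Random(0))
```

The spans are contiguous and non-empty, so the same critic and advantage computation can be run unchanged on the randomly cut rollouts.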

Figures

Figures reproduced from arXiv: 2605.27788 by Abhijit Kumar, Mohit Suley, Zoey Wu.

**Figure 1.** Rollout pipeline in CARL. Each tool-use trajectory decomposes into three segment types (invoke, assimilate, commit) with structurally observable boundaries. After invoke, the rollout loop executes the code and captures stdout; after assimilate, it replaces the raw stdout with the model’s <context> block. The critic Vϕ evaluates the context at each boundary; the advantage of each segment is the change in Vϕ…

**Figure 2.** V(s0) calibration drives faster training and emergent selectivity. (a) Calibration of V(s0) under three warm-up regimes (Appendix H). (b) Held-out accuracy across 500 PPO steps (4-dataset average). CARL surpasses Search-R1 PPO at step ∼100 (7B) / ∼150 (3B). (c) Tier 2 tool-use rate. CARL peaks then settles (∼46% at 7B); Search-R1 PPO climbs monotonically toward 84–88%.

**Figure 3.** Qualitative analysis of the CARL critic on a compositional query. The same prompt elicits distinct critic behaviour at each scale (V(s0): 3B = 0.32, 7B = 0.71). (i) A search raises V at 3B but not 7B, which already knows the answer. (ii) Faithful extraction raises V; a unit-error extraction lowers it, showing opposing signs within one trajectory. (iii) At 7B, calculator use earns V ↑ while mental math ea…

**Figure 4.** Distribution of V(s0) for Tier 1 (tool-dependent) and Tier 2 (parametrically solvable) questions after full warm-up. The two distributions are well-separated at both scales (AUC 0.93 at 7B, 0.85 at 3B), confirming that the critic learns to distinguish questions requiring tool use from those the model can answer on its own.

Original abstract

Humans know when to reach for help, e.g., $347 \times 28$ warrants a calculator while $2+2$ does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit cannot isolate which tool call in a successful episode actually helped, nor penalize unnecessary calls. We propose CARL (Competence-Aware Reinforcement Learning), which trains a critic on the model's own rollouts to learn where parametric knowledge suffices and where it needs external help. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model's domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on Musique. The model issues 53% fewer tool calls on parametrically answerable questions while remaining $\sim 10$ EM points more accurate on them. Gains are largest at small scale: the 3B improvement is $1.4\times$ the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CARL shows segment-level credit assignment via a rollout-trained critic can cut unnecessary tool calls while lifting accuracy, with larger relative gains on smaller models, but the independence of those segments from one binary reward is the part that needs the most scrutiny.

The core idea is training a critic on the model's own trajectories so it can assign signed advantages to segments split at code fences and context transitions, all from a single binary outcome per rollout. This avoids needing step-level labels or external judges for tool-use decisions.

The results are the strongest part. On the five benchmarks the abstract reports a 6.7-point EM gain at 7B and a 9.7-point gain at 3B over the best RL baseline, so the 3B gain is about 1.4 times the 7B gain. The 53% drop in tool calls on parametrically solvable questions, alongside roughly 10 EM points higher accuracy on those same questions, is a practical win, and the critic reaching 0.93 AUC at 7B on separating solvable from tool-dependent cases is a clean side benefit.

The method itself is a straightforward extension of advantage estimation to natural boundaries rather than a wholly new algorithm. That keeps it simple and reproducible in principle.

The soft spot is exactly the load-bearing premise flagged above. Decomposing at those boundaries assumes the segments are additively separable enough that one binary reward can disentangle credit without confounding from sequential effects: an early unnecessary call can change what later segments see. The abstract gives no ablations on alternative segmentations or checks for dependence, so it is hard to tell how much the reported gains rely on the specific choice of boundaries versus the critic learning something more general. If the full paper has those controls or shows the advantages remain stable under different splits, the claim strengthens; without them the numbers could partly reflect lucky segmentation.

This is worth sending to referees. People working on agentic systems or efficient tool use at modest scale will want to see the details, and the empirical pattern is clear enough to justify the time even if revisions are needed on the credit-separation point.

Referee Report

2 major / 1 minor

Summary. The paper proposes CARL (Competence-Aware Reinforcement Learning), which trains a critic on an LLM's own rollouts to perform segment-level credit assignment for tool use. By decomposing trajectories at natural boundaries such as code fences and context transitions, the method assigns independent signed advantages to each segment from a single binary trajectory outcome, without step-level labels or external judges. This is claimed to penalize unnecessary or erroneous tool calls while rewarding necessary ones. Experiments across five benchmarks (arithmetic, multi-hop QA, financial tables) report exact-match gains of 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, a 53% reduction in tool calls on parametrically solvable questions, and a critic AUC of 0.93 for distinguishing competence.

Significance. If the central claim holds, CARL offers a practical way to improve both accuracy and efficiency in tool-augmented LLMs by learning the boundary of parametric knowledge, with larger relative benefits at smaller scales. The use of the model's own rollouts for critic training and the reported scale-dependent gains are notable strengths. The approach could influence credit-assignment techniques in agentic RL more broadly if the segment independence assumption is validated.

major comments (2)

[method / abstract] The core assumption that natural-boundary decomposition yields independent segment credits from one binary reward (abstract and method description) is load-bearing for both the accuracy gains and the 53% tool-call reduction. Sequential dependencies (e.g., an early erroneous call changing context for later segments) could entangle advantages; the manuscript should supply either a formal separability argument or controlled ablations demonstrating that the critic recovers per-segment competence without confounding from policy-induced segment distributions.
[experiments / results] Table or results section reporting the 6.7 / 9.7 EM gains and 53% reduction: the abstract states concrete improvements over RL baselines but provides no error bars, data-split details, or ablation isolating the segment-level critic from standard RL components. This makes it impossible to confirm the gains are attributable to the proposed credit assignment rather than unstated implementation choices.

minor comments (1)

[method] Notation for the critic (e.g., how segment advantages are computed from the binary outcome) should be made fully explicit with an equation, as the current description leaves the exact form of the advantage estimator ambiguous.
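
For concreteness, one explicit form the authors could state, assembled from the SMDP TD error quoted in the appendix excerpt and the Figure 1 caption. The zero terminal value, the placement of the whole reward on the last of $K$ segments, and the reading of $\tau_k$ as the duration of segment $k$ are assumptions of this sketch, not statements from the paper.

```latex
\[
A_k \;=\; r_k + \gamma^{\tau_k}\, V_\phi(s_{k+1}) - V_\phi(s_k),
\qquad
r_k =
\begin{cases}
R \in \{0, 1\} & \text{if } k = K - 1, \\
0 & \text{otherwise},
\end{cases}
\qquad
V_\phi(s_K) = 0 .
\]
```

Stating the estimator at this level of detail would also force the manuscript to say how $\gamma$ and $\tau_k$ are set, which the abstract leaves open.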

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential broader impact of segment-level credit assignment. We address each major comment below and outline revisions that will strengthen the manuscript.

Referee: [method / abstract] The core assumption that natural-boundary decomposition yields independent segment credits from one binary reward (abstract and method description) is load-bearing for both the accuracy gains and the 53% tool-call reduction. Sequential dependencies (e.g., an early erroneous call changing context for later segments) could entangle advantages; the manuscript should supply either a formal separability argument or controlled ablations demonstrating that the critic recovers per-segment competence without confounding from policy-induced segment distributions.

Authors: We agree that the independence assumption is central and that sequential dependencies could in principle entangle advantages. The manuscript motivates natural boundaries (code fences, context transitions) precisely because they align with points where the policy's information state changes, but we lack a formal separability proof. In revision we will add (1) a short discussion of the assumption and its potential violations, and (2) a controlled ablation that fixes the policy and resamples segment distributions to isolate whether the critic recovers per-segment competence independent of policy-induced correlations. These additions will directly address the concern about confounding. revision: yes
Referee: [experiments / results] Table or results section reporting the 6.7 / 9.7 EM gains and 53% reduction: the abstract states concrete improvements over RL baselines but provides no error bars, data-split details, or ablation isolating the segment-level critic from standard RL components. This makes it impossible to confirm the gains are attributable to the proposed credit assignment rather than unstated implementation choices.

Authors: We acknowledge that the reported point estimates lack error bars, explicit split details, and a dedicated ablation isolating the critic. The current numbers come from single runs on fixed splits; we will revise the results section to report means and standard deviations over three random seeds, document the exact train/validation/test partitions, and add an ablation that compares CARL against a trajectory-level RL baseline sharing all other implementation choices (optimizer, reward scaling, rollout length) except the segment-level critic. This will make attribution clearer. revision: yes
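
The promised reporting over three seeds amounts to a small calculation, sketched here with Python's standard library. The scores are placeholders chosen for illustration and are not results from the paper.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder exact-match scores for three seeds; NOT results from the paper.
mean_em, std_em = summarize([40.0, 42.0, 44.0])
```

With only three seeds the sample standard deviation is a coarse estimate, so reporting the per-seed values alongside it would be more informative.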

Circularity Check

0 steps flagged

No significant circularity

The paper's derivation relies on standard RL credit assignment applied to segments defined at natural boundaries (code fences, context transitions) within the model's own generated rollouts, with a critic trained to produce per-segment advantages from a single binary trajectory reward. No equations or claims reduce a reported prediction or result to a fitted parameter by construction, nor does any load-bearing premise rest on a self-citation chain whose validity is internal to the present work. The empirical gains (EM improvements, tool-call reduction) are presented as outcomes of this procedure on external benchmarks rather than tautological re-expressions of inputs, making the chain self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that natural boundaries exist and suffice for segmentation, plus the introduction of a competence critic whose only reported support is the AUC number in the abstract.

axioms (1)

Domain assumption: Natural tool-use boundaries exist in rollouts and can be used to segment trajectories for credit assignment.
The method depends on identifying these boundaries like code fences to decompose the rollout.

invented entities (1)

Competence-aware critic (no independent evidence)
Purpose: to learn where parametric knowledge suffices versus needing external help.
The critic is trained on the model's rollouts but no external validation or independent evidence is provided beyond the reported AUC 0.93.

pith-pipeline@v0.9.1-grok · 5899 in / 1259 out tokens · 71635 ms · 2026-06-29T13:47:30.266215+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 14 canonical work pages · 7 internal anchors

[1] Arjona-Medina, J.A., et al. (2019). RUDDER: Return Decomposition for Delayed Rewards. NeurIPS.
[2] Kumar, A., Kumar, N., & Gupta, S. (2026). Execution-Grounded Credit Assignment for GRPO in Code Generation. arXiv:2603.16158.
[3] Taparia, A., et al. (2026). ARC: Learning to Configure Agentic AI Systems. arXiv:2602.11574.
[4] Zhang, Y. (2025). Agent-as-Tool: A Study on the Hierarchical Decision Making with Reinforcement Learning. arXiv:2507.01489.
[5] Wang, H., Xu, Q., Liu, C., Wu, J., Lin, F., & Chen, W. (2025). HICRA: Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning. arXiv:2509.03646.
[6] Jin, B., et al. (2025). Search-R1: Training LLMs to Reason and Leverage Search Engines with RL. COLM 2025.
[7] Li, X., Zou, H., & Liu, P. (2025). ToRL: Scaling Tool-Integrated RL. arXiv:2503.23383.
[8] Ng, A.Y., Harada, D., & Russell, S. (1999). Policy Invariance Under Reward Transformations. ICML.
[9] Qian, C., et al. (2025). ToolRL: Reward is All Tool Learning Needs. NeurIPS 2025.
[10] Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR.
[11] Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
[12] Sutton, R.S., Precup, D., & Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in RL. Artificial Intelligence 112, 181-211.
[13] Setlur, A., et al. (2025). Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. ICLR 2025.
[14] Guo, Y., Xu, L., Liu, J., Ye, D., & Qiu, S. (2025). Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models. NeurIPS 2025.
[15] Kazemnejad, A., et al. (2024). VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment. arXiv:2410.01679.
[16] Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.
[17] Yang, Z., et al. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP.
[18] Ho, X.N., Duong Nguyen, A.K., Sugawara, S., & Aizawa, A. (2020). Constructing A Multi-Hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. COLING.
[19] Chen, Z., et al. (2021). FinQA: A Dataset of Numerical Reasoning over Financial Data. EMNLP.
[20] Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2022). MuSiQue: Multihop Questions via Single Hop Question Composition. TACL.
[21] Chen, J., et al. (2025). ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. arXiv:2503.19470.
[22] Li, Y., et al. (2025). R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning. arXiv:2505.23794.
[23] Zheng, R., Dou, S., Gao, S., Hua, Y., Shen, W., et al. (2023). Secrets of RLHF in Large Language Models Part I: PPO. arXiv:2307.04964.
[24] Hu, J., et al. (2024). OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. arXiv:2405.11143.
[25] Yu, Q., et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System. arXiv:2503.14476.
[26] Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
[27] Yu, Y., et al. (2024). StepTool: A Step-Level Reinforcement Learning Method for Tool Learning. arXiv:2410.07745.

Appendix A, SMDP Derivation Details (excerpt). From the SMDP Bellman equation. By Theorem 1 of Sutton et al. [12], our setting is an SMDP over segment-level decision points, with per-segment TD error

$$\delta^{\mathrm{SMDP}}_k = r_k + \gamma^{\tau_k} V(s_{k+1}) - V(s_k), \tag{4}$$

where $r_k$ is the cumulat...
[28] Tier 1 / Tier 2 labeling via multi-rollout consistency. A question is Tier 2 (within parametric competence) if the base model answers it correctly in at least one of five no-tool rollouts, and Tier 1 (beyond parametric competence) if all five rollouts produce incorrect answers. This labeling directly measures the model’s competence boundary: Tier 2 questio...
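
The labeling rule in this excerpt is simple enough to state as code. A minimal sketch, assuming each no-tool rollout has already been graded as correct or incorrect; the function name `label_tier` is ours.

```python
def label_tier(no_tool_correct):
    """Tier 2 if any no-tool rollout is correct, Tier 1 if all are incorrect."""
    if len(no_tool_correct) != 5:
        raise ValueError("expected five no-tool rollouts")
    return 2 if any(no_tool_correct) else 1


solvable = label_tier([False, True, False, False, False])   # Tier 2
tool_needed = label_tier([False] * 5)                       # Tier 1
```

Note that one correct rollout out of five is enough for Tier 2, so the label marks what the base model can sometimes answer, not what it answers reliably.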
[29] Controlled tool-use rollouts. Each question is also rolled out under two system prompts: (a) forced tool use, (b) no tools allowed. Combined with the Tier labels, this produces four outcome buckets per question, each teaching a different signal: Tier 2 no-tool (anchors V(s0) ≈ 1), Tier 2 forced-tool (teaches that unnecessary tools add risk), Tier 1 no-tool...
[30] Mixed retrieval quality. Roughly 70% BM25 (matching PPO training) and 30% rollouts from a high-quality internal search API give Vϕ contrast between helpful and less-helpful tool outputs without changing the inference-time retriever.
[31] Topic diversity via embedding clustering. Questions are embedded with a sentence-transformer (all-MiniLM-L6-v2) and clustered with k-means; warm-up samples are drawn uniformly across clusters to prevent Vϕ from learning surface topical cues.
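
A sketch of what drawing uniformly across clusters could look like once cluster assignments exist. The embedding and k-means steps are left out to keep the example dependency-free, `cluster_ids` stands in for their output, and sampling with replacement is an assumption of this sketch.

```python
import random
from collections import defaultdict


def sample_across_clusters(cluster_ids, num_samples, rng):
    """Draw question indices so that each cluster is equally likely to be picked."""
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(index)
    clusters = sorted(by_cluster)
    picks = []
    for _ in range(num_samples):
        cluster = rng.choice(clusters)                 # uniform over clusters, not questions
        picks.append(rng.choice(by_cluster[cluster]))  # then uniform within the cluster
    return picks


# Hypothetical assignment: 98 questions in cluster 0 and only 2 in cluster 1.
cluster_ids = [0] * 98 + [1] * 2
picks = sample_across_clusters(cluster_ids, num_samples=1000, rng=random.Random(0))
```

Sampling questions directly would pick the small cluster about 2% of the time; sampling clusters first raises that to about half, which is the effect the excerpt describes.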
[32] Multi-hop exposure. Multi-hop rollouts from 2WikiMQA and Musique are included so that Vϕ has seen compound-state boundaries before PPO. Verification gate. Before starting PPO, three checks must pass on a held-out subset. Minimum thresholds (gate): (i) V(s0) separates Tier 1 from Tier 2 questions (AUC ≥ 0.70), (ii) Vϕ shows correct sign behavior after retrie...
[33] No-tool direct answer: “Answer the following question directly. Do not use any tools or code. Provide your answer inside \boxed{}.”
[34] Forced tool use: “You must use a Python code block to help answer this question. Use search(query) for retrieval or write arithmetic code. After seeing tool output, write a <context> block extracting the relevant information, then provide your answer inside \boxed{}.”
[35] Optional tool use (Tier 2): Same as forced tool use but with “You may optionally use” replacing “You must use.”
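
Because the optional variant is defined as a one-phrase edit of the forced prompt, it can be derived rather than maintained separately. A small sketch using the prompt text quoted in these excerpts; the constant names are ours.

```python
FORCED_TOOL_PROMPT = (
    "You must use a Python code block to help answer this question. "
    "Use search(query) for retrieval or write arithmetic code. "
    "After seeing tool output, write a <context> block extracting the relevant "
    "information, then provide your answer inside \\boxed{}."
)

# Optional variant: identical except for the opening phrase, as the excerpt describes.
OPTIONAL_TOOL_PROMPT = FORCED_TOOL_PROMPT.replace(
    "You must use", "You may optionally use", 1
)
```

Deriving one prompt from the other keeps the two conditions identical apart from the permission wording, so any behavioural difference can be attributed to that phrase.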
[36] Multi-hop forced tool use: Same as (2) but with an additional instruction: “This question may require multiple search steps. You may call tools more than once.” Embedding clustering. Questions are embedded with all-MiniLM-L6-v2 (384-dimensional embeddings). We apply k-means with k = 50 clusters for HotpotQA and 2WikiMQA, k = 30 for GSM8K (smaller and more hom...
