pith. sign in

arxiv: 2605.27788 · v1 · pith:IWOZDCNHnew · submitted 2026-05-27 · 💻 cs.LG · cs.CL

Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

Pith reviewed 2026-06-29 13:47 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords reinforcement learningtool usecredit assignmentlarge language modelssegment-level rewardscompetence estimation
0
0 comments X

The pith

CARL assigns credit to segments at tool-use boundaries so LLMs learn when their parametric knowledge suffices versus when to call tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CARL, which trains a critic on the model's own rollouts to evaluate competence at each tool-use boundary. Rollouts are split at natural delimiters such as code fences, and each segment receives an advantage derived from the single final binary outcome. This produces appropriately signed updates that discourage erroneous or unnecessary tool calls while rewarding helpful ones. On arithmetic, multi-hop QA, and financial-table benchmarks the approach raises exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over prior RL methods, while cutting tool calls on parametrically solvable questions by 53 percent. Gains are largest for the smaller model, consistent with the claim that explicit competence signals matter more when parametric memory is limited.

Core claim

CARL decomposes each rollout at natural tool-use boundaries and trains a critic that assigns independent advantages to segments from a single binary outcome reward, allowing the policy to learn both when to invoke tools and when to rely on internal parameters without step-level supervision.

What carries the argument

A competence-aware critic that produces segment-level advantages by evaluating each delimited portion of a rollout against the final binary outcome.

If this is right

  • The model issues 53 percent fewer tool calls on questions it can answer from parameters alone while remaining about 10 exact-match points more accurate.
  • Exact-match gains reach +8.3 points at 7B and +9.0 points at 3B on the Musique multi-hop QA benchmark.
  • The learned critic separates parametrically solvable from tool-dependent questions with AUC 0.93 at the 7B scale.
  • Relative improvement is 1.4 times larger at 3B than at 7B, indicating the method compensates for smaller parametric capacity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boundary-based decomposition could be tested on other structured generation tasks that contain clear delimiters, such as function calling or code generation.
  • Reducing unnecessary tool invocations may lower inference latency and external API costs in deployed systems.
  • The critic's competence signal might be reused at inference time as an explicit uncertainty estimate to decide tool use without further training.

Load-bearing premise

Splitting rollouts at natural tool-use boundaries supplies independent segments whose credit can be estimated from one overall success signal.

What would settle it

An ablation that replaces the natural tool-use boundaries with random cuts and obtains the same accuracy and tool-use reductions would falsify the necessity of those boundaries.

Figures

Figures reproduced from arXiv: 2605.27788 by Abhijit Kumar, Mohit Suley, Zoey Wu.

Figure 1
Figure 1. Figure 1: Rollout pipeline in CARL. Each tool-use trajectory decomposes into three segment types (invoke, assimilate, commit) with structurally observable boundaries. After invoke, the rollout loop executes the code and captures stdout; after assimilate, it replaces the raw stdout with the model’s <context> block. The critic Vϕ evaluates the context at each boundary; the advantage of each segment is the change in Vϕ… view at source ↗
Figure 2
Figure 2. Figure 2: V (s0) calibration drives faster training and emergent selectivity. (a) Calibration of V (s0) under three warm-up regimes (Appendix H). (b) Held-out accuracy across 500 PPO steps (4-dataset average). CARL surpasses Search-R1 PPO at step ∼100 (7B) / ∼150 (3B). (c) Tier 2 tool-use rate. CARL peaks then settles (∼46% at 7B); Search-R1 PPO climbs monotonically toward 84–88%. not transfer: it regresses on FinQA… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative analysis of the CARL critic on a compositional query. The same prompt elicits distinct critic behaviour at each scale (V (s0): 3B = 0.32, 7B = 0.71). (i) A search raises V at 3B but not 7B, which already knows the answer. (ii) Faithful extraction raises V ; a unit-error extraction lowers it, showing opposing signs within one trajectory. (iii) At 7B, calculator use earns V ↑ while mental math ea… view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of V (s0) for Tier 1 (tool-dependent) and Tier 2 (parametrically solvable) questions after full warm-up. The two distributions are well-separated at both scales (AUC 0.93 at 7B, 0.85 at 3B), confirming that the critic learns to distinguish questions requiring tool use from those the model can answer on its own [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗
read the original abstract

Humans know when to reach for help e.g. $347 \times 28$ warrants a calculator while $2+2$ does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit cannot isolate which tool call in a successful episode actually helped, nor penalize unnecessary calls. We propose \textbf{CARL} (\textbf{C}ompetence-\textbf{A}ware \textbf{R}einforcement \textbf{L}earning), which trains a critic on the model's own rollouts to learn where parametric knowledge suffices and where it needs external help. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model's domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on Musique. The model issues 53\% fewer tool calls on parametrically answerable questions while remaining ${\sim}10$ EM points more accurate on them. Gains are largest at small scale: the 3B improvement is $1.4\times$ the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CARL (Competence-Aware Reinforcement Learning), which trains a critic on an LLM's own rollouts to perform segment-level credit assignment for tool use. By decomposing trajectories at natural boundaries such as code fences and context transitions, the method assigns independent signed advantages to each segment from a single binary trajectory outcome, without step-level labels or external judges. This is claimed to penalize unnecessary or erroneous tool calls while rewarding necessary ones. Experiments across five benchmarks (arithmetic, multi-hop QA, financial tables) report exact-match gains of 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, a 53% reduction in tool calls on parametrically solvable questions, and a critic AUC of 0.93 for distinguishing competence.

Significance. If the central claim holds, CARL offers a practical way to improve both accuracy and efficiency in tool-augmented LLMs by learning the boundary of parametric knowledge, with larger relative benefits at smaller scales. The use of the model's own rollouts for critic training and the reported scale-dependent gains are notable strengths. The approach could influence credit-assignment techniques in agentic RL more broadly if the segment independence assumption is validated.

major comments (2)
  1. [method / abstract] The core assumption that natural-boundary decomposition yields independent segment credits from one binary reward (abstract and method description) is load-bearing for both the accuracy gains and the 53% tool-call reduction. Sequential dependencies (e.g., an early erroneous call changing context for later segments) could entangle advantages; the manuscript should supply either a formal separability argument or controlled ablations demonstrating that the critic recovers per-segment competence without confounding from policy-induced segment distributions.
  2. [experiments / results] Table or results section reporting the 6.7 / 9.7 EM gains and 53% reduction: the abstract states concrete improvements over RL baselines but provides no error bars, data-split details, or ablation isolating the segment-level critic from standard RL components. This makes it impossible to confirm the gains are attributable to the proposed credit assignment rather than unstated implementation choices.
minor comments (1)
  1. [method] Notation for the critic (e.g., how segment advantages are computed from the binary outcome) should be made fully explicit with an equation, as the current description leaves the exact form of the advantage estimator ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential broader impact of segment-level credit assignment. We address each major comment below and outline revisions that will strengthen the manuscript.

read point-by-point responses
  1. Referee: [method / abstract] The core assumption that natural-boundary decomposition yields independent segment credits from one binary reward (abstract and method description) is load-bearing for both the accuracy gains and the 53% tool-call reduction. Sequential dependencies (e.g., an early erroneous call changing context for later segments) could entangle advantages; the manuscript should supply either a formal separability argument or controlled ablations demonstrating that the critic recovers per-segment competence without confounding from policy-induced segment distributions.

    Authors: We agree that the independence assumption is central and that sequential dependencies could in principle entangle advantages. The manuscript motivates natural boundaries (code fences, context transitions) precisely because they align with points where the policy's information state changes, but we lack a formal separability proof. In revision we will add (1) a short discussion of the assumption and its potential violations, and (2) a controlled ablation that fixes the policy and resamples segment distributions to isolate whether the critic recovers per-segment competence independent of policy-induced correlations. These additions will directly address the concern about confounding. revision: yes

  2. Referee: [experiments / results] Table or results section reporting the 6.7 / 9.7 EM gains and 53% reduction: the abstract states concrete improvements over RL baselines but provides no error bars, data-split details, or ablation isolating the segment-level critic from standard RL components. This makes it impossible to confirm the gains are attributable to the proposed credit assignment rather than unstated implementation choices.

    Authors: We acknowledge that the reported point estimates lack error bars, explicit split details, and a dedicated ablation isolating the critic. The current numbers come from single runs on fixed splits; we will revise the results section to report means and standard deviations over three random seeds, document the exact train/validation/test partitions, and add an ablation that compares CARL against a trajectory-level RL baseline sharing all other implementation choices (optimizer, reward scaling, rollout length) except the segment-level critic. This will make attribution clearer. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's derivation relies on standard RL credit assignment applied to segments defined at natural boundaries (code fences, context transitions) within the model's own generated rollouts, with a critic trained to produce per-segment advantages from a single binary trajectory reward. No equations or claims reduce a reported prediction or result to a fitted parameter by construction, nor does any load-bearing premise rest on a self-citation chain whose validity is internal to the present work. The empirical gains (EM improvements, tool-call reduction) are presented as outcomes of this procedure on external benchmarks rather than tautological re-expressions of inputs, making the chain self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that natural boundaries exist and suffice for segmentation, plus the introduction of a competence critic whose only reported support is the AUC number in the abstract.

axioms (1)
  • domain assumption Natural tool-use boundaries exist in rollouts and can be used to segment trajectories for credit assignment.
    The method depends on identifying these boundaries like code fences to decompose the rollout.
invented entities (1)
  • Competence-aware critic no independent evidence
    purpose: To learn where parametric knowledge suffices versus needing external help.
    The critic is trained on the model's rollouts but no external validation or independent evidence is provided beyond the reported AUC 0.93.

pith-pipeline@v0.9.1-grok · 5899 in / 1259 out tokens · 71635 ms · 2026-06-29T13:47:30.266215+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    Arjona-Medina, J.A., et al. (2019). RUDDER: Return Decomposition for Delayed Rewards. NeurIPS

  2. [2]

    Kumar, A., Kumar, N., & Gupta, S. (2026). Execution-Grounded Credit Assignment for GRPO in Code Generation. arXiv:2603.16158

  3. [3]

    Taparia, A., et al. (2026). ARC: Learning to Configure Agentic AI Systems. arXiv:2602.11574

  4. [4]

    Zhang, Y . (2025). Agent-as-Tool: A Study on the Hierarchical Decision Making with Reinforce- ment Learning. arXiv:2507.01489

  5. [5]

    Wang, H., Xu, Q., Liu, C., Wu, J., Lin, F., & Chen, W. (2025). HICRA: Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning. arXiv:2509.03646

  6. [6]

    Jin, B., et al. (2025). Search-R1: Training LLMs to Reason and Leverage Search Engines with RL. COLM 2025

  7. [7]

    Li, X., Zou, H., & Liu, P. (2025). ToRL: Scaling Tool-Integrated RL. arXiv:2503.23383

  8. [8]

    Ng, A.Y ., Harada, D., & Russell, S. (1999). Policy Invariance Under Reward Transformations. ICML

  9. [9]

    Qian, C., et al. (2025). ToolRL: Reward is All Tool Learning Needs. NeurIPS 2025

  10. [10]

    Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR

  11. [11]

    Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347

  12. [12]

    Sutton, R.S., Precup, D., & Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in RL. Artificial Intelligence 112, 181-211

  13. [13]

    Setlur, A., et al. (2025). Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. ICLR 2025

  14. [14]

    Guo, Y ., Xu, L., Liu, J., Ye, D., & Qiu, S. (2025). Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models. NeurIPS 2025

  15. [15]

    Kazemnejad, A., et al. (2024). VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment. arXiv:2410.01679

  16. [16]

    Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168

  17. [17]

    Yang, Z., et al. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP

  18. [18]

    Ho, X.N., Duong Nguyen, A.K., Sugawara, S., & Aizawa, A. (2020). Constructing A Multi-Hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. COLING

  19. [19]

    Chen, Z., et al. (2021). FinQA: A Dataset of Numerical Reasoning over Financial Data. EMNLP

  20. [20]

    Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2022). MuSiQue: Multihop Questions via Single Hop Question Composition. TACL

  21. [21]

    Chen, J., et al. (2025). ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. arXiv:2503.19470

  22. [22]

    Li, Y ., et al. (2025). R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning. arXiv:2505.23794

  23. [23]

    Zheng, R., Dou, S., Gao, S., Hua, Y ., Shen, W., et al. (2023). Secrets of RLHF in Large Language Models Part I: PPO. arXiv:2307.04964

  24. [24]

    Hu, J., et al. (2024). OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. arXiv:2405.11143. 10

  25. [25]

    Yu, Q., et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System. arXiv:2503.14476

  26. [26]

    Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023

  27. [27]

    Yu, Y ., et al. (2024). StepTool: A Step-Level Reinforcement Learning Method for Tool Learning. arXiv:2410.07745. 11 A SMDP Derivation Details From the SMDP Bellman equation.By Theorem 1 of Sutton et al. (12), our setting is an SMDP over segment-level decision points, with per-segment TD error δSMDP k =r k +γ τk V(s k+1)−V(s k),(4) where rk is the cumulat...

  28. [28]

    Tier 1 / Tier 2 labeling via multi-rollout consistency.A question is Tier 2 (within parametric competence) if the base model answers it correctly in at least one of five no-tool rollouts, and Tier 1 (beyond parametric competence) if all five rollouts produce incorrect answers. This labeling directly measures the model’s competence boundary: Tier 2 questio...

  29. [29]

    Controlled tool-use rollouts.Each question is also rolled out under two system prompts: (a) forced tool use, (b) no tools allowed. Combined with the Tier labels, this produces four outcome buckets per question, each teaching a different signal: Tier 2 no-tool (anchors V(s 0)≈1 ), Tier 2 forced-tool (teaches that unnecessary tools add risk), Tier 1 no-tool...

  30. [30]

    Mixed retrieval quality.Roughly 70% BM25 (matching PPO training) and 30% rollouts from a high-quality internal search API give Vϕ contrast between helpful and less-helpful tool outputs without changing the inference-time retriever

  31. [31]

    Topic diversity via embedding clustering.Questions are embedded with a sentence-transformer (all-MiniLM-L6-v2) and clustered with k-means; warm-up samples are drawn uniformly across clusters to preventV ϕ from learning surface topical cues

  32. [32]

    Verification gate.Before starting PPO, three checks must pass on a held-out subset

    Multi-hop exposure.Multi-hop rollouts from 2WikiMQA and Musique are included so that Vϕ has seen compound-state boundaries before PPO. Verification gate.Before starting PPO, three checks must pass on a held-out subset. Minimum thresholds (gate):(i) V(s 0) separates Tier 1 from Tier 2 questions (AUC ≥0.70 ), (ii) Vϕ shows correct sign behavior after retrie...

  33. [33]

    Answer the following question directly. Do not use any tools or code. Provide your answer inside \boxed{}

    No-tool direct answer:“Answer the following question directly. Do not use any tools or code. Provide your answer inside \boxed{}.”

  34. [34]

    Use search(query) for retrieval or write arithmetic code

    Forced tool use:“You must use a Python code block to help answer this question. Use search(query) for retrieval or write arithmetic code. After seeing tool output, write a <context> block extracting the relevant information, then provide your answer inside \boxed{}.”

  35. [35]

    You may optionally use

    Optional tool use (Tier 2):Same as forced tool use but with “You may optionally use” replacing “You must use.”

  36. [36]

    This question may require multiple search steps. You may call tools more than once

    Multi-hop forced tool use:Same as (2) but with an additional instruction: “This question may require multiple search steps. You may call tools more than once.” Embedding clustering.Questions are embedded with all-MiniLM-L6-v2 (384-dimensional embeddings). We apply k-means with k= 50 clusters for HotpotQA and 2WikiMQA, k= 30 for GSM8K (smaller and more hom...