pith. machine review for the scientific record.

arxiv: 2604.14877 · v1 · submitted 2026-04-16 · 💻 cs.LG

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

Pith reviewed 2026-05-10 11:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · LLM agents · tool use · capability boundary · pass@k · compositional tasks · agentic reasoning · sequential information gathering

The pith

Reinforcement learning expands the capability boundary of LLM agents on compositional tool-use tasks, unlike on static reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether RL for LLM agents with tools genuinely increases what they can accomplish or only improves reliability. Earlier work on static reasoning found that base and RL models eventually match when given enough independent samples. For agentic tool use, where multiple rounds allow sequential strategies that resampling cannot recover, the authors show RL agents maintain higher success even at large sample budgets. They define a two-dimensional metric to separate true capability growth from efficiency gains. The result holds only for tasks that require composing information across steps; on simpler tasks RL behaves as prior studies predict.

Core claim

Tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. This expansion is specific to compositional, sequential information gathering. Under matched training data, supervised fine-tuning regresses the boundary on the same tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information.

What carries the argument

The PASS@(k,T) metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement.
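The estimator behind this metric is quoted in the paper's proof excerpt (reference anchor 23 below): PASS@(k, T) = 1 − C(n − c_T, k)/C(n, k). A minimal sketch of that formula follows; the function and variable names are ours, not the paper's code.

```python
from math import comb

def pass_at_k_T(n: int, c_T: int, k: int) -> float:
    """Unbiased hypergeometric estimator of PASS@(k, T):
    1 - C(n - c_T, k) / C(n, k), where n is the number of sampled
    trajectories per problem and c_T counts those that succeed within
    at most T interaction rounds."""
    if n - c_T < k:
        return 1.0  # every size-k subset of trajectories contains a success
    return 1.0 - comb(n - c_T, k) / comb(n, k)

# Fixing c_T, the estimate is non-decreasing in k; fixing k, it is
# non-decreasing in T, because c_T itself is non-decreasing in T.
```

For example, with n = 64 trajectories of which c_T = 16 succeed, pass_at_k_T(64, 16, 1) recovers the plain per-sample success rate of 0.25.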

If this is right

  • On compositional tool-use tasks the RL pass curve stays strictly above the base curve and the separation grows rather than shrinks with more samples.
  • Supervised fine-tuning on identical data lowers the boundary on those same tasks.
  • The RL gain is concentrated in better integration of information retrieved across rounds rather than broader exploration.
  • On simpler non-compositional tasks the RL and base curves behave as in prior static-reasoning studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The metric suggests future agent benchmarks should report families of curves rather than single points to distinguish capability from reliability.
  • If the pattern generalizes, scaling laws for agent performance may require separate terms for RL-induced strategy reweighting.
  • The finding opens the possibility that self-directed exploration during training discovers integration patterns that imitation alone cannot.

Load-bearing premise

The chosen tasks accurately represent compositional sequential information gathering and matched training data comparisons isolate self-directed exploration without other confounding differences in behavior or task distribution.

What would settle it

Observing that the RL and base-model pass curves converge at sufficiently large k on the same compositional tasks, or that supervised fine-tuning produces equivalent boundary expansion under matched data.

Figures

Figures reproduced from arXiv: 2604.14877 by Wenjing Yan, Xiaodan Shao, Xin Wang, Zhiyuan Zhai.

Figure 1
Figure 1. The full PASS@(k, T) landscape: rows are task categories, columns are models. On Category A every row is flat in T (tool unavailable). On Category B the surface saturates at T = 2. On Category C πRL's panel is uniformly warmer than πbase's at T ≥ 2, while πSFT's is cooler; the training signal shifts the entire surface. Category A (no effect). On pure mathematical reasoning, |BRL| = |Bbase| = 84, and |BRL \…
Figure 2
Figure 2. PASS@(k, Tmax) vs. sampling budget k on a log axis. Tmax = 0 for Category A and Tmax = 5 for B and C. On Category C, πbase and πRL cross near k = 4: at k = 1 πbase is slightly ahead, but as k grows πRL pulls above and the gap widens (+4 pp at k = 64), the opposite of the convergence reported by Yue et al. [2025]. πSFT sits below both. (i) Category A is flat in T; pass-curves cluster within ≤ 5 pp with no s…
Figure 3
Figure 3. Per-problem capability analysis and marginal-value diagnostics on the …
Figure 4
Figure 4. Mechanism analysis on Category C. (a) Distributions of πbase's perplexity on 200 successful πRL trajectories, split into search-query tokens and reasoning tokens. The reasoning distribution (median 2.79) is shifted substantially above the query distribution (median 1.94); RL's divergence from base is not on what to search but on how to integrate the returned paragraphs. (b) Number of unique search-query se…
Figure 5
Figure 5. The ratio ∆T /(∆T + ∆k) on a 2 × 3 grid of heatmaps (rows = Categories B and C, columns = models). Warm cells (ratio > 0.5) mean depth is the more valuable direction at that (k, T) operating point; cool cells (ratio < 0.5) mean sampling is. Cells where both ∆T and ∆k fall below 0.005 are rendered gray (“sat”) to avoid division-by-small-number artifacts. Category A is omitted because ∆T = 0 throughout…
Figure 6
Figure 6. Per-problem PASS@(64, 0) on Category A (MATH-500). The three models' columns are nearly identical; the capability-set decomposition is symmetric, with only-model and only-base counts both near three.
Figure 7
Figure 7. Per-problem PASS@(64, 5) on Category B (comparison questions). πSFT and πRL produce similar capability-set decompositions, each adding ∼ 5 new bridge problems over πbase and losing ∼ 2.
Figure 8
Figure 8. Category C pass-curves by interaction depth.
Figure 9
Figure 9. πbase's perplexity on 200 successful Category C trajectories for πSFT and πRL, split into search-query tokens and reasoning tokens. Both PPLsearch and PPLreason are higher for πSFT than for πRL; SFT has displaced the base distribution on both sides, while RL has re-weighted within it.
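The depth-vs-sampling ratio mapped in Figure 5 can be reconstructed as a short sketch. The grid orientation, the function name, and the use of NaN for the gray “sat” cells are our assumptions about the procedure, not the paper's code.

```python
import numpy as np

def depth_vs_sampling_ratio(P: np.ndarray, sat_eps: float = 0.005) -> np.ndarray:
    """Marginal-value ratio dT / (dT + dk) over a PASS grid.

    P[t, i] holds PASS@(k_i, t); rows index interaction depth T,
    columns index the (log-spaced) sampling budgets k_i.
    """
    dT = P[1:, :-1] - P[:-1, :-1]   # gain from one more interaction round
    dk = P[:-1, 1:] - P[:-1, :-1]   # gain from moving to the next budget
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = dT / (dT + dk)
    # cells where both marginal gains are tiny are "sat", as in Figure 5
    ratio[(dT < sat_eps) & (dk < sat_eps)] = np.nan
    return ratio
```

Averaging the result with np.nanmean over rows or columns would summarize whether depth or sampling dominates for a given model.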
read the original abstract

Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces PASS@(k,T), a metric jointly varying sampling budget k and interaction depth T, to distinguish capability expansion from efficiency gains in LLM agents. It claims that unlike static reasoning (where base and RL pass@k curves converge), tool-use RL on compositional sequential tasks enlarges the capability boundary: the RL agent's PASS@(k,T) curve lies above the base model's with a widening gap at large k. This effect is absent on simpler tasks; under matched data, SFT regresses the boundary, isolating self-directed exploration; mechanism analysis attributes gains to reweighting toward strategies with better downstream reasoning.

Significance. If the central empirical contrast holds, the work would be significant for clarifying when RL expands versus merely stabilizes LLM agent performance. The PASS@(k,T) framing offers a concrete way to test boundary expansion in interactive settings, the SFT control isolates exploration as causal, and the task-type specificity reconciles optimistic/pessimistic RL views. The paper ships a falsifiable prediction (non-convergence on compositional tasks) and an explicit mechanism claim, both of which are valuable even if the finite-k evidence requires strengthening.

major comments (2)
  1. [Results / PASS@(k,T) analysis] Results section on PASS@(k,T) curves for compositional tasks: the headline claim that 'the gap widens at large k rather than converging' and thereby 'genuinely enlarges the capability boundary' rests on curves observed only up to a finite maximum k. If the RL and base models share the same support of successful trajectories, the curves are required to converge as k→∞; the reported separation could be a finite-sample artifact. The manuscript should either (a) report the largest k tested and show the gap is still increasing, (b) provide an argument or extrapolation that the k→∞ limits differ, or (c) demonstrate that RL changes the support itself.
  2. [Experimental setup] Experimental setup and task definitions: the claim that the effect is 'specific to compositional, sequential information gathering' requires precise, reproducible definitions of the task suite, how 'compositional' is operationalized, and the exact training-data matching procedure between RL and SFT. Without these, it is impossible to rule out post-hoc task selection or confounding differences in model behavior as the source of the observed divergence.
minor comments (3)
  1. [Abstract and Figures] The abstract and main text should explicitly state the maximum k and T values used in the reported curves and whether error bars or multiple seeds are shown.
  2. [Methods] Notation: PASS@(k,T) is introduced as a two-dimensional metric, but the precise mathematical definition (especially how T-interaction paths are sampled and aggregated) should appear in a dedicated methods subsection before the results.
  3. [Mechanism analysis] The mechanism analysis (reweighting toward better downstream reasoning) would benefit from a quantitative breakdown, e.g., a table showing the fraction of probability mass shifted to high-performing trajectories.
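The first major comment can be made concrete with a toy calculation. If base and RL models share support, i.e. both have nonzero per-problem success counts, both pass@k estimates approach 1, so any finite-k gap must eventually close even if it first widens. A minimal sketch with invented success counts (not the paper's data):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # standard unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

n = 10_000                    # trajectories sampled per problem
c_base, c_rl = 200, 3_000     # illustrative counts: both nonzero (shared support)
for k in (1, 4, 16, 64, 256):
    gap = pass_at_k(n, c_rl, k) - pass_at_k(n, c_base, k)
    print(f"k={k:4d}  gap={gap:.3f}")
```

Here the gap grows through moderate k and then collapses toward zero, which is why finite-k widening alone cannot establish a changed capability boundary; demonstrating that RL alters the support itself (option (c) in the report) would.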

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of the PASS@(k,T) metric in distinguishing capability expansion from efficiency gains. We address each major comment below with clarifications and proposed revisions.

read point-by-point responses
  1. Referee: Results section on PASS@(k,T) curves for compositional tasks: the headline claim that 'the gap widens at large k rather than converging' rests on curves observed only up to a finite maximum k. If the RL and base models share the same support of successful trajectories, the curves are required to converge as k→∞; the reported separation could be a finite-sample artifact. The manuscript should either (a) report the largest k tested and show the gap is still increasing, (b) provide an argument or extrapolation that the k→∞ limits differ, or (c) demonstrate that RL changes the support itself.

    Authors: We agree this is a substantive point. Our experiments evaluate up to k=128 on the compositional tasks, where the gap between RL and base continues to widen without signs of convergence. To strengthen the claim, we will add (a) explicit reporting of the maximum k tested with the corresponding curves, and (c) a clearer argument that RL changes the support: self-directed exploration during RL enables new compositional trajectories (sequential tool-use strategies) that lie outside the base model's support even at large k, as evidenced by our mechanism analysis showing reweighting toward strategies with superior downstream reasoning integration. This is consistent with the non-convergence prediction for compositional tasks. We will include this in the revised Results section. revision: partial

  2. Referee: Experimental setup and task definitions: the claim that the effect is 'specific to compositional, sequential information gathering' requires precise, reproducible definitions of the task suite, how 'compositional' is operationalized, and the exact training-data matching procedure between RL and SFT. Without these, it is impossible to rule out post-hoc task selection or confounding differences in model behavior as the source of the observed divergence.

    Authors: We agree that explicit definitions are essential for reproducibility. Section 3.1 defines the task suite with concrete examples: compositional tasks require sequential, dependent tool calls (e.g., using the output of one tool as input to the next in information-gathering chains), while simpler tasks involve independent or single-step actions. 'Compositional' is operationalized via dependency graphs in the task construction. The RL-SFT data matching procedure (same trajectory pool, with SFT using supervised imitation and RL using self-exploration) is detailed in Appendix B. We will move key excerpts of these definitions into the main text and add a table summarizing task properties to eliminate any ambiguity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical curve comparison is independent of fitted inputs

full rationale

The paper defines PASS@(k,T) directly from sampling budget k and interaction depth T, then reports experimental outcomes on independently trained base, RL, and SFT models evaluated on the same tasks. No derivation step equates a claimed result to its own inputs by construction, renames a fitted quantity as a prediction, or relies on a self-citation whose content is itself unverified. The non-convergence observation and mechanism analysis are post-hoc interpretations of measured pass rates, not definitional reductions. The work is self-contained against external benchmarks and receives the default low-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the domain assumption that large-k pass rates measure capability boundaries and that the chosen tasks isolate compositional behavior; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Large-k pass rates on the evaluated tasks represent the model's true capability boundary rather than sampling artifacts.
    This underpins the interpretation that non-convergence indicates genuine expansion.

pith-pipeline@v0.9.0 · 5529 in / 1248 out tokens · 40139 ms · 2026-05-10T11:47:04.686462+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 20 canonical work pages · 14 internal anchors

  1. [1]

    Constitutional AI: Harmlessness from AI Feedback

    Bai, Y., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

  2. [2]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Brown, B., et al. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

  3. [3]

    Evaluating Large Language Models Trained on Code

    Chen, M., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

  4. [4]

    LLM Collaboration with Multi-Agent Reinforcement Learning

    Liu, S., et al. LLM collaboration with multi-agent reinforcement learning. arXiv preprint arXiv:2508.04652.

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  6. [6]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Feng, J., et al. ReTool: Reinforcement learning for strategic tool use in LLMs. arXiv preprint arXiv:2504.11536.

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  8. [8]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Jin, B., et al. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.

  9. [9]

    WebGPT: Browser-assisted question-answering with human feedback

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

  10. [10]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  11. [11]

    ART: Automatic Multi-step Reasoning and Tool-use for Large Language Models

    Paranjape, B., Lundberg, S., Singh, S., Hajishirzi, H., Zettlemoyer, L., and Ribeiro, M. T. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014.

  12. [12]

    Gorilla: Large Language Model Connected with Massive APIs

    Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.

  13. [13]

    Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

    Putta, P., et al. Agent Q: Advanced reasoning and learning for autonomous AI agents. arXiv preprint arXiv:2408.07199.

  14. [14]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  15. [15]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  16. [16]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

  17. [17]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

  18. [18]

    Solving math word problems with process- and outcome-based feedback

    Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G., and Higgins, I. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.

  19. [19]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.

  20. [20]

    GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

    Zheng, L., et al. GPT-Fathom: Benchmarking large language models to decipher the evolutionary path towards GPT-4 and beyond. arXiv preprint arXiv:2309.16583.

  21. [21]

    These works target higher benchmark scores under a single reward-weighted objective but do not decompose the observed improvements into capability vs

    12 A Related Work RL for LLM agents and tool use.End-to-end RL over multi-turn tool-use trajectories is a rapidly growing family that includes Agent-R1 [Cheng et al., 2025], ReTool [Feng et al., 2025], Agent-Q [Putta et al., 2024], Search-R1 [Jin et al., 2025], and MAGRPO [Liu et al., 2025b]. These works target higher benchmark scores under a single rewar...

  22. [22]

    Observation

    use this lens to argue that RL on verifiable rewards (RLVR) for mathematical reasoning does not enlarge the base model’s capability set: at large k, base and RL pass@k curves converge. Holistic evaluation frameworks [Liang et al., 2023, Srivastava et al., 2023] and compositional benchmarks [Trivedi et al., 2022, Khot et al., 2023] have broadened the evalu...

  23. [23]

    Write PASS@(k, T) using the unbiased hypergeometric estimator (Def

    Proof. Write PASS@(k, T) using the unbiased hypergeometric estimator (Def. 1): 1 − C(n − cT, k)/C(n, k). Fix T. Since C(n − cT, k)/C(n, k) is monotone non-increasing in k for fixed cT (this is the standard pass@k monotonicity), PASS@(k, T) is non-decreasing in k. Now fix k. Any trajectory τ that terminates with the correct answer after at most T1 interaction rounds is a...

  24. [24]

    The data-generation, checkpointing, and evaluation infrastructure is already in place; the extensions require only additional compute, and we will include them in a revised version

    These robustness checks, particularly the temperature sweep, which controls for the possibility that RL training mechanically lowers the output entropy and thereby inflates PASS@(1,T) while deflating PASS@(64,T), are the most important next runs. The data-generation, checkpointing, and evaluation infrastructure is already in place; the extensions require ...