Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning
Pith reviewed 2026-05-08 17:11 UTC · model grok-4.3
The pith
Viewing rollout state selection as monotone submodular maximization gives a greedy selector with 1-1/e guarantee and closed-form UUCB scores for fixed-budget agentic RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Any budget-agnostic independent sampler suffers a collapse rate bounded away from zero on hard prompts, regardless of the budget. Recasting intermediate state selection as a monotone submodular maximization problem yields a greedy one-step selector with a 1-1/e approximation guarantee. The uncertainty-aware upper confidence bound (UUCB) terms arise directly as closed-form marginal gains of this objective, converting an empirical entropy bonus into an analytic consequence. The InfoTree training-time tree-search framework couples this selector with a learned adaptive budget allocator and asynchronous speculative expansion, delivering measurable gains on nine benchmarks while remaining orthogonal to rollout reuse and trajectory re-weighting.
What carries the argument
The monotone submodular objective defined on expected non-vanishing policy-gradient mass from a rollout set, whose one-step marginal gains supply the closed-form UUCB selection rule.
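The greedy machinery behind the 1-1/e claim is the classic Nemhauser-Wolsey-Fisher result. A minimal sketch of such a selector on a toy coverage objective (the paper's actual objective is its RIFB gradient mass, which is not reproduced here; the names below are illustrative):

```python
def greedy_select(candidates, f, budget):
    """Greedy maximization of a monotone submodular set function f
    under a cardinality budget; the greedy set achieves at least
    (1 - 1/e) of the optimum (Nemhauser, Wolsey, and Fisher, 1978)."""
    selected = []
    remaining = list(candidates)
    for _ in range(budget):
        if not remaining:
            break
        # Pick the candidate with the largest one-step marginal gain.
        best = max(remaining, key=lambda x: f(selected + [x]) - f(selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy objective: coverage f(S) = |union of covered elements|,
# a standard monotone submodular function.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 5}}
f = lambda S: len(set().union(*(covers[s] for s in S))) if S else 0
print(greedy_select(list(covers), f, 2))  # ['a', 'b']
```

In the paper's setting the marginal gain is claimed to equal the closed-form UUCB score, so the `max` step above would reduce to ranking states by UUCB.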
If this is right
- The adaptive budget allocator rescues prompts whose initial trees produce uniform outcomes, lifting the mixed-outcome ratio from 58.1 percent to 76.3 percent with less than 5 percent extra budget.
- Speculative expansion tolerates bounded staleness in UUCB scores and cuts wall-clock overhead from 14.3 percent to 4.8 percent.
- The selector operates orthogonally to rollout reuse and trajectory re-weighting, so head-to-head composition with Tree-GRPO prefix sharing and CW-GRPO weights yields further gains.
- A 5 by 5 by 5 robustness grid shows more than three quarters of the hyperparameter space lies on a performance plateau.
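The abstract does not reproduce Eq. 5, so the following is only an illustrative shape for an uncertainty-aware UCB score: a value estimate plus a visit-count exploration term plus a β-weighted entropy bonus. The functional form, argument names, and default constants are all assumptions, not the paper's exact formula:

```python
import math

def uucb_score(q, n, n_parent, entropy, c=1.0, beta=0.5):
    """Illustrative uncertainty-aware UCB score (assumed form, not the
    paper's exact Eq. 5). q: mean value estimate, n: node visits,
    n_parent: parent visits, entropy: token-level policy entropy;
    beta scales the uncertainty bonus."""
    exploration = c * math.sqrt(math.log(n_parent + 1) / (n + 1))
    return q + exploration + beta * entropy

# At equal value and visit counts, the higher-entropy state scores
# higher, directing budget toward more uncertain branches.
low = uucb_score(q=0.2, n=3, n_parent=10, entropy=0.1)
high = uucb_score(q=0.2, n=3, n_parent=10, entropy=1.2)
print(high > low)  # True
```

Bounded staleness, as exploited by speculative expansion, would mean evaluating this score with slightly outdated `q` and `n` while fresh statistics are computed asynchronously.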
Where Pith is reading between the lines
- The submodular reduction may extend to other fixed-compute sequential sampling problems in which informativeness must be maximized without prior knowledge of difficulty.
- Closed-form marginal gains could reduce the need for manual entropy tuning across related policy-gradient methods.
- Gains on math reasoning, web agents, and tool-rich coding tasks suggest the approach generalizes to broader classes of tool-augmented reasoning agents.
Load-bearing premise
The chosen objective measuring rollout informativeness is monotone and submodular, and the derived UUCB scores exactly equal its marginal gains.
What would settle it
A concrete counterexample in which the greedy selector on a hard-prompt set falls materially below the 1-1/e performance bound, or in which the UUCB scores deviate from independently computed marginal gains of the objective, would falsify the reduction.
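For small instances, the proposed falsification test can be run directly by comparing greedy selection against the brute-force optimum. The sketch below uses a plausible monotone submodular surrogate (probability that at least one rollout is informative under independent per-rollout probabilities), not the paper's exact RIFB objective:

```python
import math
from itertools import combinations

def coverage_prob(S, p):
    """P(at least one rollout in S is informative), independent probs."""
    return 1.0 - math.prod(1.0 - p[r] for r in S)

def greedy(p, k):
    """Greedy one-step selection of k rollouts by marginal gain."""
    S = set()
    for _ in range(k):
        S.add(max(set(p) - S,
                  key=lambda x: coverage_prob(S | {x}, p) - coverage_prob(S, p)))
    return coverage_prob(S, p)

def optimum(p, k):
    """Brute-force optimum over all size-k subsets."""
    return max(coverage_prob(set(c), p) for c in combinations(p, k))

p = {"r1": 0.9, "r2": 0.5, "r3": 0.5, "r4": 0.05}
g, opt = greedy(p, 2), optimum(p, 2)
print(g >= (1 - 1 / math.e) * opt)  # True: greedy meets the 1-1/e bound
```

A counterexample of the kind the review asks for would make the final comparison print False, which cannot happen if the objective is genuinely monotone submodular.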
Original abstract
We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnostic independent sampler suffers a collapse rate bounded away from zero for hard prompts regardless of the budget. Motivated by this, we recast intermediate state selection as a monotone submodular maximization problem, where a greedy one-step selector enjoys a 1 minus 1/e approximation guarantee. Our Uncertainty-aware Upper Confidence Bound (UUCB) terms arise as closed-form marginal gains of this objective. This turns the token-level entropy bonus from an empirical trick into an analytic consequence of the formulation. We present InfoTree, a training-time tree-search framework coupling UUCB with a learned Adaptive Budget Allocator (ABA) and an asynchronous Speculative Expansion scheme. ABA rescues prompts whose initial tree is wasted on uniform outcomes, lifting the mixed-outcome ratio from 58.1 percent to 76.3 percent with less than 5 percent budget overhead. Speculative Expansion reduces wall-clock overhead from 14.3 percent to 4.8 percent by tolerating bounded staleness in UUCB scores. Across nine benchmarks spanning math reasoning (AIME 2024 and 2025, MATH-500, OlympiadBench, USAMO), web-search agents (GAIA, HLE-100, BrowseComp-lite), and tool-rich coding and OS agents (APPS-verified, AgentBench-OS), InfoTree outperforms flat GRPO, DeepSearch, Tree-GRPO, AT2PO, CW-GRPO, and RC-GRPO. Head-to-head compositions with Tree-GRPO prefix sharing and CW-GRPO contribution weights deliver further gains, confirming that our selector operates orthogonally to rollout reuse and trajectory re-weighting. A 5 by 5 by 5 robustness grid reveals that over three quarters of the hyperparameter space lies on a performance plateau, confirming UUCB robustness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set contributes to Group Relative Policy Optimization (GRPO). It proves that budget-agnostic independent samplers incur a collapse rate bounded away from zero on hard prompts. It recasts intermediate state selection as monotone submodular maximization of this objective, for which greedy selection enjoys a (1-1/e) approximation guarantee. The Uncertainty-aware Upper Confidence Bound (UUCB) is derived as the closed-form marginal gain of the submodular objective, turning the token-level entropy bonus into an analytic consequence. The resulting InfoTree framework couples UUCB with a learned Adaptive Budget Allocator (ABA) and asynchronous Speculative Expansion; experiments on nine benchmarks (AIME 2024/2025, MATH-500, OlympiadBench, USAMO, GAIA, HLE-100, BrowseComp-lite, APPS-verified, AgentBench-OS) show gains over flat GRPO and several tree-based baselines, with further orthogonal improvements when combined with prefix sharing or contribution weighting, plus a 5×5×5 robustness grid indicating a broad performance plateau.
Significance. If the submodularity claim and the exact identification of UUCB with marginal gains hold, the work supplies a principled, approximation-guaranteed selector for tree search in agentic RL that explains an empirical trick analytically and delivers measurable gains with modest overhead. The explicit proofs of collapse rate and submodularity, the reproducible robustness grid, and the demonstration that the selector is orthogonal to rollout reuse and re-weighting are concrete strengths. The empirical scope across math, web-search, and tool-use coding/OS agents is appropriate for the claims.
Major comments (2)
- [§3.2–3.3] §3.2–3.3 (RIFB definition and submodularity theorem): The manuscript asserts that the rollout informativeness set function is monotone and submodular and that its marginal gain f(S ∪ {x}) − f(S) simplifies exactly to the stated UUCB expression. Neither the explicit set-function definition nor the algebraic steps establishing the cancellation for arbitrary finite sets S of rollouts are exhibited. This step is load-bearing for both the 1−1/e guarantee and the claim that UUCB is an “analytic consequence” rather than an empirical heuristic.
- [§2.1] §2.1 (collapse-rate theorem): The proof that any budget-agnostic independent sampler suffers a collapse rate bounded away from zero for hard prompts relies on a specific characterization of “non-vanishing policy-gradient mass” under GRPO. The argument does not address how the relative normalization inside GRPO’s advantage estimator affects the mass for prompts whose initial rollouts are uniformly zero-reward; a short counter-example or explicit bound under the GRPO objective would be required to make the claim fully rigorous.
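The referee's first request, exhibiting the set function and verifying its properties, can be mechanized for small instances. A hedged sketch of a brute-force diminishing-returns check, applied to an "at least one informative rollout" surrogate rather than the paper's exact RIFB objective:

```python
import math
from itertools import combinations

def is_submodular(f, ground):
    """Brute-force diminishing-returns check:
    f(S | {x}) - f(S) >= f(T | {x}) - f(T) for all S <= T, x not in T."""
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for T in subsets:
        for S in (s for s in subsets if s <= T):
            for x in ground - T:
                if f(S | {x}) - f(S) < f(T | {x}) - f(T) - 1e-12:
                    return False
    return True

# Surrogate objective: probability that at least one rollout in S is
# informative, under independent per-rollout probabilities.
p = {"r1": 0.3, "r2": 0.5, "r3": 0.1}
f = lambda S: 1.0 - math.prod(1.0 - p[r] for r in S)
print(is_submodular(f, frozenset(p)))  # True
```

This only certifies a toy surrogate on a tiny ground set; the manuscript would still need the algebraic proof for arbitrary finite rollout sets that the referee requests.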
Minor comments (3)
- [Table 2, §5.3] Table 2 and §5.3: The mixed-outcome ratio improvement (58.1 % → 76.3 %) and wall-clock overhead reduction (14.3 % → 4.8 %) are reported without per-run standard deviations or the exact number of independent seeds; adding these would strengthen the empirical claims.
- [Eq. (5)] Notation: The hyper-parameter β appearing in the UUCB formula (Eq. 5) is introduced without an explicit range or sensitivity analysis beyond the robustness grid; a brief remark on its interpretation as a scaling factor for uncertainty would improve clarity.
- [§1.2] The paper cites prior tree-search RL methods (DeepSearch, Tree-GRPO, etc.) but does not discuss how the submodular formulation differs from existing information-gain or entropy-based selection criteria in the literature; a short related-work paragraph would help readers situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review, including the positive assessment of the work's significance, the submodularity claim, the robustness grid, and the empirical scope across math, web, and tool-use domains. We address the two major comments point by point below.
Point-by-point responses
-
Referee: [§3.2–3.3] §3.2–3.3 (RIFB definition and submodularity theorem): The manuscript asserts that the rollout informativeness set function is monotone and submodular and that its marginal gain f(S ∪ {x}) − f(S) simplifies exactly to the stated UUCB expression. Neither the explicit set-function definition nor the algebraic steps establishing the cancellation for arbitrary finite sets S of rollouts are exhibited. This step is load-bearing for both the 1−1/e guarantee and the claim that UUCB is an “analytic consequence” rather than an empirical heuristic.
Authors: We agree that the explicit definition of the RIFB set function and the full algebraic derivation of the marginal gains should be presented more transparently. In the revised manuscript we will add the precise definition f(S) := expected non-vanishing policy-gradient mass contributed by rollout set S, followed by the complete step-by-step expansion of f(S ∪ {x}) − f(S) that shows the cancellations yielding the UUCB expression for arbitrary finite S. This will make the monotonicity, submodularity, and (1−1/e) guarantee fully self-contained and reinforce that UUCB is an analytic consequence of the objective. revision: yes
-
Referee: [§2.1] §2.1 (collapse-rate theorem): The proof that any budget-agnostic independent sampler suffers a collapse rate bounded away from zero for hard prompts relies on a specific characterization of “non-vanishing policy-gradient mass” under GRPO. The argument does not address how the relative normalization inside GRPO’s advantage estimator affects the mass for prompts whose initial rollouts are uniformly zero-reward; a short counter-example or explicit bound under the GRPO objective would be required to make the claim fully rigorous.
Authors: The collapse-rate argument in §2.1 is formulated in terms of the probability that a rollout set produces at least one non-zero reward, which directly determines non-vanishing mass under GRPO. For the uniform-zero-reward case the relative normalization indeed yields zero advantages within that group, yet the bound concerns the expectation over the outcome distribution. In the revision we will insert both a short counter-example (independent sampling on a hard prompt with all-zero initial rollouts) and an explicit lower bound on the collapse rate that explicitly incorporates GRPO’s group-relative normalization, confirming the rate remains bounded away from zero. revision: yes
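The rebuttal's point about uniform-zero-reward groups is visible in GRPO's group-relative advantage itself. A minimal sketch of the standard mean-centered, std-normalized advantage (a generic GRPO formulation, not code from the paper):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in standard GRPO:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    m = sum(rewards) / len(rewards)
    var = sum((r - m) ** 2 for r in rewards) / len(rewards)
    return [(r - m) / (var ** 0.5 + eps) for r in rewards]

# A uniformly zero-reward group yields all-zero advantages, so the
# policy gradient from this prompt vanishes: the "collapse" event.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
# A mixed-outcome group yields non-zero learning signal.
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
```

Under this formulation the collapse rate is exactly the probability that all rollouts in the group land in the uniform-reward case, which is the quantity the authors propose to lower-bound.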
Circularity Check
No circularity; derivation chain is self-contained.
Full rationale
The paper first defines RIFB explicitly as expected non-vanishing policy-gradient mass under GRPO, proves an independent collapse bound, then separately posits a monotone submodular set function on rollout informativeness whose marginal gains are algebraically shown to equal the UUCB expression. This is a standard modeling-plus-derivation sequence rather than a redefinition or fit; the 1-1/e guarantee follows from the submodularity claim once the marginal equality is established. ABA and Speculative Expansion are auxiliary learned/heuristic modules whose validation is external to the core selector proof. No self-citations, ansatzes smuggled via prior work, or fitted inputs renamed as predictions appear in the load-bearing steps.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Adaptive Budget Allocator parameters
Axioms (1)
- Domain assumption: the rollout informativeness objective is monotone submodular
Invented entities (3)
- Rollout Informativeness under a Fixed Budget (RIFB): no independent evidence
- Uncertainty-aware Upper Confidence Bound (UUCB): no independent evidence
- Adaptive Budget Allocator (ABA): no independent evidence
Reference graph
Works this paper leans on
- [1] Anonymous. Jackpot: Optimal budgeted rejection sampling for extreme actor-policy discrepancy. arXiv preprint arXiv:2602.06107, 2026.
- [2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- [3] Jeff Bamba, Liwen Fang, et al. XRPO: Pushing the limits of GRPO with targeted exploration and exploitation. arXiv preprint arXiv:2510.06672, 2025.
- [4] Guillaume M. J.-B. Chaslot, Mark H. M. Winands, and H. Jaap van den Herik. Parallel Monte-Carlo tree search. In Computers and Games, pages 60–71. Springer, 2008.
- [5] Xidong Feng, Ziyu Wan, Muning Wen, Stephen M. McAleer, Ying Wen, Weinan Zhang, and Jun Wang. AlphaZero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
- [6] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
- [7] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023.
- [8] Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for LLM agent reinforcement learning. arXiv preprint arXiv:2509.21240, 2025.
- [9] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Springer, 2006.
- [10] Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents. arXiv preprint arXiv:2407.01476, 2024.
- [11] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.
- [12] Baohao Liao et al. Self-hinting language models enhance reinforcement learning. arXiv preprint arXiv:2602.03143, 2026.
- [13] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
- [14] Yixiu Mao et al. Dynamics-predictive sampling for active RL finetuning of large reasoning models. arXiv preprint arXiv:2603.10887, 2026.
- [15] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.
- [16] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [17] G. Sheng, C. Zhang, Z. Ye, et al. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.
- [18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- [19] Xiaoqing Tang, Jiepeng Gao, Xiaochen Li, Zhifang Dai, Bingyang Li, and Bingbing Xu. MBA-RAG: A bandit approach for adaptive retrieval-augmented generation through question complexity. In Proceedings of the 31st International Conference on Computational Linguistics (COLING), 2025.
- [20] Nurbek Tastan, Stefanos Laskaridis, et al. MoSE: Mixture of slimmable experts for efficient and adaptive inference. arXiv preprint arXiv:2602.06154, 2026.
- [21] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
- [22] Jiaxuan Wang, Zhengyan Xi, and Yu Yang. Enhancing LLM-based search agents via contribution weighted group relative policy optimization. arXiv preprint arXiv:2604.14267, 2026.
- [23] Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, and Yejin Choi. DeepSearch: Overcome the bottleneck of reinforcement learning with verifiable rewards via Monte Carlo tree search. arXiv preprint arXiv:2509.25454, 2025.
- [24]
- [25]
- [26] Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. Planning with large language models for code generation. In International Conference on Learning Representations, 2023.
- [27] Tao Zhang, Kunlun Li, and Jiani Li. TreePS-RAG: Tree-based process supervision for reinforcement learning in agentic RAG. arXiv preprint arXiv:2601.06922, 2026.
- [28] Zhi Zhang, Zhuofeng Han, and Costas Mavromatis. Train less, learn more: Adaptive efficient rollout optimization for group-based reinforcement learning. arXiv preprint arXiv:2602.14338, 2026.
- [29] Zhiyuan Zhao, Yifan Yang, et al. ECHO: Entropy-confidence hybrid optimization for test-time reinforcement learning. arXiv preprint arXiv:2602.02150, 2026.
- [30] Hao Zhong, Junjie Zhai, and Li Song. RC-GRPO: Reward-conditioned group relative policy optimization for multi-turn tool calling agents. arXiv preprint arXiv:2602.03025, 2026.
- [31] Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, and Jie Jiang. AT2PO: Agentic turn-based policy optimization via tree search. arXiv preprint arXiv:2601.04767, 2026.