pith. machine review for the scientific record.

arxiv: 2605.05701 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Inference-Time Budget Control for LLM Search Agents

Hongyao Liu, Jun Huang, Mengzhe Ruan, Senkang Forest Hu, Yihang Tao, Yuguang Fang, Yu Guo, Zhengru Fang, Zhonghao Chang

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM search agents · multi-hop QA · budget control · Value-of-Information · inference-time optimization · tool use · dual budget constraints

The pith

A two-stage controller uses Value-of-Information scores to allocate dual budgets during LLM search and final answer commitment in multi-hop QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for LLM search agents that must answer multi-hop questions while respecting hard limits on both tool calls and generated tokens. It introduces a search-time controller that scores each possible next action by its estimated marginal value to the final answer per unit of remaining budget, then chooses retrieval, decomposition, or answer commitment accordingly. After the search trajectory ends, a selective finalizer compares the trajectory answer against a refined candidate and rewrites only when the difference looks like a low-risk form error. Experiments across four benchmarks, three model families, and four budget levels show positive aggregate gains over four baselines that operate under the same strict dual-budget rules. Ablations indicate that the dynamic scoring during search drives most of the improvement.
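
Read as a control loop, that description suggests roughly the following structure. This is a minimal sketch for orientation only: the paper publishes no implementation, and every name here (Action, Budget, run_agent, and the voi_score/step/finalize callables) is hypothetical.

```python
# Minimal sketch of the two-stage loop described above. The control structure
# follows the review's description; all names and signatures are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    RETRIEVE = auto()   # spend a tool call to fetch more evidence
    DECOMPOSE = auto()  # spend tokens to split the question into sub-questions
    COMMIT = auto()     # commit a final answer from the current evidence

@dataclass
class Budget:
    tool_calls: int  # remaining hard tool-call budget
    tokens: int      # remaining hard token budget

def run_agent(state, budget, voi_score, step, finalize):
    """Stage 1: while budget remains, take the feasible action with the
    highest VOI score. Stage 2: hand the trajectory to a selective finalizer
    that may rewrite only low-risk answer-form errors."""
    while budget.tool_calls > 0 and budget.tokens > 0:
        action = max(Action, key=lambda a: voi_score(a, state, budget))
        state, budget = step(action, state, budget)  # execute and deduct cost
        if action is Action.COMMIT:
            break
    return finalize(state)  # compares trajectory answer with refined candidate
```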

Core claim

The central claim is that assigning each feasible search action an operational Value-of-Information score, defined as the estimated marginal task value per unit budget given the current state and the remaining dual limits, allows the agent to choose more productively among retrieval, decomposition, and commitment, while a post-search selective evidence-grounded finalizer corrects only low-risk answer-form errors. Together these produce positive aggregate performance gains over audited baselines under identical hard dual-budget constraints.

What carries the argument

The operational Value-of-Information (VOI) score, which estimates marginal task value per unit budget from the current search state and remaining dual budget to rank and select among retrieval, decomposition, and answer commitment actions.

Load-bearing premise

The Value-of-Information score can be computed reliably from the current search state and remaining dual budget without itself consuming prohibitive extra budget or introducing systematic bias into action selection.

What would settle it

An experiment on one of the multi-hop QA benchmarks in which the VOI controller produces lower accuracy than a simple fixed-allocation or round-robin baseline when both are forced to obey the same tool-call and token limits.
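
That decisive test can be made concrete with a small harness. The sketch below reuses the hypothetical run_agent, Budget, and Action from the earlier sketch and assumes that step() increments state["steps"] once per executed action; none of this is the paper's actual protocol.

```python
# Hypothetical harness for the decisive comparison: identical hard dual
# budgets for both controllers; only the action-selection rule differs.
# Assumes step() increments state["steps"] once per executed action.
def round_robin_policy(action, state, budget):
    order = [Action.RETRIEVE, Action.DECOMPOSE, Action.COMMIT]
    want = order[state.get("steps", 0) % len(order)]
    return 1.0 if action is want else 0.0

def exact_match_accuracy(policy, dataset, step, finalize,
                         budget_spec=(8, 4096)):
    hits = 0
    for question, gold in dataset:
        state = {"question": question, "evidence": [], "steps": 0}
        pred = run_agent(state, Budget(*budget_spec), policy, step, finalize)
        hits += int(pred == gold)
    return hits / len(dataset)

# The claim would fail on a benchmark where, at the same budget_spec,
# exact_match_accuracy(round_robin_policy, ...) beats the VOI policy's score.
```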

Original abstract

LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit control over which search action should receive the next budget unit and when the accumulated evidence is sufficient to commit a final answer. We study this problem in multi-hop question answering (QA) and formulate it as two-stage inference-time budget control. At search time, our controller assigns each feasible action a task-level Value-of-Information (VOI) score, defined as an operational estimate of marginal task value per unit budget under the current search state and remaining dual budget, and uses this score to choose among retrieval, decomposition, and answer commitment. After search, a selective evidence-grounded finalizer compares the trajectory answer with a refined candidate and rewrites only when the residual error appears to be a low-risk answer-form error. Across four multi-hop QA benchmarks, three LLM backbones, and four budget levels, the method yields positive aggregate gains over four audited baselines under the same hard dual-budget protocol. Ablations show that search-time budget control, especially budget-dependent penalty, provides the main performance gain, while answer-time control helps mainly when the retrieval path is already adequate. These results suggest that inference-time budget control for LLM search agents should govern both how budget is spent during search and how the final answer is committed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that a two-stage inference-time budget control method for LLM search agents in multi-hop QA yields positive aggregate performance gains over four audited baselines. The method pairs a search-time controller, which assigns each feasible action a Value-of-Information (VOI) score (an operational estimate of marginal task value per unit budget, incorporating a budget-dependent penalty) to choose among retrieval, decomposition, and commitment, with an answer-time selective evidence-grounded finalizer. The gains are demonstrated across four multi-hop QA benchmarks, three LLM backbones, and four budget levels under a consistent hard dual-budget (tool calls and tokens) protocol, with ablations attributing the primary benefit to search-time control.

Significance. If the central claims hold after addressing overhead and bias concerns, the work would provide a practical mechanism for improving LLM agent performance under strict inference-time resource limits, with the multi-model, multi-benchmark evaluation and component ablations serving as useful empirical grounding. The explicit focus on dual budgets and the isolation of the budget-dependent penalty as a key factor represent strengths that could guide future agent designs.

major comments (2)
  1. [§3, search-time controller and VOI definition] The VOI score is defined as an operational estimate from the current state and remaining dual budget, with ablations crediting the budget-dependent penalty for most of the gains. However, the manuscript does not specify whether computing the VOI score (e.g., via LLM calls or heuristics) consumes tokens or tool calls from the hard dual budget; if that cost is not explicitly deducted and audited, the effective budget for the proposed method differs from the baselines', undermining the aggregate-gains claim.
  2. [Experimental protocol and ablations] The claim of positive gains under the same hard dual-budget protocol across four budget levels and three backbones requires an explicit overhead audit table showing token and tool-call consumption for VOI scoring versus the baselines. Without it, the results risk being partly artifactual, due either to unaccounted costs or to selection bias favoring low-cost actions when little budget remains.
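
One hypothetical form the requested audit could take is a ledger that charges every token and tool call, including any spent on scoring itself, against the same hard limits, so that method and baselines are compared at identical effective budgets. Nothing below comes from the paper.

```python
# Hypothetical dual-budget ledger for the audit requested above: all
# consumption, including any overhead of VOI scoring, is charged against
# the same hard limits and logged per purpose for the audit table.
from dataclasses import dataclass, field

@dataclass
class DualBudgetLedger:
    tool_calls: int
    tokens: int
    log: list = field(default_factory=list)

    def charge(self, purpose, tool_calls=0, tokens=0):
        if tool_calls > self.tool_calls or tokens > self.tokens:
            raise RuntimeError(f"hard budget exceeded at: {purpose}")
        self.tool_calls -= tool_calls
        self.tokens -= tokens
        self.log.append((purpose, tool_calls, tokens))

    def audit(self):
        """Per-purpose totals, e.g. how much VOI scoring itself consumed."""
        totals = {}
        for purpose, tc, tk in self.log:
            a, b = totals.get(purpose, (0, 0))
            totals[purpose] = (a + tc, b + tk)
        return totals

ledger = DualBudgetLedger(tool_calls=8, tokens=4096)
ledger.charge("retrieval", tool_calls=1, tokens=120)
ledger.charge("voi_scoring", tokens=0)  # should stay zero if closed-form
print(ledger.audit())
```
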
minor comments (3)
  1. [Abstract] The four audited baselines are referenced but never named; listing them explicitly (in the abstract or §4) would improve reproducibility and give context for the gains.
  2. [Notation] The exact formula for the budget-dependent penalty within the VOI score should be given as an equation (it is currently described only operationally) to allow independent verification; one plausible form is sketched after this list.
  3. [Results] Tables reporting aggregate gains should include per-budget and per-model breakdowns with variance measures, to support the 'positive aggregate' claim and enable finer-grained analysis.
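
Purely to fix ideas for minor comment 2 (the paper gives no equation, so the functional form below is an assumption consistent with the operational description), one plausible shape of the score is

  \mathrm{VOI}(a \mid s, b) = \frac{\widehat{\Delta V}(a \mid s)}{c(a)} - \lambda(b)\, c(a), \qquad a^{*} = \arg\max_a \mathrm{VOI}(a \mid s, b),

where \widehat{\Delta V}(a \mid s) is the estimated marginal task value of action a in state s, c(a) is its budget cost, b = (b_{\text{tool}}, b_{\text{tok}}) is the remaining dual budget, and \lambda(b) is the budget-dependent penalty, presumably increasing as b shrinks.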

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our work. We address the major concerns regarding the accounting of computational overhead for the VOI-based controller in the following point-by-point responses. We believe these clarifications and proposed revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3, search-time controller and VOI definition] The VOI score is defined as an operational estimate from the current state and remaining dual budget, with ablations crediting the budget-dependent penalty for most of the gains. However, the manuscript does not specify whether computing the VOI score (e.g., via LLM calls or heuristics) consumes tokens or tool calls from the hard dual budget; if that cost is not explicitly deducted and audited, the effective budget for the proposed method differs from the baselines', undermining the aggregate-gains claim.

    Authors: We appreciate the referee pointing out this ambiguity in our description. Upon review, the VOI computation in our method is performed using a deterministic heuristic function that takes as input the current state features (such as number of hops, remaining budget, and evidence quality metrics) and the remaining dual budget. This heuristic does not involve any additional LLM inferences or tool calls; it is a closed-form calculation based on the budget-dependent penalty and estimated marginal gain. As such, it incurs no consumption from the hard dual budget. To address the concern, we will revise §3 to explicitly state this and provide the exact formula for the heuristic. Additionally, we will include an overhead analysis in the experimental protocol section. revision: yes

  2. Referee: [Experimental protocol and ablations] The claim of positive gains under the same hard dual-budget protocol across four budget levels and three backbones requires an explicit overhead audit table showing token and tool-call consumption for VOI scoring versus the baselines. Without it, the results risk being partly artifactual, due either to unaccounted costs or to selection bias favoring low-cost actions when little budget remains.

    Authors: We agree that providing an explicit overhead audit table is necessary to fully substantiate the claims under the hard dual-budget protocol. We will add a new table (e.g., Table X) that reports the average number of tokens and tool calls used specifically for VOI scoring computations across all experiments, for each budget level and backbone. Since our VOI heuristic is non-LLM based, we expect and will verify that this overhead is zero. This will confirm that the effective budget allocation is identical to the baselines and that the performance gains are not due to unaccounted resources or bias. revision: yes
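
A minimal sketch of what the closed-form heuristic described in response 1 could look like, using the features the authors name (hop count, remaining budget, evidence quality); the functional form and constants are invented for illustration, not taken from the paper.

```python
# Hypothetical closed-form VOI heuristic in the spirit of the authors'
# response: pure arithmetic over state features, no LLM or tool calls,
# hence zero consumption from the hard dual budget. The functional form
# and the default penalty weight are invented for illustration.
def voi_heuristic(est_gain, action_cost, hops_remaining,
                  evidence_quality, budget_left, budget_total, lam=0.5):
    budget_frac = budget_left / max(budget_total, 1)
    # Budget-dependent penalty: expensive actions are penalized more
    # heavily as the remaining budget shrinks.
    penalty = lam * (1.0 - budget_frac) * action_cost
    # Expected gain is discounted when evidence is already strong and
    # few reasoning hops remain.
    need = (1.0 - evidence_quality) * max(hops_remaining, 1)
    return est_gain * need / max(action_cost, 1e-9) - penalty
```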

Circularity Check

0 steps flagged

No significant circularity detected; method is a proposed heuristic evaluated empirically.

Full rationale

The paper defines a practical two-stage budget controller that assigns VOI scores operationally from state and remaining dual budget, then reports aggregate gains on four benchmarks under a fixed protocol. No equation or step in the provided text reduces the VOI definition or performance claim to a self-referential fit, a parameter tuned on the evaluation data, or a self-citation chain. The derivation is an algorithmic proposal whose validity rests on external benchmark comparisons rather than internal re-derivation of its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the ability to define and estimate a VOI score that genuinely reflects marginal value; this introduces at least one domain assumption and likely one or more free parameters for the penalty or threshold used in action selection.

free parameters (1)
  • budget-dependent penalty coefficient
    Ablations highlight that the budget-dependent penalty provides the main gain, implying a tunable scalar that must be chosen or fitted for each budget level and model.
axioms (1)
  • domain assumption: The operational VOI score computed from the current state and remaining dual budget accurately ranks the expected marginal improvement of each feasible action.
    Invoked when the controller uses the score to choose among retrieval, decomposition, and commitment.

pith-pipeline@v0.9.0 · 5582 in / 1361 out tokens · 24755 ms · 2026-05-08T11:46:37.603953+00:00 · methodology

