pith. machine review for the scientific record.

arxiv: 2605.05701 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Inference-Time Budget Control for LLM Search Agents

Hongyao Liu, Jun Huang, Mengzhe Ruan, Senkang Forest Hu, Yihang Tao, Yuguang Fang, Yu Guo, Zhengru Fang, Zhonghao Chang

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM search agents · multi-hop QA · budget control · Value-of-Information · inference-time optimization · tool use · dual budget constraints

The pith

A two-stage controller uses Value-of-Information scores to allocate dual budgets during LLM search and final answer commitment in multi-hop QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for LLM search agents that must answer multi-hop questions while respecting hard limits on both tool calls and generated tokens. It introduces a search-time controller that scores each possible next action by its estimated marginal value to the final answer per unit of remaining budget, then chooses retrieval, decomposition, or answer commitment accordingly. After the search trajectory ends, a selective finalizer compares the trajectory answer against a refined candidate and rewrites only when the difference looks like a low-risk form error. Experiments across four benchmarks, three model families, and four budget levels show positive aggregate gains over four baselines that operate under the same strict dual-budget rules. Ablations indicate that the dynamic scoring during search drives most of the improvement.
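
Read as a control loop, that description suggests roughly the following structure. This is a minimal sketch for orientation only: the paper publishes no implementation, and every name here (Action, Budget, run_agent, and the voi_score/step/finalize callables) is hypothetical.

```python
# Minimal sketch of the two-stage loop described above. The control structure
# follows the review's description; all names and signatures are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    RETRIEVE = auto()   # spend a tool call to fetch more evidence
    DECOMPOSE = auto()  # spend tokens to split the question into sub-questions
    COMMIT = auto()     # commit a final answer from the current evidence

@dataclass
class Budget:
    tool_calls: int  # remaining hard tool-call budget
    tokens: int      # remaining hard token budget

def run_agent(state, budget, voi_score, step, finalize):
    """Stage 1: while budget remains, take the feasible action with the
    highest VOI score. Stage 2: hand the trajectory to a selective finalizer
    that may rewrite only low-risk answer-form errors."""
    while budget.tool_calls > 0 and budget.tokens > 0:
        action = max(Action, key=lambda a: voi_score(a, state, budget))
        state, budget = step(action, state, budget)  # execute and deduct cost
        if action is Action.COMMIT:
            break
    return finalize(state)  # compares trajectory answer with refined candidate
```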

Core claim

The central claim is that assigning each feasible search action an operational Value-of-Information score, defined as the estimated marginal task value per unit budget given the current state and the remaining dual limits, allows the agent to choose more productively among retrieval, decomposition, and commitment, while a post-search selective evidence-grounded finalizer corrects only low-risk answer-form errors. Together these produce positive aggregate performance gains over audited baselines under identical hard dual-budget constraints.

What carries the argument

The operational Value-of-Information (VOI) score, which estimates marginal task value per unit budget from the current search state and remaining dual budget to rank and select among retrieval, decomposition, and answer commitment actions.

Load-bearing premise

The Value-of-Information score can be computed reliably from the current search state and remaining dual budget without itself consuming prohibitive extra budget or introducing systematic bias into action selection.

What would settle it

An experiment on one of the multi-hop QA benchmarks in which the VOI controller produces lower accuracy than a simple fixed-allocation or round-robin baseline when both are forced to obey the same tool-call and token limits.
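
That decisive test can be made concrete with a small harness. The sketch below reuses the hypothetical run_agent, Budget, and Action from the earlier sketch and assumes that step() increments state["steps"] once per executed action; none of this is the paper's actual protocol.

```python
# Hypothetical harness for the decisive comparison: identical hard dual
# budgets for both controllers; only the action-selection rule differs.
# Assumes step() increments state["steps"] once per executed action.
def round_robin_policy(action, state, budget):
    order = [Action.RETRIEVE, Action.DECOMPOSE, Action.COMMIT]
    want = order[state.get("steps", 0) % len(order)]
    return 1.0 if action is want else 0.0

def exact_match_accuracy(policy, dataset, step, finalize,
                         budget_spec=(8, 4096)):
    hits = 0
    for question, gold in dataset:
        state = {"question": question, "evidence": [], "steps": 0}
        pred = run_agent(state, Budget(*budget_spec), policy, step, finalize)
        hits += int(pred == gold)
    return hits / len(dataset)

# The claim would fail on a benchmark where, at the same budget_spec,
# exact_match_accuracy(round_robin_policy, ...) beats the VOI policy's score.
```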

Original abstract

LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit control over which search action should receive the next budget unit and when the accumulated evidence is sufficient to commit a final answer. We study this problem in multi-hop question answering (QA) and formulate it as two-stage inference-time budget control. At search time, our controller assigns each feasible action a task-level Value-of-Information (VOI) score, defined as an operational estimate of marginal task value per unit budget under the current search state and remaining dual budget, and uses this score to choose among retrieval, decomposition, and answer commitment. After search, a selective evidence-grounded finalizer compares the trajectory answer with a refined candidate and rewrites only when the residual error appears to be a low-risk answer-form error. Across four multi-hop QA benchmarks, three LLM backbones, and four budget levels, the method yields positive aggregate gains over four audited baselines under the same hard dual-budget protocol. Ablations show that search-time budget control, especially budget-dependent penalty, provides the main performance gain, while answer-time control helps mainly when the retrieval path is already adequate. These results suggest that inference-time budget control for LLM search agents should govern both how budget is spent during search and how the final answer is committed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that a two-stage inference-time budget control method for LLM search agents in multi-hop QA yields positive aggregate performance gains over four audited baselines. The method pairs a search-time controller, which assigns each feasible action a Value-of-Information (VOI) score (an operational estimate of marginal task value per unit budget, incorporating a budget-dependent penalty) to choose among retrieval, decomposition, and commitment, with an answer-time selective evidence-grounded finalizer. The gains are demonstrated across four multi-hop QA benchmarks, three LLM backbones, and four budget levels under a consistent hard dual-budget (tool calls and tokens) protocol, with ablations attributing the primary benefit to search-time control.

Significance. If the central claims hold after addressing overhead and bias concerns, the work would provide a practical mechanism for improving LLM agent performance under strict inference-time resource limits, with the multi-model, multi-benchmark evaluation and component ablations serving as useful empirical grounding. The explicit focus on dual budgets and the isolation of the budget-dependent penalty as a key factor represent strengths that could guide future agent designs.

major comments (2)
  1. [§3, search-time controller and VOI definition] The VOI score is defined as an operational estimate from the current state and remaining dual budget, with ablations crediting the budget-dependent penalty for most of the gains. However, the manuscript does not specify whether computing the VOI score (e.g., via LLM calls or heuristics) consumes tokens or tool calls from the hard dual budget; if that cost is not explicitly deducted and audited, the effective budget for the proposed method differs from the baselines', undermining the aggregate-gains claim.
  2. [Experimental protocol and ablations] The claim of positive gains under the same hard dual-budget protocol across four budget levels and three backbones requires an explicit overhead audit table showing token and tool-call consumption for VOI scoring versus the baselines. Without it, the results risk being partly artifactual, due either to unaccounted costs or to selection bias favoring low-cost actions when little budget remains.
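
One hypothetical form the requested audit could take is a ledger that charges every token and tool call, including any spent on scoring itself, against the same hard limits, so that method and baselines are compared at identical effective budgets. Nothing below comes from the paper.

```python
# Hypothetical dual-budget ledger for the audit requested above: all
# consumption, including any overhead of VOI scoring, is charged against
# the same hard limits and logged per purpose for the audit table.
from dataclasses import dataclass, field

@dataclass
class DualBudgetLedger:
    tool_calls: int
    tokens: int
    log: list = field(default_factory=list)

    def charge(self, purpose, tool_calls=0, tokens=0):
        if tool_calls > self.tool_calls or tokens > self.tokens:
            raise RuntimeError(f"hard budget exceeded at: {purpose}")
        self.tool_calls -= tool_calls
        self.tokens -= tokens
        self.log.append((purpose, tool_calls, tokens))

    def audit(self):
        """Per-purpose totals, e.g. how much VOI scoring itself consumed."""
        totals = {}
        for purpose, tc, tk in self.log:
            a, b = totals.get(purpose, (0, 0))
            totals[purpose] = (a + tc, b + tk)
        return totals

ledger = DualBudgetLedger(tool_calls=8, tokens=4096)
ledger.charge("retrieval", tool_calls=1, tokens=120)
ledger.charge("voi_scoring", tokens=0)  # should stay zero if closed-form
print(ledger.audit())
```
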
minor comments (3)
  1. [Abstract] The four audited baselines are referenced but never named; listing them explicitly (in the abstract or §4) would improve reproducibility and give context for the gains.
  2. [Notation] The exact formula for the budget-dependent penalty within the VOI score should be given as an equation (it is currently described only operationally) to allow independent verification; one plausible form is sketched after this list.
  3. [Results] Tables reporting aggregate gains should include per-budget and per-model breakdowns with variance measures, to support the 'positive aggregate' claim and enable finer-grained analysis.
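
Purely to fix ideas for minor comment 2 (the paper gives no equation, so the functional form below is an assumption consistent with the operational description), one plausible shape of the score is

  \mathrm{VOI}(a \mid s, b) = \frac{\widehat{\Delta V}(a \mid s)}{c(a)} - \lambda(b)\, c(a), \qquad a^{*} = \arg\max_a \mathrm{VOI}(a \mid s, b),

where \widehat{\Delta V}(a \mid s) is the estimated marginal task value of action a in state s, c(a) is its budget cost, b = (b_{\text{tool}}, b_{\text{tok}}) is the remaining dual budget, and \lambda(b) is the budget-dependent penalty, presumably increasing as b shrinks.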

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our work. We address the major concerns regarding the accounting of computational overhead for the VOI-based controller in the following point-by-point responses. We believe these clarifications and proposed revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3, search-time controller and VOI definition] The VOI score is defined as an operational estimate from the current state and remaining dual budget, with ablations crediting the budget-dependent penalty for most of the gains. However, the manuscript does not specify whether computing the VOI score (e.g., via LLM calls or heuristics) consumes tokens or tool calls from the hard dual budget; if that cost is not explicitly deducted and audited, the effective budget for the proposed method differs from the baselines', undermining the aggregate-gains claim.

    Authors: We appreciate the referee pointing out this ambiguity in our description. Upon review, the VOI computation in our method is performed using a deterministic heuristic function that takes as input the current state features (such as number of hops, remaining budget, and evidence quality metrics) and the remaining dual budget. This heuristic does not involve any additional LLM inferences or tool calls; it is a closed-form calculation based on the budget-dependent penalty and estimated marginal gain. As such, it incurs no consumption from the hard dual budget. To address the concern, we will revise §3 to explicitly state this and provide the exact formula for the heuristic. Additionally, we will include an overhead analysis in the experimental protocol section. revision: yes

  2. Referee: [Experimental protocol and ablations] The claim of positive gains under the same hard dual-budget protocol across four budget levels and three backbones requires an explicit overhead audit table showing token and tool-call consumption for VOI scoring versus the baselines. Without it, the results risk being partly artifactual, due either to unaccounted costs or to selection bias favoring low-cost actions when little budget remains.

    Authors: We agree that providing an explicit overhead audit table is necessary to fully substantiate the claims under the hard dual-budget protocol. We will add a new table (e.g., Table X) that reports the average number of tokens and tool calls used specifically for VOI scoring computations across all experiments, for each budget level and backbone. Since our VOI heuristic is non-LLM based, we expect and will verify that this overhead is zero. This will confirm that the effective budget allocation is identical to the baselines and that the performance gains are not due to unaccounted resources or bias. revision: yes
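
A minimal sketch of what the closed-form heuristic described in response 1 could look like, using the features the authors name (hop count, remaining budget, evidence quality); the functional form and constants are invented for illustration, not taken from the paper.

```python
# Hypothetical closed-form VOI heuristic in the spirit of the authors'
# response: pure arithmetic over state features, no LLM or tool calls,
# hence zero consumption from the hard dual budget. The functional form
# and the default penalty weight are invented for illustration.
def voi_heuristic(est_gain, action_cost, hops_remaining,
                  evidence_quality, budget_left, budget_total, lam=0.5):
    budget_frac = budget_left / max(budget_total, 1)
    # Budget-dependent penalty: expensive actions are penalized more
    # heavily as the remaining budget shrinks.
    penalty = lam * (1.0 - budget_frac) * action_cost
    # Expected gain is discounted when evidence is already strong and
    # few reasoning hops remain.
    need = (1.0 - evidence_quality) * max(hops_remaining, 1)
    return est_gain * need / max(action_cost, 1e-9) - penalty
```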

Circularity Check

0 steps flagged

No significant circularity detected; method is a proposed heuristic evaluated empirically.

Full rationale

The paper defines a practical two-stage budget controller that assigns VOI scores operationally from state and remaining dual budget, then reports aggregate gains on four benchmarks under a fixed protocol. No equation or step in the provided text reduces the VOI definition or performance claim to a self-referential fit, a parameter tuned on the evaluation data, or a self-citation chain. The derivation is an algorithmic proposal whose validity rests on external benchmark comparisons rather than internal re-derivation of its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the ability to define and estimate a VOI score that genuinely reflects marginal value; this introduces at least one domain assumption and likely one or more free parameters for the penalty or threshold used in action selection.

free parameters (1)
  • budget-dependent penalty coefficient
    Ablations highlight that the budget-dependent penalty provides the main gain, implying a tunable scalar that must be chosen or fitted for each budget level and model.
axioms (1)
  • domain assumption: The operational VOI score computed from the current state and remaining dual budget accurately ranks the expected marginal improvement of each feasible action.
    Invoked when the controller uses the score to choose among retrieval, decomposition, and commitment.

pith-pipeline@v0.9.0 · 5582 in / 1361 out tokens · 24755 ms · 2026-05-08T11:46:37.603953+00:00 · methodology

