Inference-Time Budget Control for LLM Search Agents
Pith reviewed 2026-05-08 11:46 UTC · model grok-4.3
The pith
A two-stage controller uses Value-of-Information scores to allocate dual budgets during LLM search and final answer commitment in multi-hop QA.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is twofold. At search time, assigning each feasible action an operational Value-of-Information score, defined as the estimated marginal task value per unit budget given the current state and remaining dual limits, lets the agent choose more productively among retrieval, decomposition, and commitment. After search, a selective evidence-grounded finalizer corrects only low-risk answer-form errors. Together, the two stages are claimed to produce positive aggregate performance gains over audited baselines under identical hard dual-budget constraints.
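Read operationally, the search-time half of this claim is a greedy argmax-VOI loop over a dual-budget ledger. The sketch below is a toy reading, not the paper's method: the action costs, the diminishing-returns gain model, and the penalty form are all invented for illustration.

```python
from dataclasses import dataclass

# Illustrative sketch of the search-time stage. All costs, gains, and the
# penalty form are hypothetical stand-ins, not the paper's actual estimator.

@dataclass
class Action:
    name: str        # "retrieve", "decompose", or "commit"
    tool_cost: int   # tool calls consumed per use
    token_cost: int  # tokens consumed per use
    gain: float      # toy estimate of marginal task value

def voi(action, n_evidence, tool_left, token_left):
    """Toy VOI: marginal value per unit budget minus a budget-dependent
    penalty that grows as the remaining dual budget shrinks."""
    unit_cost = action.tool_cost + action.token_cost / 100.0
    # Diminishing returns on retrieval as evidence accumulates (invented).
    gain = action.gain / (1 + n_evidence) if action.name == "retrieve" else action.gain
    penalty = unit_cost / max(tool_left + token_left / 100.0, 1e-6)
    return gain / max(unit_cost, 1e-6) - penalty

def search_stage(actions, tool_budget, token_budget):
    """Repeatedly execute the argmax-VOI feasible action until the
    controller commits or a budget is exhausted."""
    trace, n_evidence = [], 0
    while True:
        feasible = [a for a in actions
                    if a.tool_cost <= tool_budget and a.token_cost <= token_budget]
        if not feasible:
            break
        best = max(feasible, key=lambda a: voi(a, n_evidence, tool_budget, token_budget))
        tool_budget -= best.tool_cost
        token_budget -= best.token_cost
        trace.append(best.name)
        if best.name == "commit":
            break
        n_evidence += 1
    return trace

actions = [
    Action("retrieve", tool_cost=1, token_cost=200, gain=4.0),
    Action("decompose", tool_cost=0, token_cost=400, gain=0.8),
    Action("commit", tool_cost=0, token_cost=100, gain=0.5),
]
trace = search_stage(actions, tool_budget=3, token_budget=2000)
# With these toy numbers the controller retrieves while marginal gain is
# high, then commits as diminishing returns meet the shrinking budget.
```

The design point is that commitment competes in the same ranking as retrieval and decomposition, so "stop searching" is itself a budget-scored action rather than a separate stopping rule.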
What carries the argument
The operational Value-of-Information (VOI) score, which estimates marginal task value per unit budget from the current search state and remaining dual budget to rank and select among retrieval, decomposition, and answer commitment actions.
Load-bearing premise
The Value-of-Information score can be computed reliably from the current search state and remaining dual budget without itself consuming prohibitive extra budget or introducing systematic bias into action selection.
What would settle it
An experiment on one of the multi-hop QA benchmarks in which the VOI controller produces lower accuracy than a simple fixed-allocation or round-robin baseline when both are forced to obey the same tool-call and token limits.
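Such a falsification test is only meaningful if both controllers are charged against the same ledger. A minimal harness for that protocol might look like the following; the policies and costs are invented, and only the hard-cap bookkeeping pattern is the point.

```python
# Sketch of the settling experiment's protocol: every controller draws from
# a shared ledger that enforces the hard dual caps. Policies and costs here
# are invented for illustration.

class BudgetLedger:
    def __init__(self, max_tool_calls, max_tokens):
        self.tool_calls, self.tokens = 0, 0
        self.max_tool_calls, self.max_tokens = max_tool_calls, max_tokens

    def can_afford(self, tool_cost, token_cost):
        return (self.tool_calls + tool_cost <= self.max_tool_calls
                and self.tokens + token_cost <= self.max_tokens)

    def charge(self, tool_cost, token_cost):
        if not self.can_afford(tool_cost, token_cost):
            raise RuntimeError("hard dual budget exceeded")
        self.tool_calls += tool_cost
        self.tokens += token_cost

def run_policy(policy, ledger, costs):
    """Run a policy until it commits or cannot afford its chosen action."""
    trace = []
    while True:
        action = policy(trace)
        tool_cost, token_cost = costs[action]
        if not ledger.can_afford(tool_cost, token_cost):
            break
        ledger.charge(tool_cost, token_cost)
        trace.append(action)
        if action == "commit":
            break
    return trace

# A fixed-allocation baseline: alternate retrieve/decompose, then commit.
def round_robin(trace):
    return "commit" if len(trace) >= 4 else ["retrieve", "decompose"][len(trace) % 2]

costs = {"retrieve": (1, 200), "decompose": (0, 400), "commit": (0, 100)}
ledger = BudgetLedger(max_tool_calls=4, max_tokens=2000)
trace = run_policy(round_robin, ledger, costs)
```

Running the VOI controller through the same `run_policy` and `BudgetLedger` would make any accuracy comparison budget-fair by construction, which is exactly what the settling experiment requires.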
Original abstract
LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit control over which search action should receive the next budget unit and when the accumulated evidence is sufficient to commit a final answer. We study this problem in multi-hop question answering (QA) and formulate it as two-stage inference-time budget control. At search time, our controller assigns each feasible action a task-level Value-of-Information (VOI) score, defined as an operational estimate of marginal task value per unit budget under the current search state and remaining dual budget, and uses this score to choose among retrieval, decomposition, and answer commitment. After search, a selective evidence-grounded finalizer compares the trajectory answer with a refined candidate and rewrites only when the residual error appears to be a low-risk answer-form error. Across four multi-hop QA benchmarks, three LLM backbones, and four budget levels, the method yields positive aggregate gains over four audited baselines under the same hard dual-budget protocol. Ablations show that search-time budget control, especially budget-dependent penalty, provides the main performance gain, while answer-time control helps mainly when the retrieval path is already adequate. These results suggest that inference-time budget control for LLM search agents should govern both how budget is spent during search and how the final answer is committed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a two-stage inference-time budget control method for LLM search agents in multi-hop QA—using a search-time controller to assign Value-of-Information (VOI) scores (operational estimates of marginal task value per unit budget, incorporating a budget-dependent penalty) to choose among retrieval, decomposition, and commitment actions, followed by an answer-time selective evidence-grounded finalizer—yields positive aggregate performance gains over four audited baselines. These gains are demonstrated across four multi-hop QA benchmarks, three LLM backbones, and four budget levels under a consistent hard dual-budget (tool calls and tokens) protocol, with ablations attributing the primary benefit to search-time control.
Significance. If the central claims hold after addressing overhead and bias concerns, the work would provide a practical mechanism for improving LLM agent performance under strict inference-time resource limits, with the multi-model, multi-benchmark evaluation and component ablations serving as useful empirical grounding. The explicit focus on dual budgets and the isolation of the budget-dependent penalty as a key factor represent strengths that could guide future agent designs.
major comments (2)
- [§3 (search-time controller and VOI definition)] The VOI score is defined as an operational estimate from the current state and remaining dual budget, with ablations crediting the budget-dependent penalty coefficient for most gains. However, the manuscript does not specify whether VOI computation (e.g., any LLM calls or heuristics) consumes tokens or tool calls from the hard dual budget; if this cost is not explicitly deducted and audited, the effective budget for the proposed method differs from that of the baselines, undermining the aggregate-gains claim.
- [Experimental protocol and ablations] The claim of positive gains under the same hard dual-budget protocol across four budgets and three backbones requires an explicit overhead audit table showing token/tool consumption for VOI scoring versus baselines. Without this, the results risk being partly artifactual due to unaccounted costs or selection bias favoring low-cost actions at low remaining budget.
minor comments (3)
- [Abstract] The four audited baselines are referenced but not named; explicitly listing them (e.g., in the abstract or §4) would improve reproducibility and context for the gains.
- [Notation] The exact formula for the budget-dependent penalty within the VOI score should be provided as an equation (currently it is described only operationally) to allow independent verification.
- [Results] Tables reporting aggregate gains should include per-budget and per-model breakdowns with variance measures to support the 'positive aggregate' claim and enable finer-grained analysis.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments on our work. We address the major concerns regarding the accounting of computational overhead for the VOI-based controller in the following point-by-point responses. We believe these clarifications and proposed revisions will strengthen the manuscript.
Point-by-point responses
-
Referee: [§3 (search-time controller and VOI definition)] The VOI score is defined as an operational estimate from the current state and remaining dual budget, with ablations crediting the budget-dependent penalty coefficient for most gains. However, the manuscript does not specify whether VOI computation (e.g., any LLM calls or heuristics) consumes tokens or tool calls from the hard dual budget; if not explicitly deducted and audited, the effective budget for the proposed method differs from baselines, undermining the aggregate gains claim.
Authors: We appreciate the referee pointing out this ambiguity in our description. Upon review, the VOI computation in our method is performed using a deterministic heuristic function that takes as input the current state features (such as number of hops, remaining budget, and evidence quality metrics) and the remaining dual budget. This heuristic does not involve any additional LLM inferences or tool calls; it is a closed-form calculation based on the budget-dependent penalty and estimated marginal gain. As such, it incurs no consumption from the hard dual budget. To address the concern, we will revise §3 to explicitly state this and provide the exact formula for the heuristic. Additionally, we will include an overhead analysis in the experimental protocol section. revision: yes
-
Referee: [Experimental protocol and ablations] The claim of positive gains under the same hard dual-budget protocol across four budgets and three backbones requires an explicit overhead audit table showing token/tool consumption for VOI scoring versus baselines. Without this, the results risk being partly artifactual due to unaccounted costs or selection bias favoring low-cost actions at low remaining budget.
Authors: We agree that providing an explicit overhead audit table is necessary to fully substantiate the claims under the hard dual-budget protocol. We will add a new table (e.g., Table X) that reports the average number of tokens and tool calls used specifically for VOI scoring computations across all experiments, for each budget level and backbone. Since our VOI heuristic is non-LLM based, we expect and will verify that this overhead is zero. This will confirm that the effective budget allocation is identical to the baselines and that the performance gains are not due to unaccounted resources or bias. revision: yes
Circularity Check
No significant circularity detected; method is a proposed heuristic evaluated empirically.
full rationale
The paper defines a practical two-stage budget controller that assigns VOI scores operationally from state and remaining dual budget, then reports aggregate gains on four benchmarks under a fixed protocol. No equation or step in the provided text reduces the VOI definition or performance claim to a self-referential fit, a parameter tuned on the evaluation data, or a self-citation chain. The derivation is an algorithmic proposal whose validity rests on external benchmark comparisons rather than internal re-derivation of its inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- budget-dependent penalty coefficient
axioms (1)
- domain assumption: The operational VOI score computed from the current state and remaining dual budget accurately ranks the expected marginal improvement of each feasible action.
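To make the ledger concrete, one hedged functional form the VOI score and its single free parameter might take (illustrative only; the excerpt never states the paper's actual equation):

```latex
% Illustrative form only, not the paper's stated definition.
% \widehat{\Delta V}(a \mid s): estimated marginal task value of action a in state s
% c(a): per-use cost of a in combined budget units
% B_{\mathrm{tool}}, B_{\mathrm{tok}}: remaining tool-call and token budgets
% \lambda: the budget-dependent penalty coefficient listed above
\mathrm{VOI}(a \mid s, B)
  = \frac{\widehat{\Delta V}(a \mid s)}{c(a)}
  - \lambda\, \phi\bigl(c(a),\, B_{\mathrm{tool}},\, B_{\mathrm{tok}}\bigr),
\qquad \phi \text{ increasing in } c(a),\ \text{decreasing in the remaining budgets.}
```

Under any form of this shape, the domain assumption above amounts to requiring that the ranking induced by VOI match the ranking of true expected marginal improvements.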
Reference graph
Works this paper leans on
-
[1]
Graph of thoughts: Solving elaborate problems with large language models
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024
2024
-
[3]
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2025
2025
-
[6]
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023
2023
-
[7]
Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors
Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2023
2023
-
[8]
Application-Aware Twin-in-the-Loop Planning for Federated Split Learning over Wireless Edge Networks
Zihao Ding, Beining Wu, Jun Huang, and Shiwen Mao. Application-aware twin-in-the-loop planning for federated split learning over wireless edge networks. arXiv preprint arXiv:2604.26105, 2026
2026
-
[9]
Promptbreeder: Self-referential self-improvement via prompt evolution
Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktaschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In ICML. OpenReview.net, 2024
2024
-
[12]
Token-Budget-Aware LLM Reasoning
Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, 2025
2025
-
[13]
Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020
2020
-
[14]
Distribution-Aligned Decoding for Efficient LLM Task Adaptation
Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Tak Wu Kwong, and Yuguang Fang. Distribution-aligned decoding for efficient LLM task adaptation. arXiv preprint arXiv:2509.15888, 2025
2025
-
[15]
Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward
Senkang Hu, Yong Dai, Yuzhi Zhao, Yihang Tao, Yu Guo, Zhengru Fang, Sam Tak Wu Kwong, and Yuguang Fang. Optimizing agentic reasoning with retrieval via synthetic semantic information gain reward. arXiv preprint arXiv:2602.00845, 2026
2026
-
[16]
Automated Design of Agentic Systems
Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems, 2024
2024
-
[16]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025
2025
-
[17]
DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations, 2024
2024
-
[18]
The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective
Jiin Kim, Byeongjun Shin, Jinha Chung, and Minsoo Rhu. The cost of dynamic reasoning: Demystifying AI agents and test-time scaling from an AI infrastructure perspective. arXiv preprint arXiv:2506.04301, 2025
-
[20]
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. CoRR, abs/2501.05366, 2025
2025
-
[22]
Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning
Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024
2024
-
[23]
Spend less, reason better: Budget-aware value tree search for llm agents, 2026
Yushu Li, Wenlong Deng, Jiajin Li, and Xiaoxiao Li. Spend less, reason better: Budget-aware value tree search for llm agents, 2026
2026
-
[24]
AutoFlow: Automated Workflow Generation for Large Language Model Agents
Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, and Yongfeng Zhang. AutoFlow: Automated workflow generation for large language model agents. CoRR, abs/2407.12821, 2024
-
[25]
SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning
Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Kai Jia, and Zhifang Sui. SelfBudgeter: Adaptive token allocation for efficient LLM reasoning. arXiv preprint arXiv:2505.11274, 2025
2025
-
[26]
Evolution of heuristics: Towards efficient automatic algorithm design using large language model
Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. arXiv preprint arXiv:2401.02051, 2024
-
[27]
Eoh-s: Evolution of heuristic set using llms for automated heuristic design
Fei Liu, Yilu Liu, Qingfu Zhang, Tong Xialiang, and Mingxuan Yuan. EoH-S: Evolution of heuristic set using LLMs for automated heuristic design. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 37090–37098, 2026
2026
-
[28]
Agentbench: Evaluating llms as agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024
2024
-
[29]
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. OctoTools: An agentic framework with extensible tools for complex reasoning. arXiv preprint arXiv:2502.11271, 2025
2025
-
[30]
Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Ruofan Lu, Yichen Li, and Yintong Huo. Exploring autonomous agents: A closer look at why they fail when completing tasks. arXiv preprint arXiv:2508.13143, 2025
2025
-
[31]
Self-refine: Iterative refinement with self-feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, 2023
2023
-
[34]
Introducing GPT-5.4 mini and nano
OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, March 2026. Accessed: 2026-05-05
2026
-
[36]
Measuring and Narrowing the Compositionality Gap in Language Models
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023
2023
-
[38]
Autoact: Automatic agent learning from scratch via self-planning
Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. Autoact: Automatic agent learning from scratch via self-planning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
2024
-
[39]
Mobile Edge Intelligence for Large Language Models: A Contemporary Survey
Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, and Kaibin Huang. Mobile edge intelligence for large language models: A contemporary survey. IEEE Communications Surveys & Tutorials, 27(6):3820–3860, 2025
2025
-
[40]
TrimCaching: Parameter-Sharing Edge Caching for AI Model Downloading
Guanqiao Qu, Zheng Lin, Qian Chen, Jian Li, Fangming Liu, Xianhao Chen, and Kaibin Huang. TrimCaching: Parameter-sharing edge caching for AI model downloading. IEEE Transactions on Networking, 2026
2026
-
[41]
Qwen3.5: Towards Native Multimodal Agents
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
2026
-
[42]
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023
2023
-
[43]
Agentsquare: Automatic llm agent search in modular design space, 2024
Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. Agentsquare: Automatic llm agent search in modular design space, 2024
2024
-
[44]
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, and Aviral Kumar. Thinking vs. doing: Agents that reason by scaling test-time interaction, 2025
2025
-
[45]
Reflexion: Language Agents with Verbal Reinforcement Learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023
2023
-
[46]
Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning
Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025
2025
-
[47]
VerifAI: Verified Generative AI
Nan Tang, Chenyu Yang, Ju Fan, Lei Cao, Yuyu Luo, and Alon Y. Halevy. VerifAI: Verified generative AI. In CIDR. www.cidrdb.org, 2024
2024
-
[48]
Proagentbench: Evaluating llm agents for proactive assistance with real-world data
Yuanbo Tang, Huaze Tang, Tingyu Cao, Lam Nguyen, Anping Zhang, Xinwen Cao, Chunkang Liu, Wenbo Ding, and Yang Li. ProAgentBench: Evaluating LLM agents for proactive assistance with real-world data. arXiv preprint arXiv:2602.04482, 2026
-
[49]
MuSiQue: Multihop Questions via Single-hop Question Composition
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022. doi: 10.1162/tacl_a_00475
-
[50]
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages …
-
[51]
Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling
Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li. Reasoning aware self-consistency: Leveraging reasoning paths for efficient LLM sampling. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, USA, …
-
[52]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023
2023
-
[54]
PromptAgent: Strategic Planning with Language Models Enables Expert-Level Prompt Optimization
Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, and Zhiting Hu. PromptAgent: Strategic planning with language models enables expert-level prompt optimization. In ICLR. OpenReview.net, 2024
2024
-
[55]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022
2022
-
[56]
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025
2025
-
[57]
From decoding to meta-generation: Inference-time algorithms for large language models
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. From decoding to meta-generation: Inference-time algorithms for large language models. Trans. Mach. Learn. Res., 2024
2024
-
[59]
Lifecycle-Aware Federated Continual Learning in Mobile Autonomous Systems
Beining Wu and Jun Huang. Lifecycle-aware federated continual learning in mobile autonomous systems. arXiv preprint arXiv:2604.20745, 2026
2026
-
[60]
A review of continual learning in edge AI
Beining Wu, Zihao Ding, and Jun Huang. A review of continual learning in edge AI. IEEE Transactions on Network Science and Engineering, 2026
2026
-
[62]
OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024
2024
-
[63]
Hyevo: Self-evolving hybrid agentic workflows for efficient reasoning, 2026
Beibei Xu, Yutong Ye, Chuyun Shen, Yingbo Zhou, Cheng Chen, and Mingsong Chen. Hyevo: Self-evolving hybrid agentic workflows for efficient reasoning, 2026
2026
-
[64]
Qwen3 Technical Report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
2025
-
[65]
SWE-agent: Agent-computer interfaces enable automated software engineering
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024
2024
-
[66]
HotpotQA: A dataset for diverse, explainable multi-hop question answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018
2018
-
[67]
Tree of thoughts: Deliberate problem solving with large language models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023
2023
-
[68]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023
2023
-
[69]
Evoflow: Evolving diverse agentic workflows on the fly, 2025
Guibin Zhang, Kaijie Chen, Guancheng Wan, Heng Chang, Hong Cheng, Kun Wang, Shuyue Hu, and Lei Bai. Evoflow: Evolving diverse agentic workflows on the fly, 2025
2025
-
[70]
Aflow: Automating agentic workflow generation, 2024
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation, 2024
2024
-
[71]
EcoAssistant: Using LLM Assistant More Affordably and Accurately
Jieyu Zhang, Ranjay Krishna, Ahmed H Awadallah, and Chi Wang. EcoAssistant: Using LLM assistant more affordably and accurately. arXiv preprint arXiv:2310.03046, 2023
-
[73]
Cost-awareness in tree-search llm planning: A systematic study, 2025
Zihao Zhang, Hui Wei, Kenan Jiang, Shijia Pan, Shu Kai, and Fei Liu. Cost-awareness in tree-search llm planning: A systematic study, 2025
2025
-
[74]
MermaidFlow: Redefining agentic workflow generation via safety-constrained evolutionary programming, 2025
Chengqi Zheng, Jianda Chen, Yueming Lyu, Wen Zheng Terence Ng, Haopeng Zhang, Yew-Soon Ong, Ivor Tsang, and Haiyan Yin. MermaidFlow: Redefining agentic workflow generation via safety-constrained evolutionary programming, 2025
2025
-
[75]
Debug like a human: A large language model debugger via verifying runtime execution step by step
Li Zhong, Zilong Wang, and Jingbo Shang. Debug like a human: A large language model debugger via verifying runtime execution step by step. In ACL (Findings), pages 851–870. Association for Computational Linguistics, 2024
2024
-
[76]
Language agent tree search unifies reasoning, acting, and planning in language models
Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. arXiv preprint arXiv:2310.04406, 2023
-
[77]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023
2023
-
[78]
OAgents: An Empirical Study of Building Effective Agents
He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, and Wangchunshu Zhou. OAgents: An empirical study of building effective agents. CoRR, abs/2506.15741, 2025. doi: 10.48550/arXiv.2506.15741
-
[80]
Scaling test-time compute for LLM agents
King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, et al. Scaling test-time compute for LLM agents. arXiv preprint arXiv:2506.12928, 2025