Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary
Pith reviewed 2026-05-19 11:31 UTC · model grok-4.3
The pith
Agents should invoke external tools only when internal reasoning cannot reliably complete the task alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's central position is that agents should invoke external tools only when epistemically necessary, where epistemic necessity means a task cannot be completed reliably via the agent's internal reasoning over its current context without any external interaction. The Theory of Agent (ToA) framework models agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common failure modes such as overthinking and overacting arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. This supplies a normative criterion for tool use that works as
What carries the argument
The Theory of Agent (ToA) framework, which treats every agent step as a sequential decision on whether to resolve remaining uncertainty internally or delegate it externally.
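One way to picture this step-by-step decision is the loop below. It is a minimal sketch, not the paper's method: the paper gives no operational procedure, so the confidence estimator, the 0.8 threshold, the step budget, and every function name here are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str        # "internal" or "external"
    confidence: float  # estimated confidence at the time of the decision


def run_agent(
    estimate_confidence: Callable[[List[str]], float],  # hypothetical estimator
    reason_internally: Callable[[List[str]], str],
    call_tool: Callable[[List[str]], str],
    context: List[str],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
    max_steps: int = 5,      # assumed budget on external calls
) -> Tuple[str, List[Step]]:
    """Answer internally once confidence clears the threshold; otherwise delegate."""
    trace: List[Step] = []
    for _ in range(max_steps):
        conf = estimate_confidence(context)
        if conf >= threshold:
            # The current context is judged sufficient: resolve internally.
            trace.append(Step("internal", conf))
            return reason_internally(context), trace
        # Remaining uncertainty is delegated: one tool call extends the context.
        trace.append(Step("external", conf))
        context = context + [call_tool(context)]
    # Budget exhausted: fall back to the best available internal answer.
    trace.append(Step("internal", estimate_confidence(context)))
    return reason_internally(context), trace
```

The returned trace records which steps were resolved internally and which were delegated, which is the kind of record an audit of the agent's calibration would need.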
If this is right
- Training would shift focus toward teaching agents to recognize when internal resolution is still possible.
- Evaluation would add explicit penalties or metrics for unnecessary external calls.
- Agent architectures would embed checks that ask whether the current context already supports a reliable internal answer.
- Long-term agent capability would improve because repeated internal resolution strengthens the model's own reasoning.
- System costs would drop by eliminating tool invocations that add no new information.
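The evaluation point in the list above could be made concrete along the following lines. This is an illustrative sketch under strong assumptions: it presumes each episode comes with a count of tool calls that were actually necessary (a label the paper does not say how to obtain), and the linear penalty of 0.1 per unnecessary call is an arbitrary choice. The names `Episode` and `necessity_adjusted_score` are hypothetical.

```python
from typing import Iterable, NamedTuple


class Episode(NamedTuple):
    success: bool
    tool_calls: int       # external calls the agent actually made
    necessary_calls: int  # assumed oracle label: calls that were truly needed


def necessity_adjusted_score(episodes: Iterable[Episode], penalty: float = 0.1) -> float:
    """Mean task success minus a linear penalty for each unnecessary tool call."""
    total = 0.0
    count = 0
    for ep in episodes:
        unnecessary = max(0, ep.tool_calls - ep.necessary_calls)
        total += (1.0 if ep.success else 0.0) - penalty * unnecessary
        count += 1
    return total / count if count else 0.0
```

Under this scoring, two agents with the same success rate are separated by how often they delegate without need, which plain task success cannot show.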
Where Pith is reading between the lines
- The same necessity test could be applied to decide when an agent should generate new internal training data instead of querying outside sources.
- Benchmarks could track whether agents under this rule show measurable gains in zero-tool accuracy over successive tasks.
- The principle may extend to multi-agent settings where one agent decides whether to query another or reason alone.
- Deployment pipelines could log uncertainty levels at each step to audit and refine the calibration process.
Load-bearing premise
Common agent failures such as overthinking and overacting come from miscalibrated uncertainty decisions rather than from weak reasoning or poor tool use.
What would settle it
A side-by-side test in which one set of agents is restricted to tool calls only after internal reasoning has been shown insufficient, while a control set uses tools freely, then measuring both immediate task success and later performance on tool-free versions of similar tasks.
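The comparison described above might be summarized as follows. This is only a sketch of the bookkeeping, assuming per-task boolean outcomes are already available for each arm; the function names and the dictionary keys are hypothetical, and no significance testing is included.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks solved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def summarize_arm(immediate: List[bool], tool_free_followup: List[bool]) -> Dict[str, float]:
    """Summarize one arm: immediate task success and later tool-free success."""
    return {
        "immediate_success": success_rate(immediate),
        "tool_free_success": success_rate(tool_free_followup),
    }


def compare_arms(restricted: Dict[str, float], control: Dict[str, float]) -> Dict[str, float]:
    """Restricted-arm rate minus control-arm rate, per metric."""
    return {key: restricted[key] - control[key] for key in restricted}
```

A positive difference on the tool-free metric with little or no loss on immediate success would be the pattern that favors the paper's position.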
Original abstract
As large language models evolve into tool-augmented agents, a central question remains unresolved: when is external tool use actually justified? Existing agent frameworks typically treat tools as ordinary actions and optimize for task success or reward, offering little principled distinction between epistemically necessary interaction and unnecessary delegation. This position paper argues that agents should invoke external tools only when epistemically necessary. Here, epistemic necessity means that a task cannot be completed reliably via the agent's internal reasoning over its current context, without any external interaction. We introduce the Theory of Agent (ToA), a framework that treats agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common agent failure modes (e.g., overthinking and overacting) arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. We further discuss implications for training, evaluation, and agent design, highlighting that unnecessary delegation not only causes inefficiency but can impede the development of internal reasoning capability. Our position provides a normative criterion for tool use that complements existing decision-theoretic models and is essential for building agents that are not only correct, but increasingly intelligent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a position paper arguing that tool-augmented agents should invoke external tools only when epistemically necessary, defined as cases where a task cannot be completed reliably via the agent's internal reasoning over its current context without external interaction. It introduces the Theory of Agent (ToA) framework, which models agent behavior as sequential decisions under uncertainty about whether to resolve remaining uncertainty internally or delegate externally. The paper claims that failures such as overthinking and overacting stem from miscalibrated uncertainty decisions rather than deficiencies in reasoning or tool use, and discusses implications for training, evaluation, and design to avoid unnecessary delegation that could hinder internal capability development.
Significance. If the normative position holds, it supplies a principled criterion for tool invocation that complements reward- or success-optimized agent frameworks, with potential to improve efficiency, reduce unnecessary external calls, and encourage the development of stronger internal reasoning in agents. The conceptual framing via ToA offers a fresh perspective on agent decision-making under uncertainty that could inform future design and evaluation protocols.
major comments (1)
- [Theory of Agent (ToA) framework] The ToA framework (described in the main text following the abstract) treats agents as making sequential decisions on uncertainty resolution but provides no operational mechanism—such as an uncertainty metric, confidence threshold, self-evaluation protocol, or decision procedure—for determining epistemic necessity or sufficiency of internal reasoning. This renders the central recommendation non-actionable in practice and introduces a risk of circularity, since assessing whether internal reasoning suffices may itself require external interaction or tools.
minor comments (1)
- [Abstract] The abstract and introduction could more clearly distinguish the proposed normative criterion from existing decision-theoretic approaches to set expectations for readers familiar with RL or planning literature.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our position paper. We address the major comment on the Theory of Agent framework below, acknowledging its conceptual nature while outlining a targeted revision.
Referee: The ToA framework (described in the main text following the abstract) treats agents as making sequential decisions on uncertainty resolution but provides no operational mechanism—such as an uncertainty metric, confidence threshold, self-evaluation protocol, or decision procedure—for determining epistemic necessity or sufficiency of internal reasoning. This renders the central recommendation non-actionable in practice and introduces a risk of circularity, since assessing whether internal reasoning suffices may itself require external interaction or tools.
Authors: We agree that the ToA framework is presented at a high-level conceptual stage without specifying concrete operational mechanisms such as explicit uncertainty metrics, thresholds, or self-evaluation protocols. This is consistent with the scope of a position paper, whose primary aim is to articulate a normative criterion for tool invocation rather than to deliver a ready-to-implement algorithm. The risk of circularity is a substantive concern that merits explicit discussion. To strengthen the manuscript, we will add a dedicated subsection under the ToA framework that outlines plausible pathways for operationalization. These include (i) leveraging existing LLM calibration and self-consistency methods to estimate internal sufficiency, (ii) defining epistemic necessity via a threshold on predictive entropy or token-level uncertainty, and (iii) running iterative internal verification loops that avoid external calls until a stopping criterion is met. We will also note that any such mechanism remains an open research question and that the framework itself is intended to guide the design of such procedures rather than presuppose them.
Revision: yes
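Pathways (i) and (ii) from the response above can be illustrated together. The sketch below is an assumption-laden example, not a mechanism from the paper: it treats several independently sampled internal answers as an empirical distribution, computes the entropy of that distribution (in nats) and the agreement rate of the most common answer, and flags a tool call when either signal crosses a cutoff. The cutoffs 0.5 and 0.7 and all function names are hypothetical, and answer-level entropy stands in here for the token-level uncertainty the authors mention.

```python
import math
from collections import Counter
from typing import Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def agreement_rate(samples: Sequence[str]) -> float:
    """Share of samples that match the most common answer (self-consistency)."""
    if not samples:
        return 0.0
    return Counter(samples).most_common(1)[0][1] / len(samples)


def tool_call_needed(
    samples: Sequence[str],
    max_entropy: float = 0.5,    # assumed cutoff
    min_agreement: float = 0.7,  # assumed cutoff
) -> bool:
    """Flag external delegation when sampled internal answers look unreliable."""
    if not samples:
        return True  # no internal evidence at all
    n = len(samples)
    probs = [count / n for count in Counter(samples).values()]
    too_uncertain = predictive_entropy(probs) > max_entropy
    too_inconsistent = agreement_rate(samples) < min_agreement
    return too_uncertain or too_inconsistent
```

Because the check uses only the model's own samples, it does not itself require an external call, although whether such signals are well calibrated is the open question the authors acknowledge.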
Circularity Check
No significant circularity in the conceptual framework
The paper is a position paper that introduces a normative criterion for tool invocation based on epistemic necessity, defined directly as the inability to complete a task reliably via internal reasoning over current context. It frames agent behavior via the Theory of Agent as sequential decisions under uncertainty but provides no equations, fitted parameters, or derivations that reduce any claim to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are renamed. The argument remains self-contained as a perspective that complements existing models without forcing the central claim through definitional equivalence or statistical forcing.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Agents make sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally.
invented entities (1)
- Theory of Agent (ToA): no independent evidence
Forward citations
Cited by 5 Pith papers
- Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
  Terminal-World is a skill-based synthesis pipeline that generates 5,723 training environments and produces Terminal-World-32B which outperforms baselines on Terminal-Bench 2.0 using only 1.2% of the data.
- Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
  Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.
- MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security
  MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.
- Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
  LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
  Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
Reference graph
Works this paper leans on
-
[1]
Noam Kolt. Governing ai agents. arXiv preprint arXiv:2501.07913, 2025
-
[2]
Travelplanner: A benchmark for real-world planning with language agents
Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622, 2024
-
[3]
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024
work page 2024
-
[4]
Hongru Wang, Rui Wang, Boyang Xue, Heming Xia, Jingtao Cao, Zeming Liu, Jeff Z. Pan, and Kam-Fai Wong. AppBench: Planning of multiple APIs from various APPs for complex user instruction. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15322–15336, M...
work page 2024
-
[5]
Ui-tars: Pioneering automated gui interaction with native agents, 2025
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...
work page 2025
-
[6]
Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools,
Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research. arXiv preprint arXiv:2502.04644, 2025
-
[7]
Thao Nguyen, Tiara Torres-Flores, Changhyun Hwang, Carl Edwards, Ying Diao, and Heng Ji. Glad: Synergizing molecular graphs and language descriptors for enhanced power conver- sion efficiency prediction in organic photovoltaic devices. In Proc. 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024), 2024. 10
work page 2024
-
[8]
Synergpt: In-context learning for personalized drug synergy prediction and drug design
Carl Edwards, Aakanksha Naik, Tushar Khot, Martin Burke, Heng Ji, and Tom Hope. Synergpt: In-context learning for personalized drug synergy prediction and drug design. In Proc. 1st Conference on Language Modeling (COLM2024), 2024
work page 2024
-
[9]
Grzybowski, Bowen Jin, Ying Diao, Jiawei Han, Ge Liu, Hao Peng, Martin D
Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Chetan Kumar Prasad, Sara Szymkuc, Bartosz A. Grzybowski, Bowen Jin, Ying Diao, Jiawei Han, Ge Liu, Hao Peng, Martin D. Burke, and Heng Ji. mclm: A function-infused and synthesis-friendly modular chemical language model. In arxiv, 2025
work page 2025
-
[10]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023
work page 2023
-
[11]
Self-discover: Large language models self-compose reasoning structures
Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V Le, Ed Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. Self-discover: Large language models self-compose reasoning structures. Advances in Neural Information Processing Systems, 37:126032–126058, 2024
work page 2024
-
[12]
Tora: A tool-integrated reasoning agent for mathematical problem solving
Zhibin Gou, Zhihong Shao, Yeyun Gong, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen, et al. Tora: A tool-integrated reasoning agent for mathematical problem solving. In The Twelfth International Conference on Learning Representations
-
[13]
Start: Self-taught reasoner with tools, 2025
Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, and Dayiheng Liu. Start: Self-taught reasoner with tools, 2025
work page 2025
-
[14]
A path towards autonomous machine intelligence version 0.9
Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022
work page 2022
-
[15]
Reasoning with Language Model is Planning with World Model
Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhit- ing Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Xiaoqian Liu, Xingzhou Lou, Jianbin Jiao, and Junge Zhang. Position: Foundation agents as the paradigm shift for decision making. arXiv preprint arXiv:2405.17009, 2024
-
[17]
A survey on large language model based autonomous agents
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), March 2024
work page 2024
-
[18]
The ecological approach to visual perception: classic edition
James J Gibson. The ecological approach to visual perception: classic edition. Psychology press, 2014
work page 2014
-
[19]
What are cognitive tools? In Cognitive tools for learning , pages 1–6
David H Jonassen. What are cognitive tools? In Cognitive tools for learning , pages 1–6. Springer, 1992
work page 1992
-
[20]
Piet AM Kommers, David H Jonassen, and J Terry Mayes.Cognitive tools for learning. Springer, 1992
work page 1992
-
[21]
Self-reasoning language models: Unfold hidden reasoning chains with few reasoning catalyst
W ANG Hongru, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z Pan, Zeming Liu, and Kam-Fai Wong. Self-reasoning language models: Unfold hidden reasoning chains with few reasoning catalyst. In Workshop on Reasoning and Planning for Large Language Models, 2025
work page 2025
-
[22]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[23]
Chameleon: Plug-and-play compositional reasoning with large language models
Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36:43447–43478, 2023
work page 2023
-
[24]
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. Octo- tools: An agentic framework with extensible tools for complex reasoning. arXiv preprint arXiv:2502.11271, 2025. 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Tpe: Towards better compositional reasoning over cognitive tools via multi-persona collaboration
Hongru Wang, Huimin Wang, Lingzhi Wang, Minda Hu, Rui Wang, Boyang Xue, Yongfeng Huang, and Kam-Fai Wong. Tpe: Towards better compositional reasoning over cognitive tools via multi-persona collaboration. In Natural Language Processing and Chinese Computing: 13th National CCF Conference, NLPCC 2024, Hangzhou, China, November 1–3, 2024, Proceedings, Part II...
work page 2024
-
[26]
Cue-CoT: Chain-of-thought prompting for responding to in-depth dialogue questions with LLMs
Hongru Wang, Rui Wang, Fei Mi, Yang Deng, Zezhong Wang, Bin Liang, Ruifeng Xu, and Kam-Fai Wong. Cue-CoT: Chain-of-thought prompting for responding to in-depth dialogue questions with LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12047–12064, Singapore, December 20...
work page 2023
-
[27]
Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings
Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36:45870–45894, 2023
work page 2023
-
[28]
Albrecht, Peter Bell, and Amos Storkey
Dongge Han, Trevor McInroe, Adam Jelley, Stefano V . Albrecht, Peter Bell, and Amos Storkey. LLM-personalize: Aligning LLM planners with human preferences via reinforced self-training for housekeeping robots. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al- Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st Inter...
-
[30]
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025
DeepSeek-AI Team. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025
work page 2025
-
[31]
Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025
Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025
work page 2025
-
[32]
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural ...
work page 2019
-
[34]
Knowledgeable or educated guess? revisiting language models as knowledge bases
Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, and Jin Xu. Knowledgeable or educated guess? revisiting language models as knowledge bases. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joi...
work page 2021
-
[35]
Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP), pages 5418–5426, Online, November 2020. Association for Computational Linguistics
work page 2020
-
[36]
A comprehensive survey of continual learning: Theory, method and application
Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024
work page 2024
-
[37]
Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, and Jianfeng Gao. A survey on post-training of ...
work page 2025
-
[38]
Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. 12
work page 2020
-
[39]
Physics of language models: Part 3.3, knowledge capacity scaling laws
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. In The Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[40]
Investigating the factual knowledge boundary of large language models with retrieval augmentation
Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Investigating the factual knowledge boundary of large language models with retrieval augmentation. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International C...
work page 2025
-
[41]
Knowledge boundary of large language models: A survey, 2024
Moxin Li, Yong Zhao, Yang Deng, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, and Tat-Seng Chua. Knowledge boundary of large language models: A survey, 2024
work page 2024
-
[42]
Xunjian Yin, Xu Zhang, Jie Ruan, and Xiaojun Wan. Benchmarking knowledge boundary for large language models: A different perspective on model evaluation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 2270–2286, Bangkok, Thai...
work page 2024
-
[43]
Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. Can we edit factual knowledge by in-context learning? In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4862–4876, Singapore, December 2023. Association for Computa- tiona...
work page 2023
-
[44]
Editing language model-based knowledge graph embeddings
Siyuan Cheng, Ningyu Zhang, Bozhong Tian, Xi Chen, Qingbin Liu, and Huajun Chen. Editing language model-based knowledge graph embeddings. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17835–17843, 2024
work page 2024
-
[45]
InstructEd: Soft- instruction tuning for model editing with hops
XiaoQi Han, Ru Li, Xiaoli Li, Jiye Liang, Zifang Zhang, and Jeff Pan. InstructEd: Soft- instruction tuning for model editing with hops. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 14953–14968, Bangkok, Thailand, August 2024. Association for Computational Linguistics
work page 2024
-
[46]
Knowledge editing for large language models: A survey, 2024
Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. Knowledge editing for large language models: A survey, 2024
work page 2024
-
[47]
Editing large language models: Problems, methods, and opportunities
Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. Editing large language models: Problems, methods, and opportunities. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10222–10240, Singapore, December
work page 2023
-
[48]
Association for Computational Linguistics
-
[49]
Hongru Wang, Boyang Xue, Baohang Zhou, Tianhua Zhang, Cunxiang Wang, Huimin Wang, Guanhua Chen, and Kam-Fai Wong. Self-DC: When to reason and when to act? self divide-and- conquer for compositional unknown questions. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associ...
work page 2025
-
[50]
Smart: Self-aware agent for tool overuse mitigation, 2025
Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Smart: Self-aware agent for tool overuse mitigation, 2025
work page 2025
-
[51]
AbouElhamayed, Yueying Li, and Mohamed S
Yash Akhauri, Anthony Fei, Chi-Chih Chang, Ahmed F. AbouElhamayed, Yueying Li, and Mohamed S. Abdelfattah. Splitreason: Learning to offload reasoning, 2025
work page 2025
- [52]
-
[53]
Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations? arXiv preprint arXiv:2405.05904, 2024. 13
-
[54]
Otc: Optimal tool calls via reinforcement learning, 2025
Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Otc: Optimal tool calls via reinforcement learning, 2025
work page 2025
-
[55]
Training language models to reason efficiently, 2025
Daman Arora and Andrea Zanette. Training language models to reason efficiently, 2025
work page 2025
-
[56]
Embodied agent interface: Benchmarking llms for embodied decision making
Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems , 37:100428–100534, 2024
work page 2024
-
[57]
Torl: Scaling tool-integrated rl, 2025
Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl, 2025
work page 2025
-
[58]
Autoagents: A framework for automatic agent generation.arXiv preprint arXiv:2309.17288, 2023
Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023
-
[59]
Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, and Mengdi Wang. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025
work page 2025
-
[60]
On path to multimodal historical reasoning: Histbench and histagent, 2025
Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu...
work page 2025
-
[61]
AFlow: Automating Agentic Workflow Generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
Toolllm: Facilitating large language models to master 16000+ real-world apis
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The Twelfth International Conference on Learning Representations
-
[63]
Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen. Agent-r: Train- ing language model agents to reflect via iterative self-training.arXiv preprint arXiv:2501.11425, 2025
-
[64]
Overcoming catastrophic forgetting in neural networks
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017
work page 2017
-
[65]
Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning LLMs on new knowledge encourage hallucinations? In Yaser Al- Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7765–7784, Miami, Florida, USA, Nove...
work page 2024
-
[66]
R-tuning: Instructing large language models to say ‘I don‘t know’
Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘I don‘t know’. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan...
work page 2024
-
[67]
Don‘t hallucinate, abstain: Identifying LLM knowledge gaps via multi-LLM col- laboration
Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don‘t hallucinate, abstain: Identifying LLM knowledge gaps via multi-LLM col- laboration. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers)...
work page 2024
-
[68]
Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[69]
Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don‘t know? In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, July 2023. Association for Computational Linguistics
-
[70]
Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models
Alfonso Amayuelas, Kyle Wong, Liangming Pan, Wenhu Chen, and William Yang Wang. Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 6416–6432, Bangkok, Thailand, August 2024. Association ...
-
[71]
The internal state of an LLM knows when it’s lying
Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore, December 2023. Association for Computational Linguistics
-
[72]
LLM internal states reveal hallucination risk faced with a query
Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, and Pascale Fung. LLM internal states reveal hallucination risk faced with a query. In Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, and Hanjie Chen, editors, Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Network...
-
[73]
Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, 2023
-
[74]
Teaching Models to Express Their Uncertainty in Words
Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334, 2022
-
[75]
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063, 2023
-
[76]
SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models
Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Comput...
-
[77]
SaySelf: Teaching LLMs to express confidence with self-reflective rationales
Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, and Jing Gao. SaySelf: Teaching LLMs to express confidence with self-reflective rationales. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5985–5998, Miami, Florida, USA,...
-
[78]
UAlign: Leveraging uncertainty estimations for factuality alignment on large language models, 2024
Boyang Xue, Fei Mi, Qi Zhu, Hongru Wang, Rui Wang, Sheng Wang, Erxin Yu, Xuming Hu, and Kam-Fai Wong. UAlign: Leveraging uncertainty estimations for factuality alignment on large language models, 2024
-
[79]
Exploring collaboration mechanisms for LLM agents: A social psychology view
Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. Exploring collaboration mechanisms for LLM agents: A social psychology view. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14544–14607, Bangkok, Th...
- [80]
-
[81]
QwQ-32B: Embracing the power of reinforcement learning, March 2025
Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025
-
[82]
Generate rather than retrieve: Large language models are strong context generators
Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. Generate rather than retrieve: Large language models are strong context generators. In The Eleventh International Conference on Learning Representations