arxiv: 2506.00886 · v4 · submitted 2025-06-01 · 💻 cs.AI

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Pith reviewed 2026-05-19 11:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords tool use in agents · epistemic necessity · uncertainty calibration · internal reasoning · overthinking · agent frameworks · Theory of Agent

The pith

Agents should invoke external tools only when internal reasoning cannot reliably complete the task alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This position paper argues that agents waste effort and slow their own progress when they reach for external tools without genuine need. Epistemic necessity is defined as the situation in which the agent's current internal reasoning over its context cannot finish the task reliably without outside interaction. The authors propose the Theory of Agent framework, which frames every step as a choice between resolving remaining uncertainty inside the model or delegating it outward. If the claim holds, agents would avoid habitual overthinking and overacting by learning better when to stop and decide internally. Readers would care because the rule directly targets inefficiency while supporting the gradual strengthening of built-in reasoning rather than endless external calls.

Core claim

The paper's central position is that agents should invoke external tools only when epistemically necessary, where epistemic necessity means a task cannot be completed reliably via the agent's internal reasoning over its current context without any external interaction. The Theory of Agent (ToA) framework models agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common failure modes such as overthinking and overacting arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. This supplies a normative criterion for tool use that works as a complement to existing decision-theoretic models.

What carries the argument

The Theory of Agent (ToA) framework, which treats every agent step as a sequential decision on whether to resolve remaining uncertainty internally or delegate it externally.
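The sequential choice described above can be pictured with a small control loop. This is only an illustrative sketch, not the authors' method: the paper specifies no decision procedure, so `run_agent`, the three callables it takes, the `threshold` of 0.8, and the `max_tool_calls` budget are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str        # "internal" or "external"
    confidence: float  # estimate made before the action was chosen


def run_agent(context, estimate_confidence, reason_internally, call_tool,
              threshold=0.8, max_tool_calls=3):
    """Delegate externally only while estimated confidence stays below threshold.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    trace = []
    for _ in range(max_tool_calls):
        confidence = estimate_confidence(context)
        if confidence >= threshold:
            break  # the current context is judged sufficient
        trace.append(Step("external", confidence))
        context = context + "\n" + call_tool(context)
    # Answer internally, either because confidence was sufficient
    # or because the tool-call budget ran out.
    trace.append(Step("internal", estimate_confidence(context)))
    return reason_internally(context), trace


# Toy stand-ins, purely for demonstration.
def toy_confidence(ctx):
    return 0.9 if "capital: Paris" in ctx else 0.3


def toy_reason(ctx):
    return "Paris" if "capital: Paris" in ctx else "unsure"


def toy_tool(ctx):
    return "capital: Paris"


answer, trace = run_agent("Q: capital of France?",
                          toy_confidence, toy_reason, toy_tool)
known_answer, known_trace = run_agent("capital: Paris\nQ: capital of France?",
                                      toy_confidence, toy_reason, toy_tool)
```

In the toy run, the first call makes one external step before answering, while the second answers internally at once because the context already contains what is needed. Everything hinges on how good `estimate_confidence` is, which is exactly the calibration problem the paper highlights.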

If this is right

  • Training would shift focus toward teaching agents to recognize when internal resolution is still possible.
  • Evaluation would add explicit penalties or metrics for unnecessary external calls.
  • Agent architectures would embed checks that ask whether the current context already supports a reliable internal answer.
  • Long-term agent capability would improve because repeated internal resolution strengthens the model's own reasoning.
  • System costs would drop by eliminating tool invocations that add no new information.
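One way to picture the evaluation bullet above is a task score that subtracts a cost for each tool call beyond what was needed. The sketch below is an assumption-laden illustration rather than a metric from the paper: the `Episode` record, the `necessary_calls` oracle label, and the `penalty` weight of 0.1 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    correct: bool         # did the agent complete the task?
    tool_calls: int       # external calls the agent actually made
    necessary_calls: int  # hypothetical oracle label: calls that were needed


def unnecessary_calls(ep):
    """Calls made beyond the (assumed) necessary number, never negative."""
    return max(0, ep.tool_calls - ep.necessary_calls)


def penalized_score(episodes, penalty=0.1):
    """Mean of task success minus a per-call penalty for unnecessary calls."""
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = sum(
        (1.0 if ep.correct else 0.0) - penalty * unnecessary_calls(ep)
        for ep in episodes
    )
    return total / len(episodes)


episodes = [
    Episode(correct=True, tool_calls=3, necessary_calls=1),   # two wasted calls
    Episode(correct=True, tool_calls=0, necessary_calls=0),   # solved internally
    Episode(correct=False, tool_calls=2, necessary_calls=2),  # failed, no waste
]
score = penalized_score(episodes)
```

The hard part in practice is obtaining `necessary_calls`; the sketch simply assumes such a label exists and says nothing about how to produce it.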

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same necessity test could be applied to decide when an agent should generate new internal training data instead of querying outside sources.
  • Benchmarks could track whether agents under this rule show measurable gains in zero-tool accuracy over successive tasks.
  • The principle may extend to multi-agent settings where one agent decides whether to query another or reason alone.
  • Deployment pipelines could log uncertainty levels at each step to audit and refine the calibration process.

Load-bearing premise

Common agent failures such as overthinking and overacting come from miscalibrated uncertainty decisions rather than from weak reasoning or poor tool use.

What would settle it

A side-by-side test in which one set of agents is restricted to tool calls only after internal reasoning has been shown insufficient, while a control set uses tools freely, then measuring both immediate task success and later performance on tool-free versions of similar tasks.

Figures

Figures reproduced from arXiv: 2506.00886 by Amos Storkey, Boyang Xue, Cheng Qian, Heng Ji, Hongru Wang, Jiahao Qiu, Kam-Fai Wong, Manling Li, Mengdi Wang.

Figure 1: Conceptual framework of agent decision-making based on tool use and knowledge bound...
Figure 2: The tool use decision boundary of agent should align with its knowledge boundary. This...
Figure 3: Training should dynamically adjust the decision boundary relative to the fixed knowledge...
Figure 4: Inference-time alignment depends on real-time expansion of the knowledge boundary...
Figure 5: A high-level illustration of Lemma 1.1 for a specific model...
Figure 6: A high-level illustration of Lemma 2.2 for the all models...
Original abstract

As large language models evolve into tool-augmented agents, a central question remains unresolved: when is external tool use actually justified? Existing agent frameworks typically treat tools as ordinary actions and optimize for task success or reward, offering little principled distinction between epistemically necessary interaction and unnecessary delegation. This position paper argues that agents should invoke external tools only when epistemically necessary. Here, epistemic necessity means that a task cannot be completed reliably via the agent's internal reasoning over its current context, without any external interaction. We introduce the Theory of Agent (ToA), a framework that treats agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common agent failure modes (e.g., overthinking and overacting) arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. We further discuss implications for training, evaluation, and agent design, highlighting that unnecessary delegation not only causes inefficiency but can impede the development of internal reasoning capability. Our position provides a normative criterion for tool use that complements existing decision-theoretic models and is essential for building agents that are not only correct, but increasingly intelligent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript is a position paper arguing that tool-augmented agents should invoke external tools only when epistemically necessary, defined as cases where a task cannot be completed reliably via the agent's internal reasoning over its current context without external interaction. It introduces the Theory of Agent (ToA) framework, which models agent behavior as sequential decisions under uncertainty about whether to resolve remaining uncertainty internally or delegate externally. The paper claims that failures such as overthinking and overacting stem from miscalibrated uncertainty decisions rather than deficiencies in reasoning or tool use, and discusses implications for training, evaluation, and design to avoid unnecessary delegation that could hinder internal capability development.

Significance. If the normative position holds, it supplies a principled criterion for tool invocation that complements reward- or success-optimized agent frameworks, with potential to improve efficiency, reduce unnecessary external calls, and encourage the development of stronger internal reasoning in agents. The conceptual framing via ToA offers a fresh perspective on agent decision-making under uncertainty that could inform future design and evaluation protocols.

major comments (1)
  1. [Theory of Agent (ToA) framework] The ToA framework (described in the main text following the abstract) treats agents as making sequential decisions on uncertainty resolution but provides no operational mechanism—such as an uncertainty metric, confidence threshold, self-evaluation protocol, or decision procedure—for determining epistemic necessity or sufficiency of internal reasoning. This renders the central recommendation non-actionable in practice and introduces a risk of circularity, since assessing whether internal reasoning suffices may itself require external interaction or tools.
minor comments (1)
  1. [Abstract] The abstract and introduction could more clearly distinguish the proposed normative criterion from existing decision-theoretic approaches to set expectations for readers familiar with RL or planning literature.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our position paper. We address the major comment on the Theory of Agent framework below, acknowledging its conceptual nature while outlining a targeted revision.

Point-by-point responses
  1. Referee: The ToA framework (described in the main text following the abstract) treats agents as making sequential decisions on uncertainty resolution but provides no operational mechanism—such as an uncertainty metric, confidence threshold, self-evaluation protocol, or decision procedure—for determining epistemic necessity or sufficiency of internal reasoning. This renders the central recommendation non-actionable in practice and introduces a risk of circularity, since assessing whether internal reasoning suffices may itself require external interaction or tools.

    Authors: We agree that the ToA framework is presented at a high-level conceptual stage, without concrete operational mechanisms such as explicit uncertainty metrics, thresholds, or self-evaluation protocols. This is consistent with the scope of a position paper, whose primary aim is to articulate a normative criterion for tool invocation rather than to deliver a ready-to-implement algorithm. The risk of circularity is a substantive concern that merits explicit discussion. To strengthen the manuscript, we will add a dedicated subsection under the ToA framework that outlines plausible pathways for operationalization. These include (i) leveraging existing LLM calibration and self-consistency methods to estimate internal sufficiency, (ii) defining epistemic necessity via a threshold on predictive entropy or token-level uncertainty, and (iii) running iterative internal verification loops that avoid external calls until a stopping criterion is met. We will also note that any such mechanism remains an open research question and that the framework itself is intended to guide the design of such procedures rather than presuppose them.

    revision: yes
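Pathways (i) and (ii) in the rebuttal can be combined in a toy form: sample several internal answers, measure how much they disagree, and treat high disagreement as a signal that a tool call may be warranted. The sketch below is a hedged illustration only. It uses the empirical entropy of sampled final answers as a stand-in for predictive entropy, and the function names and the 0.5-bit threshold are hypothetical; neither the paper nor the rebuttal fixes any of these.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of sampled answers."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def tool_call_warranted(samples, max_entropy_bits=0.5):
    """Hypothetical rule: delegate externally when sampled answers disagree too much."""
    return answer_entropy(samples) > max_entropy_bits


consistent = ["4", "4", "4", "4"]  # samples agree: stay internal
scattered = ["a", "b", "c", "d"]   # samples disagree: consider a tool
decide_consistent = tool_call_warranted(consistent)
decide_scattered = tool_call_warranted(scattered)
```

Agreement among samples is not the same as correctness: a model can be consistently wrong, so a rule like this addresses calibration only to the extent that sample agreement tracks reliability. That gap is part of the open question the authors acknowledge.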

Circularity Check

0 steps flagged

No significant circularity in the conceptual framework

The paper is a position paper that introduces a normative criterion for tool invocation based on epistemic necessity, defined directly as the inability to complete a task reliably via internal reasoning over current context. It frames agent behavior via the Theory of Agent as sequential decisions under uncertainty but provides no equations, fitted parameters, or derivations that reduce any claim to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are renamed. The argument remains self-contained as a perspective that complements existing models without forcing the central claim through definitional equivalence or statistical forcing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The position rests on conceptual assumptions about agent decision-making under uncertainty and the causes of common failure modes, without independent empirical or formal grounding supplied in the abstract.

axioms (1)
  • domain assumption: Agents make sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally.
    This is the core modeling choice of the Theory of Agent framework introduced in the abstract.
invented entities (1)
  • Theory of Agent (ToA): no independent evidence
    purpose: Framework for modeling agent decisions on internal versus external uncertainty resolution.
    Newly introduced perspective in the paper; no independent evidence or falsifiable prediction outside the position itself.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    Terminal-World is a skill-based synthesis pipeline that generates 5,723 training environments and produces Terminal-World-32B which outperforms baselines on Terminal-Bench 2.0 using only 1.2% of the data.

  2. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.

  3. MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security

    cs.CR · 2026-04 · conditional · novelty 7.0

    MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.

  4. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.

  5. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI · 2025-09 · accept · novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 4 Pith papers · 6 internal anchors

  1. [1]

    Governing ai agents

    Noam Kolt. Governing ai agents. arXiv preprint arXiv:2501.07913, 2025

  2. [2]

    Travelplanner: A benchmark for real-world planning with language agents

    Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622, 2024

  3. [3]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  4. [4]

    Pan, and Kam-Fai Wong

    Hongru Wang, Rui Wang, Boyang Xue, Heming Xia, Jingtao Cao, Zeming Liu, Jeff Z. Pan, and Kam-Fai Wong. AppBench: Planning of multiple APIs from various APPs for complex user instruction. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15322–15336, M...

  5. [5]

    Ui-tars: Pioneering automated gui interaction with native agents, 2025

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

  6. [6]

    Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools,

    Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research. arXiv preprint arXiv:2502.04644, 2025

  7. [7]

    Glad: Synergizing molecular graphs and language descriptors for enhanced power conver- sion efficiency prediction in organic photovoltaic devices

    Thao Nguyen, Tiara Torres-Flores, Changhyun Hwang, Carl Edwards, Ying Diao, and Heng Ji. Glad: Synergizing molecular graphs and language descriptors for enhanced power conver- sion efficiency prediction in organic photovoltaic devices. In Proc. 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024), 2024. 10

  8. [8]

    Synergpt: In-context learning for personalized drug synergy prediction and drug design

    Carl Edwards, Aakanksha Naik, Tushar Khot, Martin Burke, Heng Ji, and Tom Hope. Synergpt: In-context learning for personalized drug synergy prediction and drug design. In Proc. 1st Conference on Language Modeling (COLM2024), 2024

  9. [9]

    Grzybowski, Bowen Jin, Ying Diao, Jiawei Han, Ge Liu, Hao Peng, Martin D

    Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Chetan Kumar Prasad, Sara Szymkuc, Bartosz A. Grzybowski, Bowen Jin, Ying Diao, Jiawei Han, Ge Liu, Hao Peng, Martin D. Burke, and Heng Ji. mclm: A function-infused and synthesis-friendly modular chemical language model. In arxiv, 2025

  10. [10]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  11. [11]

    Self-discover: Large language models self-compose reasoning structures

    Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V Le, Ed Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. Self-discover: Large language models self-compose reasoning structures. Advances in Neural Information Processing Systems, 37:126032–126058, 2024

  12. [12]

    Tora: A tool-integrated reasoning agent for mathematical problem solving

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen, et al. Tora: A tool-integrated reasoning agent for mathematical problem solving. In The Twelfth International Conference on Learning Representations

  13. [13]

    Start: Self-taught reasoner with tools, 2025

    Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, and Dayiheng Liu. Start: Self-taught reasoner with tools, 2025

  14. [14]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022

  15. [15]

    Reasoning with Language Model is Planning with World Model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhit- ing Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023

  16. [16]

    Position: Foundation agents as the paradigm shift for decision making.arXiv preprint arXiv:2405.17009, 2024

    Xiaoqian Liu, Xingzhou Lou, Jianbin Jiao, and Junge Zhang. Position: Foundation agents as the paradigm shift for decision making. arXiv preprint arXiv:2405.17009, 2024

  17. [17]

    A survey on large language model based autonomous agents

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), March 2024

  18. [18]

    The ecological approach to visual perception: classic edition

    James J Gibson. The ecological approach to visual perception: classic edition. Psychology press, 2014

  19. [19]

    What are cognitive tools? In Cognitive tools for learning , pages 1–6

    David H Jonassen. What are cognitive tools? In Cognitive tools for learning , pages 1–6. Springer, 1992

  20. [20]

    Springer, 1992

    Piet AM Kommers, David H Jonassen, and J Terry Mayes.Cognitive tools for learning. Springer, 1992

  21. [21]

    Self-reasoning language models: Unfold hidden reasoning chains with few reasoning catalyst

    W ANG Hongru, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z Pan, Zeming Liu, and Kam-Fai Wong. Self-reasoning language models: Unfold hidden reasoning chains with few reasoning catalyst. In Workshop on Reasoning and Planning for Large Language Models, 2025

  22. [22]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  23. [23]

    Chameleon: Plug-and-play compositional reasoning with large language models

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36:43447–43478, 2023

  24. [24]

    OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

    Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. Octo- tools: An agentic framework with extensible tools for complex reasoning. arXiv preprint arXiv:2502.11271, 2025. 11

  25. [25]

    Tpe: Towards better compositional reasoning over cognitive tools via multi-persona collaboration

    Hongru Wang, Huimin Wang, Lingzhi Wang, Minda Hu, Rui Wang, Boyang Xue, Yongfeng Huang, and Kam-Fai Wong. Tpe: Towards better compositional reasoning over cognitive tools via multi-persona collaboration. In Natural Language Processing and Chinese Computing: 13th National CCF Conference, NLPCC 2024, Hangzhou, China, November 1–3, 2024, Proceedings, Part II...

  26. [26]

    Cue-CoT: Chain-of-thought prompting for responding to in-depth dialogue questions with LLMs

    Hongru Wang, Rui Wang, Fei Mi, Yang Deng, Zezhong Wang, Bin Liang, Ruifeng Xu, and Kam-Fai Wong. Cue-CoT: Chain-of-thought prompting for responding to in-depth dialogue questions with LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12047–12064, Singapore, December 20...

  27. [27]

    Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings

    Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36:45870–45894, 2023

  28. [28]

    Albrecht, Peter Bell, and Amos Storkey

    Dongge Han, Trevor McInroe, Adam Jelley, Stefano V . Albrecht, Peter Bell, and Amos Storkey. LLM-personalize: Aligning LLM planners with human preferences via reinforced self-training for housekeeping robots. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al- Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st Inter...

  29. [30]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI Team. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  30. [31]

    Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025

  31. [32]

    Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural ...

  32. [34]

    Knowledgeable or educated guess? revisiting language models as knowledge bases

    Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, and Jin Xu. Knowledgeable or educated guess? revisiting language models as knowledge bases. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joi...

  33. [35]

    Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP), pages 5418–5426, Online, November 2020. Association for Computational Linguistics

  34. [36]

    A comprehensive survey of continual learning: Theory, method and application

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

  35. [37]

    Yu, and Jianfeng Gao

    Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, and Jianfeng Gao. A survey on post-training of ...

  36. [38]

    Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. 12

  37. [39]

    Physics of language models: Part 3.3, knowledge capacity scaling laws

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. In The Thirteenth International Conference on Learning Representations, 2025

  38. [40]

    Investigating the factual knowledge boundary of large language models with retrieval augmentation

    Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Investigating the factual knowledge boundary of large language models with retrieval augmentation. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International C...

  39. [41]

    Knowledge boundary of large language models: A survey, 2024

    Moxin Li, Yong Zhao, Yang Deng, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, and Tat-Seng Chua. Knowledge boundary of large language models: A survey, 2024

  40. [42]

    Benchmarking knowledge boundary for large language models: A different perspective on model evaluation

    Xunjian Yin, Xu Zhang, Jie Ruan, and Xiaojun Wan. Benchmarking knowledge boundary for large language models: A different perspective on model evaluation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 2270–2286, Bangkok, Thai...

  41. [43]

    Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. Can we edit factual knowledge by in-context learning? In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4862–4876, Singapore, December 2023. Association for Computa- tiona...

  42. [44]

    Editing language model-based knowledge graph embeddings

    Siyuan Cheng, Ningyu Zhang, Bozhong Tian, Xi Chen, Qingbin Liu, and Huajun Chen. Editing language model-based knowledge graph embeddings. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17835–17843, 2024

  43. [45]

    InstructEd: Soft- instruction tuning for model editing with hops

    XiaoQi Han, Ru Li, Xiaoli Li, Jiye Liang, Zifang Zhang, and Jeff Pan. InstructEd: Soft- instruction tuning for model editing with hops. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 14953–14968, Bangkok, Thailand, August 2024. Association for Computational Linguistics

  44. [46]

    Knowledge editing for large language models: A survey, 2024

    Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. Knowledge editing for large language models: A survey, 2024

  45. [47]

    Editing large language models: Problems, methods, and opportunities

    Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. Editing large language models: Problems, methods, and opportunities. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10222–10240, Singapore, December

  46. [48]

    Association for Computational Linguistics

  47. [49]

    Self-DC: When to reason and when to act? self divide-and- conquer for compositional unknown questions

    Hongru Wang, Boyang Xue, Baohang Zhou, Tianhua Zhang, Cunxiang Wang, Huimin Wang, Guanhua Chen, and Kam-Fai Wong. Self-DC: When to reason and when to act? self divide-and- conquer for compositional unknown questions. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associ...

  48. [50]

    Smart: Self-aware agent for tool overuse mitigation, 2025

    Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Smart: Self-aware agent for tool overuse mitigation, 2025

  49. [51]

    AbouElhamayed, Yueying Li, and Mohamed S

    Yash Akhauri, Anthony Fei, Chi-Chih Chang, Ahmed F. AbouElhamayed, Yueying Li, and Mohamed S. Abdelfattah. Splitreason: Learning to offload reasoning, 2025

  50. [52]

    Thompson

    Rakefet Ackerman and Valerie A. Thompson. Meta-reasoning: Monitoring and control of thinking and reasoning. Trends in Cognitive Sciences, 21(8):607–617, 2017

  51. [53]

    Does fine-tuning llms on new knowledge encourage hallucinations? arXiv preprint arXiv:2405.05904, 2024

    Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations? arXiv preprint arXiv:2405.05904, 2024. 13

  52. [54]

    Otc: Optimal tool calls via reinforcement learning, 2025

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Otc: Optimal tool calls via reinforcement learning, 2025

  53. [55]

    Training language models to reason efficiently, 2025

    Daman Arora and Andrea Zanette. Training language models to reason efficiently, 2025

  54. [56]

    Embodied agent interface: Benchmarking llms for embodied decision making

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making. Advances in Neural Information Processing Systems, 37:100428–100534, 2024

  55. [57]

    Torl: Scaling tool-integrated rl, 2025

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl, 2025

  56. [58]

    Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023

    Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023

  57. [59]

    Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

    Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, and Mengdi Wang. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

  58. [60]

    On path to multimodal historical reasoning: Histbench and histagent, 2025

    Jiahao Qiu, Fulian Xiao, Yimin Wang, Yuchen Mao, Yijia Chen, Xinzhe Juan, Siran Wang, Xuan Qi, Tongcheng Zhang, Zixin Yao, Jiacheng Guo, Yifu Lu, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Ling Yang, Shilong Liu, Hongru Wang, Kaixuan Huang, Xun Jiang, Yuming Cao, Yue Chen, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu...

  59. [61]

    AFlow: Automating Agentic Workflow Generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024

  60. [62]

    Toolllm: Facilitating large language models to master 16000+ real-world apis

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The Twelfth International Conference on Learning Representations

  61. [63]

    Agent-r: Training language model agents to reflect via iterative self-training. arXiv preprint arXiv:2501.11425, 2025

    Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, and Jiecao Chen. Agent-r: Training language model agents to reflect via iterative self-training. arXiv preprint arXiv:2501.11425, 2025

  62. [64]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  63. [65]

    Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning LLMs on new knowledge encourage hallucinations? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7765–7784, Miami, Florida, USA, Nove...

  64. [66]

    R-tuning: Instructing large language models to say ‘I don’t know’

    Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. R-tuning: Instructing large language models to say ‘I don’t know’. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan...

  65. [67]

    Don’t hallucinate, abstain: Identifying LLM knowledge gaps via multi-LLM collaboration

    Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying LLM knowledge gaps via multi-LLM collaboration. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)...

  66. [68]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  67. [69]

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don’t know? In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada, July 2023. Association for Computational Linguistics

  68. [70]

    Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models

    Alfonso Amayuelas, Kyle Wong, Liangming Pan, Wenhu Chen, and William Yang Wang. Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 6416–6432, Bangkok, Thailand, August 2024. Association ...

  69. [71]

    The internal state of an LLM knows when it’s lying

    Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore, December 2023. Association for Computational Linguistics

  70. [72]

    LLM internal states reveal hallucination risk faced with a query

    Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, and Pascale Fung. LLM internal states reveal hallucination risk faced with a query. In Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, and Hanjie Chen, editors, Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Network...

  71. [73]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, 2023

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, 2023

  72. [74]

    Teaching Models to Express Their Uncertainty in Words

    Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334, 2022

  73. [75]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023

  74. [76]

    SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Comput...

  75. [77]

    SaySelf: Teaching LLMs to express confidence with self-reflective rationales

    Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, and Jing Gao. SaySelf: Teaching LLMs to express confidence with self-reflective rationales. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5985–5998, Miami, Florida, USA,...

  76. [78]

    Ualign: Leveraging uncertainty estimations for factuality alignment on large language models, 2024

    Boyang Xue, Fei Mi, Qi Zhu, Hongru Wang, Rui Wang, Sheng Wang, Erxin Yu, Xuming Hu, and Kam-Fai Wong. Ualign: Leveraging uncertainty estimations for factuality alignment on large language models, 2024

  77. [79]

    Exploring collaboration mechanisms for LLM agents: A social psychology view

    Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. Exploring collaboration mechanisms for LLM agents: A social psychology view. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14544–14607, Bangkok, Th...

  78. [80]

    Openai o1 system card, 2024

    OpenAI Team. Openai o1 system card, 2024

  79. [81]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  80. [82]

    Generate rather than retrieve: Large language models are strong context generators

    Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. Generate rather than retrieve: Large language models are strong context generators. In The Eleventh International Conference on Learning Representations

Showing first 80 references.