Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

Hengyu Shi; Junfeng Luo; Junhao Su; Junwei Yang; Tianyang Han; Yuanliang Wan; Yurui Qiu

arXiv: 2509.18847 · v3 · submitted 2025-09-23 · cs.CV · cs.AI · cs.CL

Pith reviewed 2026-05-18 14:56 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.CL

keywords: structured reflection · tool-augmented LLMs · error recovery · multi-turn tool calls · reinforcement learning for agents · Tool-Reflection-Bench · DAPO and GSPO objectives

The pith

Structured reflection turns LLM tool failures into explicit diagnoses and proposed fixes that raise multi-turn success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vague self-reflection prompts lead agents to repeat mistakes in tool-augmented interactions, whereas recasting reflection as a short, evidence-based diagnosis followed by a corrected call yields a trainable step that agents can optimize directly. A reader would care because tool-using agents are increasingly common yet brittle in extended conversations, and this method gives a concrete way to learn recovery instead of relying on imitation alone. Training combines DAPO and GSPO objectives with rewards that score reflection quality, call executability, and outcome consistency. Evaluation uses BFCL v3 and the new Tool-Reflection-Bench, which programmatically checks structural validity, executability, parameter correctness, and result consistency on mini trajectories. The paper reports large gains in multi-turn success and error recovery, and fewer redundant calls.
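The summary above names DAPO and GSPO but does not state how the two are combined. Purely as orientation, the sketch below computes a GSPO-style length-normalized sequence importance ratio and a clipped surrogate with separate lower and upper clip ranges, the asymmetric clipping associated with DAPO. It is not the paper's objective: the function names, the `eps_low` and `eps_high` defaults, and the way the pieces are wired together are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # Every rollout scored the same, so the group carries no signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Sequence-level ratio: exp of the mean per-token log-prob difference."""
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("need two non-empty, equal-length log-prob lists")
    total = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(total / len(new_logps))


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped policy-gradient term; the epsilon defaults are placeholders."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def group_objective(new_logps, old_logps, rewards):
    """Average clipped surrogate over a rollout group (to be maximized)."""
    advantages = group_advantages(rewards)
    terms = [
        clipped_surrogate(sequence_ratio(n, o), a)
        for n, o, a in zip(new_logps, old_logps, advantages)
    ]
    return sum(terms) / len(terms)
```

Working at the sequence level means each rollout (reflection, call, and final answer together) receives one ratio and one advantage, which fits a reward that scores the whole Reflect, Call, Final episode.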

Core claim

The authors claim that explicit structured reflection—where the model first diagnoses the prior failure using concrete evidence from the last step and then outputs a correct, executable follow-up tool call—when optimized jointly with DAPO and GSPO plus a tool-specific reward, produces reliable multi-turn tool interaction and lets agents learn error repair systematically rather than through heuristic prompting.

What carries the argument

The Reflect-then-Call-then-Final sequence, where reflection is a short diagnosis-plus-proposal action optimized by the combined DAPO-GSPO objectives and tailored reward.
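One way to picture the Reflect, Call, Final sequence is as a small typed trajectory. The sketch below is a hypothetical data model, not the paper's format: the class names, the `get_weather` tool, and the rule that the executed call must equal the proposed call are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Reflection:
    diagnosis: str           # what failed, citing evidence from the prior step
    proposed_call: ToolCall  # the corrected call the agent intends to make


@dataclass
class Step:
    kind: str        # "reflect", "call", or "final"
    payload: object  # Reflection, ToolCall, or the final answer text


EXPECTED_ORDER = ["reflect", "call", "final"]


def is_well_formed(steps):
    """Check step order and that the call repeats the reflection's proposal."""
    if [s.kind for s in steps] != EXPECTED_ORDER:
        return False
    reflection, call = steps[0].payload, steps[1].payload
    return isinstance(reflection, Reflection) and reflection.proposed_call == call


# Hypothetical episode: the previous call used an unsupported unit.
error = "invalid value for 'unit': expected 'celsius' or 'fahrenheit'"
fix = ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})
episode = [
    Step("reflect", Reflection(f"Previous call failed: {error}", fix)),
    Step("call", fix),
    Step("final", "The weather lookup succeeded with the corrected unit."),
]
```

Keeping the diagnosis and the proposed call in one object makes the reflection something a reward function can inspect directly, rather than free-form text.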

If this is right

Multi-turn tool-call success rises and redundant calls drop when reflection is treated as an explicit, rewarded action.
Agents trained this way recover from failures more often by diagnosing with prior evidence rather than repeating the same call.
The approach supplies a reproducible training path for learning repair strategies instead of depending on coarse imitation or vague prompts.
Disjoint train-test splits in Tool-Reflection-Bench allow direct measurement of whether the reflection skill generalizes within the benchmark distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same explicit-reflection loop could be tested on non-tool agent settings such as web navigation or code editing to see whether diagnosis-plus-fix transfers.
If the method scales, it suggests a lighter alternative to full human preference data by letting the agent generate its own recovery examples during rollout.
A natural extension would be to let the reflection step also revise earlier assumptions in the trajectory, turning single-step repair into short-horizon replanning.

Load-bearing premise

That training on short, evidence-linked reflections with the chosen reward will produce error-recovery behavior that transfers to real-world tool tasks instead of overfitting to the benchmark trajectories.

What would settle it

Measure error-recovery rate on a fresh collection of multi-turn tool tasks drawn from APIs and workflows absent from both BFCL v3 and Tool-Reflection-Bench; if the structured-reflection model shows no gain over a plain call baseline, the central claim is falsified.
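The comparison described above can be sketched as a small metric. The episode fields `first_call_failed` and `final_success`, and the function names, are hypothetical; the paper's own metric definitions are not given in this summary.

```python
def recovery_rate(episodes):
    """Share of episodes with a failed first call that still end in success."""
    failed_first = [e for e in episodes if e["first_call_failed"]]
    if not failed_first:
        return None  # recovery is undefined when nothing failed
    recovered = sum(1 for e in failed_first if e["final_success"])
    return recovered / len(failed_first)


def recovery_gain(reflection_episodes, baseline_episodes):
    """Difference in recovery rate; a value near zero would count against the claim."""
    reflect = recovery_rate(reflection_episodes)
    base = recovery_rate(baseline_episodes)
    if reflect is None or base is None:
        return None
    return reflect - base
```

Only episodes whose first call failed enter the denominator, so the metric isolates repair behavior from first-try accuracy.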

Figures

Figures reproduced from arXiv: 2509.18847 by Hengyu Shi, Junfeng Luo, Junhao Su, Junwei Yang, Tianyang Han, Yuanliang Wan, Yurui Qiu.

**Figure 2.** We illustrate the effectiveness of our method with an example. As shown in the figure, the …

**Figure 3.** The reward curves of Llama-3.1-8B and Qwen2.5-7B during training, showing an overall …

Original abstract

Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to 'think more' instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.
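The four programmatic checks named in the abstract (structural validity, executability, parameter correctness, result consistency) could look roughly like the sketch below. This is an illustration under assumptions, not Tool-Reflection-Bench's actual checker: the call format, the schema layout, and the toy `add` tool are all hypothetical.

```python
def check_structure(call):
    """Structural validity: a dict with a string name and a dict of arguments."""
    return (
        isinstance(call, dict)
        and isinstance(call.get("name"), str)
        and isinstance(call.get("arguments"), dict)
    )


def check_parameters(call, schema):
    """Parameter correctness: required keys present, every value of a known type."""
    args = call["arguments"]
    if any(key not in args for key in schema["required"]):
        return False
    types = schema["types"]
    return all(k in types and isinstance(v, types[k]) for k, v in args.items())


def run_call(call, tools):
    """Executability: return (ran_without_error, result)."""
    fn = tools.get(call["name"])
    if fn is None:
        return False, None
    try:
        return True, fn(**call["arguments"])
    except Exception:
        return False, None


def score_call(call, schema, tools, expected):
    """Run all four checks and report each one separately."""
    report = {"structure": False, "parameters": False,
              "executable": False, "consistent": False}
    if not check_structure(call):
        return report
    report["structure"] = True
    report["parameters"] = check_parameters(call, schema)
    ran, result = run_call(call, tools)
    report["executable"] = ran
    report["consistent"] = ran and result == expected
    return report


# Toy mini trajectory: an erroneous call and its corrected follow-up.
tools = {"add": lambda a, b: a + b}
schema = {"required": ["a", "b"], "types": {"a": int, "b": int}}
erroneous = {"name": "add", "arguments": {"a": 1, "b": "2"}}
corrected = {"name": "add", "arguments": {"a": 1, "b": 2}}
```

Reporting each check separately, rather than a single pass/fail, shows which part of a repair went wrong.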

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper turns reflection into an explicit trainable step for tool agents and backs it with a programmatic benchmark, but the gains may not transfer beyond the bench's controlled error patterns.

The core contribution is making reflection a distinct, optimizable action: the model first diagnoses the prior tool-call failure with evidence, then outputs a corrected call. They train this with DAPO plus GSPO and a reward mix that scores reflection quality, call correctness, and final answer. This is a direct extension of self-reflection ideas, but the explicit Reflect-then-Call format plus the tailored rewards is new enough to be worth testing.

Referee Report

2 major / 2 minor

Summary. The paper claims that structured reflection—where the agent explicitly diagnoses failures using evidence from the previous step and proposes a correct, executable follow-up call—can be made trainable and controllable. Training combines DAPO and GSPO objectives with a tailored reward scheme for the Reflect-then-Call-then-Final strategy. A new lightweight benchmark, Tool-Reflection-Bench, generates mini-trajectories of erroneous call, reflection, and corrected call that are programmatically verified for structural validity, executability, parameter correctness, and result consistency, using disjoint train/test splits. Experiments on BFCL v3 and Tool-Reflection-Bench report large gains in multi-turn tool-call success, error recovery, and reduction of redundant calls.

Significance. If the improvements hold under stronger controls for generalizability, the work offers a concrete, reproducible path for optimizing explicit reflection in tool-augmented agents, moving beyond heuristic prompts or coarse imitation learning. The verifiable benchmark design is a constructive contribution that could support more precise evaluation of error-diagnosis behavior in the field.

major comments (2)

Tool-Reflection-Bench construction: tasks are generated as short, programmatically defined trajectories (erroneous call → reflection → corrected call) with only disjoint train/test splits inside the same distribution. This risks the model internalizing the benchmark's error-generation rules and reflection format rather than acquiring robust, evidence-based diagnosis that transfers to arbitrary multi-turn failures with different error types or interaction lengths, which is load-bearing for the central claim of generalizable recovery stated in the abstract.
BFCL v3 evaluation: reported gains are noted, but the experiments do not test whether the learned reflection policy succeeds when error types, tool interfaces, or interaction lengths differ from the training distribution, leaving the generalizability assertion without direct support.

minor comments (2)

Abstract: the claim of 'large gains' is stated without any numerical values, baseline comparisons, or effect sizes; adding one or two key quantitative results would make the summary more informative.
Reward scheme: the tailored rewards for reflection quality, call correctness, and final answer are mentioned but their exact weighting and formulation are not detailed in the provided text; a short equation or table would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging valid concerns about generalizability while clarifying the design choices and outlining targeted revisions.

Referee: Tool-Reflection-Bench construction: tasks are generated as short, programmatically defined trajectories (erroneous call → reflection → corrected call) with only disjoint train/test splits inside the same distribution. This risks the model internalizing the benchmark's error-generation rules and reflection format rather than acquiring robust, evidence-based diagnosis that transfers to arbitrary multi-turn failures with different error types or interaction lengths, which is load-bearing for the central claim of generalizable recovery stated in the abstract.

Authors: We agree that the benchmark's programmatic generation and in-distribution splits represent a limitation for fully substantiating robustness to arbitrary multi-turn failures. The design prioritizes verifiable, scalable evaluation of structural validity, executability, and result consistency, with disjoint splits to avoid instance-level memorization. However, this does not directly test transfer to unseen error types or longer sequences. In the revised manuscript we will expand Tool-Reflection-Bench with additional error categories (e.g., semantic mismatches and novel tool-interface failures) and longer trajectories, reporting results on these held-out variants to provide stronger evidence for generalizable recovery.

Revision: yes

Referee: BFCL v3 evaluation: reported gains are noted, but the experiments do not test whether the learned reflection policy succeeds when error types, tool interfaces, or interaction lengths differ from the training distribution, leaving the generalizability assertion without direct support.

Authors: BFCL v3 features diverse real-world tools and naturally occurring multi-turn errors that differ in distribution from the synthetic trajectories used for training. The reported improvements in success rate and reduced redundant calls provide evidence of practical transfer. We nevertheless concur that explicit controls for shifts in error type, interface, and length would better support the generalizability claim. In revision we will add a breakdown of BFCL v3 results by interaction length and error category, plus a small-scale OOD test using modified tool interfaces, to directly address this gap.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method and disjoint evaluation are independent of inputs

The paper proposes structured reflection as an explicit trainable action optimized via DAPO/GSPO objectives plus a tailored reward for the Reflect-Call-Final sequence, then evaluates gains on BFCL v3 and the new Tool-Reflection-Bench. Tasks use programmatically verified mini-trajectories with explicitly disjoint train/test splits inside the benchmark distribution. No equations, fitted parameters, or self-referential definitions are shown that would make the reported success rates or error-recovery improvements equivalent to the training construction by definition. The central claim rests on experimental outcomes rather than any reduction to prior inputs or self-citations that bear the full load of the result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the untested assumption that explicit short reflections are both learnable and sufficient to drive genuine error recovery beyond the benchmark distribution.

free parameters (1)

Reward weights for reflection quality, call correctness, and final answer
The tailored reward scheme for the Reflect-Call-Final sequence necessarily involves chosen or fitted scalar weights.
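To see why these weights matter, consider a weighted sum of the three reward components. The sketch below is hypothetical: the paper's actual formulation is not given in this summary, and the weight values and the two example rollouts are made up to show that the choice of weights can flip which rollout is preferred.

```python
def combined_reward(reflection_quality, call_correct, final_correct, weights):
    """Weighted sum of three component scores, each assumed to lie in [0, 1]."""
    if len(weights) != 3 or any(w < 0 for w in weights):
        raise ValueError("weights must be three non-negative numbers")
    w_reflect, w_call, w_final = weights
    return (w_reflect * reflection_quality
            + w_call * call_correct
            + w_final * final_correct)


# Two made-up rollouts: A diagnoses well but calls wrongly,
# B diagnoses poorly but happens to issue a correct call.
rollout_a = (1.0, 0.0, 0.0)
rollout_b = (0.2, 1.0, 0.0)

reflection_heavy = (0.8, 0.1, 0.1)  # hypothetical weighting
call_heavy = (0.1, 0.6, 0.3)        # hypothetical weighting

a_first = combined_reward(*rollout_a, reflection_heavy)
b_first = combined_reward(*rollout_b, reflection_heavy)
a_second = combined_reward(*rollout_a, call_heavy)
b_second = combined_reward(*rollout_b, call_heavy)
```

Under the reflection-heavy weights rollout A scores higher, and under the call-heavy weights rollout B does, which is why the referee asks for the exact weighting to be reported.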

axioms (1)

Domain assumption: A short structured reflection can be produced by the model and will contain diagnostically useful evidence from the prior step.
Invoked when defining the reflection action as the trainable unit.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
cs.AI 2026-05 unverdicted novelty 7.0

TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured uplift on a frozen executor, outperforming execution-only training on math and code b...
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
cs.AI 2026-05 unverdicted novelty 6.0

TraceLift trains reasoning planners using rewards that credit traces for both rubric quality and actual performance gains on a frozen executor, outperforming final-answer-only training on math and code tasks.
Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents
cs.AI 2026-03 unverdicted novelty 5.0

ALTK supplies reusable middleware components that systematically address failure modes across the full AI agent lifecycle from request to response.

[1] [1]

Function calling in large language models: Industrial practices, challenges, and future direc- tions

MAOLIN W ANG, YINGYI ZHANG, CUNYIN PENG, YICHENG CHEN, WEI ZHOU, JIN- JIE GU, CHENYI ZHUANG, RUOCHENG GUO, BOWEN YU, W ANYU W ANG, et al. Function calling in large language models: Industrial practices, challenges, and future direc- tions. 2025

work page 2025

[2] [3]

Planning, creation, usage: Benchmark- ing llms for comprehensive tool utilization in real-world complex scenarios.arXiv preprint arXiv:2401.17167, 2024

Shijue Huang, Wanjun Zhong, Jianqiao Lu, Qi Zhu, Jiahui Gao, Weiwen Liu, Yutai Hou, Xingshan Zeng, Yasheng Wang, Lifeng Shang, et al. Planning, creation, usage: Benchmark- ing llms for comprehensive tool utilization in real-world complex scenarios.arXiv preprint arXiv:2401.17167, 2024. 24

work page arXiv 2024

[3] [4]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [5]

Tool learning with large language models: A survey

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and J Wen. Tool learning with large language models: A survey. corr abs/2405.17935(2024). arXiv preprint arXiv:2405.17935, 2024

work page arXiv 2024

[5] [6]

LLM4EDA: Emerging Progress in Large Language Models for Electronic Design Automation,

Ruizhe Zhong, Xingbo Du, Shixiong Kai, Zhentao Tang, Siyuan Xu, Hui-Ling Zhen, Jianye Hao, Qiang Xu, Mingxuan Yuan, and Junchi Yan. Llm4eda: Emerging progress in large language models for electronic design automation.arXiv preprint arXiv:2401.12224, 2023

work page arXiv 2023

[6] [7]

Equipping language model s with tool use capability for tabular data analysis in ﬁnance

Adrian Theuma and Ehsan Shareghi. Equipping language models with tool use capability for tabular data analysis in finance.arXiv preprint arXiv:2401.15328, 2024

work page arXiv 2024

[7] [8]

Large language models can plan your travels rigorously with formal verification tools.CoRR, 2024

Yilun Hao, Yongchao Chen, Yang Zhang, and Chuchu Fan. Large language models can plan your travels rigorously with formal verification tools.CoRR, 2024

work page 2024

[8] [9]

SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision- language models.arXiv preprint arXiv:2504.11468, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [10]

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [11]

Sparse rewards can self-train dialogue agents.arXiv preprint arXiv:2409.04617, 2024

Barrett Martin Lattimer, Varun Gangal, Ryan McDonald, and Yi Yang. Sparse rewards can self-train dialogue agents.arXiv preprint arXiv:2409.04617, 2024

work page arXiv 2024

[11] [12]

arXiv preprint arXiv:2503.23383 , year=

Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl.arXiv preprint arXiv:2503.23383, 2025

work page arXiv 2025

[12] [13]

Cybersecurity For Beginners

Junjie Ye, Yilong Wu, Sixian Li, Yuming Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, et al. Tl-training: A task-feature-based framework for training large language models in tool use.arXiv preprint arXiv:2412.15495, 2024


[13] [14]

Supercorrect: Supervising and correcting language models with error-driven insights. arXiv preprint arXiv:2410.09008, 9, 2024

Ling Yang, Zhaochen Yu, Tianjun Zhang, Minkai Xu, Joseph E Gonzalez, Bin Cui, and Shuicheng Yan. Supercorrect: Supervising and correcting language models with error-driven insights. arXiv preprint arXiv:2410.09008, 9, 2024


[14] [15]

Facilitating multi-turn function calling for llms via compositional instruction tuning. arXiv preprint arXiv:2410.12952, 2024

Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, and Weipeng Chen. Facilitating multi-turn function calling for llms via compositional instruction tuning. arXiv preprint arXiv:2410.12952, 2024


[15] [16]

ToolACE: Winning the points of LLM function calling

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, et al. Toolace: Winning the points of llm function calling. arXiv preprint arXiv:2409.00920, 2024


[16] [17]

The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning


[17] [18]

xlam: A family of large action models to empower ai agent systems. arXiv preprint arXiv:2409.03215, 2024

Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, et al. xlam: A family of large action models to empower ai agent systems. arXiv preprint arXiv:2409.03215, 2024


[18] [19]

FunReason: Enhancing large language models' function calling via self-refinement multiscale loss and automated data refinement

Bingguang Hao, Maolin Wang, Zengzhuang Xu, Cunyin Peng, Yicheng Chen, Xiangyu Zhao, Jinjie Gu, and Chenyi Zhuang. Funreason: Enhancing large language models’ function calling via self-refinement multiscale loss and automated data refinement. arXiv preprint arXiv:2505.20192, 2025


[19] [20]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023


[20] [21]

Acebench: Who wins the match point in tool learning? arXiv e-prints, pages arXiv–2501, 2025

Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, et al. Acebench: Who wins the match point in tool learning? arXiv e-prints, pages arXiv–2501, 2025


[21] [22]

Large Language Models Cannot Self-Correct Reasoning Yet

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023


[22] [23]

Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral, 2025

Fengyuan Liu, Nouar AlDahoul, Gregory Eady, Yasir Zaki, and Talal Rahwan. Self-reflection makes large language models safer, less biased, and ideologically neutral. arXiv preprint arXiv:2406.10400, 2024


[23] [24]

Self-reflection in llm agents: Effects on problem-solving performance

Matthew Renze and Erhan Guven. Self-reflection in llm agents: Effects on problem-solving performance. arXiv preprint arXiv:2405.06682, 2024


[24] [25]

Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023


[25] [26]

Large language models can self-correct with key condition verification. arXiv preprint arXiv:2405.14092, 2024

Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, and Meng Jiang. Large language models can self-correct with key condition verification. arXiv preprint arXiv:2405.14092, 2024


[26] [27]

Correcting hallucinations in news summaries: Exploration of self-correcting llm methods with external knowledge. arXiv preprint arXiv:2506.19607, 2025

Juraj Vladika, Ihsan Soydemir, and Florian Matthes. Correcting hallucinations in news summaries: Exploration of self-correcting llm methods with external knowledge. arXiv preprint arXiv:2506.19607, 2025


[27] [28]

Pag: Multi-turn reinforced llm self-correction with policy as generative verifier

Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, and Lin Yan. Pag: Multi-turn reinforced llm self-correction with policy as generative verifier. arXiv preprint arXiv:2506.10406, 2025


[28] [29]

Boosting llm reasoning via spontaneous self-correction. arXiv preprint arXiv:2506.06923, 2025

Xutong Zhao, Tengyu Xu, Xuewei Wang, Zhengxing Chen, Di Jin, Liang Tan, Zishun Yu, Zhuokai Zhao, Yun He, Sinong Wang, et al. Boosting llm reasoning via spontaneous self-correction. arXiv preprint arXiv:2506.06923, 2025


[29] [30]

Self-reflective retrieval-augmented generation (self-rag) in analytical systems

RI Saveliev and MV Dendiuk. Self-reflective retrieval-augmented generation (self-rag) in analytical systems. In Forestry Education and Science: Current Challenges and Development Prospects. International Science-Practical Conference, October 23-25, 2024, Lviv, Ukraine, 2024


[30] [31]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536, 2025


[31] [32]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025


[32] [33]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025


[33] [34]

Swift: a scalable lightweight infrastructure for fine-tuning

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025


[34] [35]

The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024


[35] [36]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024


[36] [37]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025


[37] [38]

Longcat-flash technical report

Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-flash technical report. arXiv preprint arXiv:2509.01322, 2025


[38] [39]

Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, May 2024

OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, May 2024. Accessed: 2025-09-25


[39] [40]

Gpt-4o system card, 2024

OpenAI. Gpt-4o system card, 2024. Accessed: 2025-09-25


[40] [41]

Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/, April 2025

OpenAI. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/, April 2025. Accessed: 2025-09-25
