JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents
Pith reviewed 2026-05-10 04:43 UTC · model grok-4.3
The pith
Jointly refining agent instructions and tool descriptions via rollout reflection improves tool selection and slot filling for LLM agents with large tool sets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JTPRO is a framework that iteratively applies rollout-driven reflection to co-optimize global instructions and per-tool schema and argument descriptions for accurate tool selection and argument instantiation in large tool inventories. The method is designed to retain only the tool-local cues required for disambiguation and slot filling. Across benchmarks that vary the number of tools, JTPRO raises overall success rate by 5-20% (relative) over strong baselines such as CoT-style agents and reflective prompt optimizers. Separate ablations confirm that optimizing instructions and tool schemas together yields more effective and robust gains than optimizing either component in isolation.
What carries the argument
Joint Tool-Prompt Reflective Optimization, an iterative process that co-optimizes a shared instruction set and individual tool schemas by reflecting on rollout traces to correct selection and instantiation errors.
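The review carries no reference implementation, so the following is a minimal Python sketch of what such a loop could look like. All names (`run_rollouts`, `reflect`, `joint_reflective_loop`) are hypothetical stand-ins rather than JTPRO's actual API, and the LLM-facing steps are stubbed.

```python
# Hypothetical sketch (not the paper's code): the shape of a joint
# tool-prompt reflective optimization loop in a trace-supervised setting.
from dataclasses import dataclass

@dataclass
class Trace:
    query: str
    chosen_tool: str
    slots: dict
    success: bool
    error_note: str = ""

def run_rollouts(instructions, tool_schemas, train_queries):
    """Execute the agent on training queries and score each trace.
    Stub: a real system would drive an LLM agent here."""
    raise NotImplementedError

def reflect(instructions, tool_schemas, failures):
    """Propose edits from failed traces via an LLM reflector, keeping
    edits tool-local: global guidance goes into `instructions`; cues for
    disambiguation and slot formatting go only into the schema of the
    tool implicated by the failure. Stub."""
    raise NotImplementedError

def joint_reflective_loop(instructions, tool_schemas, train_queries, rounds=5):
    """Co-optimize instructions and tool schemas from rollout failures."""
    for _ in range(rounds):
        traces = run_rollouts(instructions, tool_schemas, train_queries)
        failures = [t for t in traces if not t.success]
        if not failures:
            break  # nothing left to correct on the training split
        # Joint update: both components are revised from the same failure set.
        instructions, tool_schemas = reflect(instructions, tool_schemas, failures)
    return instructions, tool_schemas
```

The key design point the pith emphasizes is visible in the `reflect` contract: updates are partitioned so that only the offending tool's schema absorbs tool-specific cues, while shared guidance stays in the global instructions.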
If this is right
- Joint optimization of instructions and tool schemas produces higher overall success than optimizing either one separately.
- The framework applies in trace-supervised settings where rollout outcomes supply the signal for reflection.
- Gains appear across benchmarks that differ in total number of tools.
- Only tool-local cues are retained, reducing the risk of over-generalizing changes.
- The approach targets both tool selection accuracy and slot-filling accuracy as separate but related improvements.
Where Pith is reading between the lines
- If reflection on traces proves reliable, similar joint optimization loops could be applied to other agent components such as planning steps or memory retrieval.
- The method may lower the cost of deploying agents in new domains by automating prompt and schema tuning from a modest set of example traces.
- A natural extension would test whether the refined schemas transfer to previously unseen tools without further reflection.
- The current results rest on access to evaluation rollouts, so performance in fully open-ended agent interactions without traces remains an open question.
Load-bearing premise
Ambiguous tool descriptions and under-specified agent instructions are the primary causes of mis-selection and incorrect slot filling, and rollout-driven reflection can reliably identify and fix them without adding new biases or overfitting to the evaluation traces.
What would settle it
Run JTPRO on a fresh benchmark where tools have deliberately ambiguous descriptions and under-specified schemas; if overall success rate does not rise or falls below the CoT baseline, the central claim would be falsified.
Original abstract
Large language model (LLM) agents augmented with external tools often struggle as the number of tools grows large and the tools become domain-specific. In such settings, ambiguous tool descriptions and under-specified agent instructions frequently lead to tool mis-selection and incorrect slot/value instantiation. We hypothesize that this is due to two root causes: generic, one-size-fits-all prompts that ignore tool-specific nuances, and underspecified tool schemas that lack clear guidance on when and how to use each tool and how to format its parameters. We introduce Joint Tool-Prompt Reflective Optimization (JTPRO), a framework for improving tool-calling reliability in trace-supervised settings by iteratively using rollout-driven reflection to co-optimize global instructions and per-tool schema/argument descriptions for accurate tool selection and argument instantiation in large tool inventories. JTPRO is designed to preserve only tool-local cues needed for correct disambiguation and slot filling. We evaluate JTPRO across multi-tool benchmarks, which vary in the number of tools, using three metrics: Tool Selection Accuracy (TSA), Slot Filling Accuracy (SFA), and Overall Success Rate (OSR) (correct tool + correct slots + correct values). JTPRO consistently outperforms strong baselines, including CoT-style agents and reflective prompt optimizers such as GEPA, by 5%-20% (relative) on OSR. Ablations show that joint optimization of instructions and tool schemas is more effective and robust than optimizing either component in isolation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JTPRO, a framework for LLM agents that iteratively co-optimizes global instructions and per-tool schema/argument descriptions via rollout-driven reflection in trace-supervised settings. It targets ambiguous tool descriptions and underspecified instructions as root causes of tool mis-selection and slot-filling errors in large tool inventories, and reports consistent 5-20% relative gains in Overall Success Rate (OSR) over CoT-style agents and reflective optimizers such as GEPA across multi-tool benchmarks, with ablations favoring joint over isolated optimization.
Significance. If the empirical gains are robust, the work offers a concrete, iterative optimization procedure that preserves tool-local cues while improving disambiguation and instantiation accuracy; this could meaningfully advance reliable tool use in domain-specific agent settings where one-size-fits-all prompts fail.
major comments (3)
- [Experiments] Experiments section (and associated tables): the reported 5-20% relative OSR improvements are presented without any description of how benchmark queries or rollout traces used during iterative reflection and prompt updates are partitioned from the final held-out evaluation set. This directly bears on the central claim that gains arise from reliable correction rather than memorization of observed error patterns.
- [Ablations] Ablation studies: while joint optimization of instructions and tool schemas is claimed to be more effective than isolated optimization, the paper provides no variance estimates across independent runs, no statistical significance tests, and no error bars on the TSA/SFA/OSR metrics, making it impossible to assess whether the reported differences are reliable or could be due to optimization stochasticity.
- [Method] Method section (reflection procedure): the claim that rollout-driven reflection reliably identifies and corrects ambiguous descriptions without introducing new biases rests on the assumption that the reflection traces are representative and that the process does not overfit to the specific error distributions seen during optimization; no analysis of pre/post error types or failure modes on held-out data is supplied to support this.
minor comments (2)
- [Abstract] The abstract and introduction should explicitly state the number of tools and domains in each benchmark to allow readers to gauge the scale at which the method is tested.
- [Evaluation Metrics] Notation for the three metrics (TSA, SFA, OSR) is introduced, but the precise definition of 'correct values' in OSR is not formalized; a short equation or pseudocode would improve clarity (one possible sketch follows below).
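One possible formalization, assuming per-example gold annotations for the target tool, its slot names, and their values. This is an editorial sketch, not the paper's definition, and the argument names are hypothetical.

```python
# Sketch: per-example indicators for the three metrics. OSR demands the
# correct tool AND the correct slot names AND the correct slot values.

def tsa(pred_tool, gold_tool):
    return pred_tool == gold_tool

def sfa(pred_slots, gold_slots):
    # Slot names must match exactly: no missing and no spurious slots.
    return set(pred_slots) == set(gold_slots)

def osr(pred_tool, pred_slots, gold_tool, gold_slots):
    # pred_slots / gold_slots map slot name -> value.
    return (tsa(pred_tool, gold_tool)
            and sfa(pred_slots, gold_slots)
            and all(pred_slots[k] == gold_slots[k] for k in gold_slots))

def rate(indicator_values):
    """Corpus-level metric: mean of the per-example indicators."""
    vals = list(indicator_values)
    return sum(vals) / len(vals) if vals else 0.0
```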
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each of the major comments point by point below, and we will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
- Referee: [Experiments] Experiments section (and associated tables): the reported 5-20% relative OSR improvements are presented without any description of how benchmark queries or rollout traces used during iterative reflection and prompt updates are partitioned from the final held-out evaluation set. This directly bears on the central claim that gains arise from reliable correction rather than memorization of observed error patterns.
  Authors: We agree that clarifying the data partitioning is essential to substantiate our claims. In the original experimental design, the iterative reflection and optimization were performed exclusively on a designated training subset of the benchmarks, with rollout traces generated only from those queries. The final reported metrics are computed on a completely held-out test set that was never used in any optimization step. We will revise the Experiments section to explicitly describe this train/test split, including the proportions used and confirmation that no test data influenced the prompt or tool description updates. This addition will directly address concerns about potential memorization. revision: yes
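As an editorial illustration of the partitioning discipline the authors commit to (hypothetical code, not theirs): the essential property is that reflection only ever consumes the training split, and the test split is scored exactly once.

```python
# Sketch: deterministic query split so no test item can leak into the
# reflection loop. Assumes queries are unique strings.
import random

def split_queries(queries, test_frac=0.2, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = list(queries)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]
    assert not set(test) & set(train), "train/test overlap detected"
    return train, test

# Optimization (rollouts + reflection) reads `train` only; TSA/SFA/OSR
# are computed once on the untouched `test` split at the very end.
```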
- Referee: [Ablations] Ablation studies: while joint optimization of instructions and tool schemas is claimed to be more effective than isolated optimization, the paper provides no variance estimates across independent runs, no statistical significance tests, and no error bars on the TSA/SFA/OSR metrics, making it impossible to assess whether the reported differences are reliable or could be due to optimization stochasticity.
  Authors: This is a valid point regarding the robustness of our ablation results. Due to the computational cost of running the full optimization pipeline, our initial experiments were limited to single runs. However, we recognize the importance of statistical rigor. In the revised version, we will perform the ablations across multiple independent runs (at least three) using different random seeds for any stochastic components in the reflection or LLM calls. We will report means, standard deviations, and error bars for TSA, SFA, and OSR, and include statistical significance tests (e.g., Wilcoxon signed-rank test) comparing joint vs. isolated optimization. These updates will be added to the Ablations subsection. revision: yes
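For the promised statistics, a paired non-parametric comparison over per-seed scores might look like the following sketch. `scipy.stats.wilcoxon` is a real function with this signature; the OSR numbers below are placeholders, not results from the paper.

```python
# Sketch: joint vs. isolated optimization compared across seeds.
from statistics import mean, stdev
from scipy.stats import wilcoxon

# Placeholder per-seed OSR values; real ones would come from reruns.
osr_joint = [0.71, 0.69, 0.73, 0.70, 0.72]
osr_isolated = [0.66, 0.65, 0.68, 0.64, 0.67]

print(f"joint:    {mean(osr_joint):.3f} +/- {stdev(osr_joint):.3f}")
print(f"isolated: {mean(osr_isolated):.3f} +/- {stdev(osr_isolated):.3f}")

# Paired test: each seed contributes one matched (joint, isolated) pair.
stat, p = wilcoxon(osr_joint, osr_isolated)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.4f}")
```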
- Referee: [Method] Method section (reflection procedure): the claim that rollout-driven reflection reliably identifies and corrects ambiguous descriptions without introducing new biases rests on the assumption that the reflection traces are representative and that the process does not overfit to the specific error distributions seen during optimization; no analysis of pre/post error types or failure modes on held-out data is supplied to support this.
  Authors: We thank the referee for highlighting this aspect of our method's validation. While the current manuscript focuses on overall performance metrics, we agree that a finer-grained error analysis would strengthen the claims. In the revision, we will add a new subsection in the Experiments or Method section that analyzes the distribution of error types (tool selection errors, slot-filling errors, value instantiation errors) on the held-out test set before and after applying JTPRO. This will demonstrate that the reflection-driven updates primarily reduce ambiguous description-related errors without increasing other failure modes, thereby supporting that the process does not introduce new biases. We will also discuss the representativeness of the traces used. revision: yes
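The promised error breakdown could be as lightweight as categorized counts over held-out failures, as in this sketch. `classify_error`, the trace fields, and the category names are hypothetical; a real analysis would use the paper's own annotations.

```python
# Sketch: error-type distribution on held-out data, before vs. after.
from collections import Counter

CATEGORIES = ("tool_selection", "slot_filling", "value_instantiation")

def classify_error(trace):
    """Assign a failed trace to one category (hypothetical logic)."""
    if trace["pred_tool"] != trace["gold_tool"]:
        return "tool_selection"
    if set(trace["pred_slots"]) != set(trace["gold_slots"]):
        return "slot_filling"
    return "value_instantiation"

def error_profile(failed_traces):
    counts = Counter(classify_error(t) for t in failed_traces)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in CATEGORIES}

# Comparing error_profile(before) to error_profile(after) shows whether
# reflection shrinks errors or merely shifts them between categories.
```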
Circularity Check
No significant circularity; framework is procedural and benchmark-evaluated
full rationale
The paper defines JTPRO as an iterative, rollout-driven procedure for co-optimizing global instructions and per-tool schemas in trace-supervised settings. No equations, fitted parameters, or derivations appear that would reduce the reported TSA/SFA/OSR gains to inputs by construction. Evaluation uses external multi-tool benchmarks with held-out metrics and ablations for joint vs. isolated optimization; claims rest on empirical comparisons to baselines like CoT and GEPA rather than self-referential definitions or load-bearing self-citations. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P. Self-Refine: Iterative Refinement with Self-Feedback. Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023.
- [2] Sécheresse, X., Guilbert-Ly, J.-Y., Villedieu de Torcy, A. Frontiers in Artificial Intelligence. 2025. doi:10.3389/frai.2025.1613007
- [3] Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. The Twelfth International Conference on Learning Representations. 2024.
- [4] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv preprint. 2025.
- [5] Multi-objective Asynchronous Successive Halving. arXiv preprint. 2021.
- [6] Toolformer: Language Models Can Teach Themselves to Use Tools. Thirty-seventh Conference on Neural Information Processing Systems. 2023.
- [7] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. 2023.
- [8] ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models. arXiv preprint. 2023.
- [9] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. The Twelfth International Conference on Learning Representations. 2024.
- [10] AutoPDL: Automatic Prompt Optimization for LLM Agents. arXiv preprint arXiv:2504.04365. 2025. doi:10.48550/arXiv.2504.04365
- [11] Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.346
- [12] Lester, B., Al-Rfou, R., Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.243
- [13] Pryzant, R., Iter, D., Li, J., Lee, Y., Zhu, C., Zeng, M. Automatic Prompt Optimization with "Gradient Descent" and Beam Search. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.494
- [14] Optimizing generative AI by backpropagating language model feedback. Nature. 2025.
- [15] Large Language Models as Optimizers. The Twelfth International Conference on Learning Representations. 2024.
- [16] Opsahl-Ong, K., Ryan, M. J., Purtell, J., Broman, D., Potts, C., Zaharia, M., Khattab, O. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.525
- [17] Chen, Y., Arkin, J., Hao, Y., Zhang, Y., Roy, N., Fan, C. PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Heuristic-based Sampling. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.226
- [18] PhaseEvo: Towards Unified Long-Context Prompt Optimization for Large Language Models. First Workshop on Long-Context Foundation Models @ ICML 2024. 2024.
- [19] LLM Agents Making Agent Tools. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1266
- [20] Hao, Y., Cao, P., Jin, Z., Liao, H., Chen, Y., Liu, K., Zhao, J. Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence. 2025.
- [21] Wu, S., Zhao, S., Huang, Q., Huang, K., Yasunaga, M., Cao, K., Ioannidis, V. N., Subbian, K., Leskovec, J., Zou, J. AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning. 2024.
- [22] Liu, H., Dou, Z.-Y., Wang, Y., Peng, N., Yue, Y. Uncertainty Calibration for Tool-Using Language Agents. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.978
- [23] Sullivan, M., Hartmann, M., Koller, A. Procedural Environment Generation for Tool-Use Agents. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.936
- [24] Enhancing Tool Retrieval with Iterative Feedback from Large Language Models. arXiv preprint. 2024.
- [25] Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark. arXiv preprint arXiv:2405.08355. 2024.
- [26] Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W. X., Wei, Z., Wen, J. A survey on large language model based autonomous agents. Frontiers of Computer Science. 2024. doi:10.1007/s11704-024-40231-1
- [27] ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint. 2023.
- [28] Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint. 2023.
- [29] Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of Large Language Models. arXiv preprint. 2024.
- [30] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint. 2023.
- [31] ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint. 2023.
- [32] From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions. arXiv preprint. 2025.
- [33] Wu, B., Meij, E., Yilmaz, E. A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1149
- [34] Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory. arXiv preprint. 2025.
- [35] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint. 2025.
- [36] Tool Learning with Foundation Models. arXiv preprint. 2024.
- [37] ToolACE: Winning the Points of LLM Function Calling. arXiv preprint. 2025.
- [38] ToolRL: Reward is All Tool Learning Needs. arXiv preprint. 2025.
- [39] Singh, J., Sun, W., Agarwal, A., Krishnamurthy, V., Benajiba, Y., Ravi, S., Roth, D. Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2025.
- [40] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv preprint. 2025.
- [41] Alazraki, L., Rei, M. Meta-Reasoning Improves Tool Use in Large Language Models. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.440
- [42] Structure-aware Fine-tuning for Code Pre-trained Models. arXiv preprint. 2024.
- [43] RestGPT: Connecting Large Language Models with Real-World RESTful APIs. arXiv preprint. 2023.
- [44] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv preprint. 2023.
- [45] CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets. arXiv preprint. 2024.
- [46] ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval. arXiv preprint. 2024.
- [47] Singh, J. Natural Language Processing in the Real World: Text Processing, Analytics, and Classification. doi:10.1201/9781003264774
- [48] Qu, C., Dai, S., Wei, X., Cai, H., Wang, S., Yin, D., Xu, J., Wen, J.-R. Towards Completeness-Oriented Tool Retrieval for Large Language Models. 2024. doi:10.1145/3627673.3679847
- [49] Zhang, S. Agentic AI Across Domains: A Comprehensive Review of Capabilities, Applications, and Future Directions. Journal of Computing Innovations and Applications.
- [50] MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation. arXiv preprint. 2026.
- [51] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I. Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017.
- [52] Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 2025.
- [53] Agarwal, A., Meghwani, H., Patel, H. L., Sheng, T., Ravi, S., Roth, D. Aligning LLMs for Multilingual Consistency in Enterprise Applications. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2025. doi:10.18653/v1/2025.emnlp-industry.9
- [54] Hybrid AI for Responsive Multi-Turn Online Conversations with Novel Dynamic Routing and Feedback Adaptation. Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing.
- [55] Maestro: Joint Graph & Config Optimization for Reliable AI Agents. arXiv preprint. 2025.
- [56] Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison. arXiv preprint. 2025.
- [57] DiffuMask: Diffusion Language Model for Token-level Prompt Pruning. arXiv preprint. 2026.