JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents
Pith reviewed 2026-05-10 04:43 UTC · model grok-4.3
The pith
Jointly refining agent instructions and tool descriptions via rollout reflection improves tool selection and slot filling for LLM agents with large tool sets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JTPRO is a framework that iteratively applies rollout-driven reflection to co-optimize global instructions and per-tool schema and argument descriptions for accurate tool selection and argument instantiation in large tool inventories. The method is designed to retain only the tool-local cues required for disambiguation and slot filling. Across benchmarks that vary the number of tools, JTPRO raises overall success rate by 5-20% (relative) over strong baselines such as CoT-style agents and reflective prompt optimizers. Separate ablations confirm that optimizing instructions and tool schemas together yields more effective and robust gains than optimizing either component in isolation.
What carries the argument
Joint Tool-Prompt Reflective Optimization, an iterative process that co-optimizes a shared instruction set and individual tool schemas by reflecting on rollout traces to correct selection and instantiation errors.
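The review carries no reference implementation, so the following is a minimal Python sketch of what such a loop could look like. All names (`run_rollouts`, `reflect`, `joint_reflective_loop`) are hypothetical stand-ins rather than JTPRO's actual API, and the LLM-facing steps are stubbed.

```python
# Hypothetical sketch (not the paper's code): the shape of a joint
# tool-prompt reflective optimization loop in a trace-supervised setting.
from dataclasses import dataclass

@dataclass
class Trace:
    query: str
    chosen_tool: str
    slots: dict
    success: bool
    error_note: str = ""

def run_rollouts(instructions, tool_schemas, train_queries):
    """Execute the agent on training queries and score each trace.
    Stub: a real system would drive an LLM agent here."""
    raise NotImplementedError

def reflect(instructions, tool_schemas, failures):
    """Propose edits from failed traces via an LLM reflector, keeping
    edits tool-local: global guidance goes into `instructions`; cues for
    disambiguation and slot formatting go only into the schema of the
    tool implicated by the failure. Stub."""
    raise NotImplementedError

def joint_reflective_loop(instructions, tool_schemas, train_queries, rounds=5):
    """Co-optimize instructions and tool schemas from rollout failures."""
    for _ in range(rounds):
        traces = run_rollouts(instructions, tool_schemas, train_queries)
        failures = [t for t in traces if not t.success]
        if not failures:
            break  # nothing left to correct on the training split
        # Joint update: both components are revised from the same failure set.
        instructions, tool_schemas = reflect(instructions, tool_schemas, failures)
    return instructions, tool_schemas
```

The key design point the pith emphasizes is visible in the `reflect` contract: updates are partitioned so that only the offending tool's schema absorbs tool-specific cues, while shared guidance stays in the global instructions.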
If this is right
- Joint optimization of instructions and tool schemas produces higher overall success than optimizing either one separately.
- The framework applies in trace-supervised settings where rollout outcomes supply the signal for reflection.
- Gains appear across benchmarks that differ in total number of tools.
- Only tool-local cues are retained, reducing the risk of over-generalizing changes.
- The approach targets both tool selection accuracy and slot-filling accuracy as separate but related improvements.
Where Pith is reading between the lines
- If reflection on traces proves reliable, similar joint optimization loops could be applied to other agent components such as planning steps or memory retrieval.
- The method may lower the cost of deploying agents in new domains by automating prompt and schema tuning from a modest set of example traces.
- A natural extension would test whether the refined schemas transfer to previously unseen tools without further reflection.
- The current results rest on access to evaluation rollouts, so performance in fully open-ended agent interactions without traces remains an open question.
Load-bearing premise
Ambiguous tool descriptions and under-specified agent instructions are the primary causes of mis-selection and incorrect slot filling, and rollout-driven reflection can reliably identify and fix them without adding new biases or overfitting to the evaluation traces.
What would settle it
Run JTPRO on a fresh benchmark where tools have deliberately ambiguous descriptions and under-specified schemas; if overall success rate does not rise or falls below the CoT baseline, the central claim would be falsified.
Original abstract
Large language model (LLM) agents augmented with external tools often struggle as the number of tools grows large and the tools become domain-specific. In such settings, ambiguous tool descriptions and under-specified agent instructions frequently lead to tool mis-selection and incorrect slot/value instantiation. We hypothesize that this is due to two root causes: generic, one-size-fits-all prompts that ignore tool-specific nuances, and underspecified tool schemas that lack clear guidance on when and how to use each tool and how to format its parameters. We introduce Joint Tool-Prompt Reflective Optimization (JTPRO), a framework for improving tool-calling reliability in trace-supervised settings by iteratively using rollout-driven reflection to co-optimize global instructions and per-tool schema/argument descriptions for accurate tool selection and argument instantiation in large tool inventories. JTPRO is designed to preserve only tool-local cues needed for correct disambiguation and slot filling. We evaluate JTPRO across multi-tool benchmarks, which vary in the number of tools, using three metrics: Tool Selection Accuracy (TSA), Slot Filling Accuracy (SFA), and Overall Success Rate (OSR) (correct tool + correct slots + correct values). JTPRO consistently outperforms strong baselines, including CoT-style agents and reflective prompt optimizers such as GEPA, by 5%-20% (relative) on OSR. Ablations show that joint optimization of instructions and tool schemas is more effective and robust than optimizing either component in isolation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JTPRO, a framework for LLM agents that iteratively co-optimizes global instructions and per-tool schema/argument descriptions via rollout-driven reflection in trace-supervised settings. It targets ambiguous tool descriptions and underspecified instructions as root causes of tool mis-selection and slot-filling errors in large tool inventories, and reports consistent 5-20% relative gains in Overall Success Rate (OSR) over CoT-style agents and reflective optimizers such as GEPA across multi-tool benchmarks, with ablations favoring joint over isolated optimization.
Significance. If the empirical gains are robust, the work offers a concrete, iterative optimization procedure that preserves tool-local cues while improving disambiguation and instantiation accuracy; this could meaningfully advance reliable tool use in domain-specific agent settings where one-size-fits-all prompts fail.
major comments (3)
- [Experiments] Experiments section (and associated tables): the reported 5-20% relative OSR improvements are presented without any description of how benchmark queries or rollout traces used during iterative reflection and prompt updates are partitioned from the final held-out evaluation set. This directly bears on the central claim that gains arise from reliable correction rather than memorization of observed error patterns.
- [Ablations] Ablation studies: while joint optimization of instructions and tool schemas is claimed to be more effective than isolated optimization, the paper provides no variance estimates across independent runs, no statistical significance tests, and no error bars on the TSA/SFA/OSR metrics, making it impossible to assess whether the reported differences are reliable or could be due to optimization stochasticity.
- [Method] Method section (reflection procedure): the claim that rollout-driven reflection reliably identifies and corrects ambiguous descriptions without introducing new biases rests on the assumption that the reflection traces are representative and that the process does not overfit to the specific error distributions seen during optimization; no analysis of pre/post error types or failure modes on held-out data is supplied to support this.
minor comments (2)
- [Abstract] The abstract and introduction should explicitly state the number of tools and domains in each benchmark to allow readers to gauge the scale at which the method is tested.
- [Evaluation Metrics] Notation for the three metrics (TSA, SFA, OSR) is introduced, but the precise definition of 'correct values' in OSR is not formalized; a short equation or pseudocode would improve clarity (one possible sketch follows below).
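One possible formalization, assuming per-example gold annotations for the target tool, its slot names, and their values. This is an editorial sketch, not the paper's definition, and the argument names are hypothetical.

```python
# Sketch: per-example indicators for the three metrics. OSR demands the
# correct tool AND the correct slot names AND the correct slot values.

def tsa(pred_tool, gold_tool):
    return pred_tool == gold_tool

def sfa(pred_slots, gold_slots):
    # Slot names must match exactly: no missing and no spurious slots.
    return set(pred_slots) == set(gold_slots)

def osr(pred_tool, pred_slots, gold_tool, gold_slots):
    # pred_slots / gold_slots map slot name -> value.
    return (tsa(pred_tool, gold_tool)
            and sfa(pred_slots, gold_slots)
            and all(pred_slots[k] == gold_slots[k] for k in gold_slots))

def rate(indicator_values):
    """Corpus-level metric: mean of the per-example indicators."""
    vals = list(indicator_values)
    return sum(vals) / len(vals) if vals else 0.0
```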
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each of the major comments point by point below, and we will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
- Referee: [Experiments] Experiments section (and associated tables): the reported 5-20% relative OSR improvements are presented without any description of how benchmark queries or rollout traces used during iterative reflection and prompt updates are partitioned from the final held-out evaluation set. This directly bears on the central claim that gains arise from reliable correction rather than memorization of observed error patterns.
  Authors: We agree that clarifying the data partitioning is essential to substantiate our claims. In the original experimental design, the iterative reflection and optimization were performed exclusively on a designated training subset of the benchmarks, with rollout traces generated only from those queries. The final reported metrics are computed on a completely held-out test set that was never used in any optimization step. We will revise the Experiments section to explicitly describe this train/test split, including the proportions used and confirmation that no test data influenced the prompt or tool description updates. This addition will directly address concerns about potential memorization. revision: yes
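As an editorial illustration of the partitioning discipline the authors commit to (hypothetical code, not theirs): the essential property is that reflection only ever consumes the training split, and the test split is scored exactly once.

```python
# Sketch: deterministic query split so no test item can leak into the
# reflection loop. Assumes queries are unique strings.
import random

def split_queries(queries, test_frac=0.2, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = list(queries)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, train = shuffled[:n_test], shuffled[n_test:]
    assert not set(test) & set(train), "train/test overlap detected"
    return train, test

# Optimization (rollouts + reflection) reads `train` only; TSA/SFA/OSR
# are computed once on the untouched `test` split at the very end.
```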
- Referee: [Ablations] Ablation studies: while joint optimization of instructions and tool schemas is claimed to be more effective than isolated optimization, the paper provides no variance estimates across independent runs, no statistical significance tests, and no error bars on the TSA/SFA/OSR metrics, making it impossible to assess whether the reported differences are reliable or could be due to optimization stochasticity.
  Authors: This is a valid point regarding the robustness of our ablation results. Due to the computational cost of running the full optimization pipeline, our initial experiments were limited to single runs. However, we recognize the importance of statistical rigor. In the revised version, we will perform the ablations across multiple independent runs (at least three) using different random seeds for any stochastic components in the reflection or LLM calls. We will report means, standard deviations, and error bars for TSA, SFA, and OSR, and include statistical significance tests (e.g., Wilcoxon signed-rank test) comparing joint vs. isolated optimization. These updates will be added to the Ablations subsection. revision: yes
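For the promised statistics, a paired non-parametric comparison over per-seed scores might look like the following sketch. `scipy.stats.wilcoxon` is a real function with this signature; the OSR numbers below are placeholders, not results from the paper.

```python
# Sketch: joint vs. isolated optimization compared across seeds.
from statistics import mean, stdev
from scipy.stats import wilcoxon

# Placeholder per-seed OSR values; real ones would come from reruns.
osr_joint = [0.71, 0.69, 0.73, 0.70, 0.72]
osr_isolated = [0.66, 0.65, 0.68, 0.64, 0.67]

print(f"joint:    {mean(osr_joint):.3f} +/- {stdev(osr_joint):.3f}")
print(f"isolated: {mean(osr_isolated):.3f} +/- {stdev(osr_isolated):.3f}")

# Paired test: each seed contributes one matched (joint, isolated) pair.
stat, p = wilcoxon(osr_joint, osr_isolated)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.4f}")
```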
- Referee: [Method] Method section (reflection procedure): the claim that rollout-driven reflection reliably identifies and corrects ambiguous descriptions without introducing new biases rests on the assumption that the reflection traces are representative and that the process does not overfit to the specific error distributions seen during optimization; no analysis of pre/post error types or failure modes on held-out data is supplied to support this.
  Authors: We thank the referee for highlighting this aspect of our method's validation. While the current manuscript focuses on overall performance metrics, we agree that a finer-grained error analysis would strengthen the claims. In the revision, we will add a new subsection in the Experiments or Method section that analyzes the distribution of error types (tool selection errors, slot-filling errors, value instantiation errors) on the held-out test set before and after applying JTPRO. This will demonstrate that the reflection-driven updates primarily reduce ambiguous description-related errors without increasing other failure modes, thereby supporting that the process does not introduce new biases. We will also discuss the representativeness of the traces used. revision: yes
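The promised error breakdown could be as lightweight as categorized counts over held-out failures, as in this sketch. `classify_error`, the trace fields, and the category names are hypothetical; a real analysis would use the paper's own annotations.

```python
# Sketch: error-type distribution on held-out data, before vs. after.
from collections import Counter

CATEGORIES = ("tool_selection", "slot_filling", "value_instantiation")

def classify_error(trace):
    """Assign a failed trace to one category (hypothetical logic)."""
    if trace["pred_tool"] != trace["gold_tool"]:
        return "tool_selection"
    if set(trace["pred_slots"]) != set(trace["gold_slots"]):
        return "slot_filling"
    return "value_instantiation"

def error_profile(failed_traces):
    counts = Counter(classify_error(t) for t in failed_traces)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in CATEGORIES}

# Comparing error_profile(before) to error_profile(after) shows whether
# reflection shrinks errors or merely shifts them between categories.
```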
Circularity Check
No significant circularity; framework is procedural and benchmark-evaluated
full rationale
The paper defines JTPRO as an iterative, rollout-driven procedure for co-optimizing global instructions and per-tool schemas in trace-supervised settings. No equations, fitted parameters, or derivations appear that would reduce the reported TSA/SFA/OSR gains to inputs by construction. Evaluation uses external multi-tool benchmarks with held-out metrics and ablations for joint vs. isolated optimization; claims rest on empirical comparisons to baselines like CoT and GEPA rather than self-referential definitions or load-bearing self-citations. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P. Self-Refine: Iterative Refinement with Self-Feedback. Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023.
- [2] Sécheresse, X., Guilbert-Ly, J.-Y., Villedieu de Torcy, A. Frontiers in Artificial Intelligence. 2025. doi:10.3389/frai.2025.1613007
- [3] Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. The Twelfth International Conference on Learning Representations. 2024.
- [4] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv preprint. 2025.
- [5] Multi-objective Asynchronous Successive Halving. arXiv preprint. 2021.
- [6] Toolformer: Language Models Can Teach Themselves to Use Tools. Thirty-seventh Conference on Neural Information Processing Systems. 2023.
- [7] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. 2023.
- [8] ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models. arXiv preprint. 2023.
- [9] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. The Twelfth International Conference on Learning Representations. 2024.
- [10] AutoPDL: Automatic Prompt Optimization for LLM Agents. arXiv preprint arXiv:2504.04365. 2025. doi:10.48550/arXiv.2504.04365
- [11] Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.346
- [12] Lester, B., Al-Rfou, R., Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.243
- [13] Pryzant, R., Iter, D., Li, J., Lee, Y., Zhu, C., Zeng, M. Automatic Prompt Optimization with "Gradient Descent" and Beam Search. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.494
- [14] Optimizing generative AI by backpropagating language model feedback. Nature. 2025.
- [15] Large Language Models as Optimizers. The Twelfth International Conference on Learning Representations. 2024.
- [16] Opsahl-Ong, K., Ryan, M. J., Purtell, J., Broman, D., Potts, C., Zaharia, M., Khattab, O. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.525
- [17] Chen, Y., Arkin, J., Hao, Y., Zhang, Y., Roy, N., Fan, C. PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Heuristic-based Sampling. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.226
- [18] PhaseEvo: Towards Unified Long-Context Prompt Optimization for Large Language Models. First Workshop on Long-Context Foundation Models @ ICML 2024. 2024.
- [19] LLM Agents Making Agent Tools. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1266
- [20] Hao, Y., Cao, P., Jin, Z., Liao, H., Chen, Y., Liu, K., Zhao, J. Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence. 2025.
- [21] Wu, S., Zhao, S., Huang, Q., Huang, K., Yasunaga, M., Cao, K., Ioannidis, V. N., Subbian, K., Leskovec, J., Zou, J. AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning. 2024.
- [22] Liu, H., Dou, Z.-Y., Wang, Y., Peng, N., Yue, Y. Uncertainty Calibration for Tool-Using Language Agents. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.978
- [23] Sullivan, M., Hartmann, M., Koller, A. Procedural Environment Generation for Tool-Use Agents. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.936
- [24] Enhancing Tool Retrieval with Iterative Feedback from Large Language Models. arXiv preprint. 2024.
- [25] Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark. arXiv preprint arXiv:2405.08355. 2024.
- [26] Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W. X., Wei, Z., Wen, J. A survey on large language model based autonomous agents. Frontiers of Computer Science. 2024. doi:10.1007/s11704-024-40231-1
- [27] ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint. 2023.
- [28] Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint. 2023.
- [29] Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of Large Language Models. arXiv preprint. 2024.
- [30] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint. 2023.
- [31] ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint. 2023.
- [32] From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions. arXiv preprint. 2025.
- [33] Wu, B., Meij, E., Yilmaz, E. A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1149
- [34] Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory. arXiv preprint. 2025.
- [35] Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint. 2025.
- [36] Tool Learning with Foundation Models. arXiv preprint. 2024.
- [37] ToolACE: Winning the Points of LLM Function Calling. arXiv preprint. 2025.
- [38] ToolRL: Reward is All Tool Learning Needs. arXiv preprint. 2025.
- [39] Singh, J., Sun, W., Agarwal, A., Krishnamurthy, V., Benajiba, Y., Ravi, S., Roth, D. Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2025.
- [40] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. arXiv preprint. 2025.
- [41] Alazraki, L., Rei, M. Meta-Reasoning Improves Tool Use in Large Language Models. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.440
- [42] Structure-aware Fine-tuning for Code Pre-trained Models. arXiv preprint. 2024.
- [43] RestGPT: Connecting Large Language Models with Real-World RESTful APIs. arXiv preprint. 2023.
- [44] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv preprint. 2023.
- [45] CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets. arXiv preprint. 2024.
- [46] ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval. arXiv preprint. 2024.
- [47] Singh, J. Natural Language Processing in the Real World: Text Processing, Analytics, and Classification. doi:10.1201/9781003264774
- [48] Qu, C., Dai, S., Wei, X., Cai, H., Wang, S., Yin, D., Xu, J., Wen, J.-R. Towards Completeness-Oriented Tool Retrieval for Large Language Models. 2024. doi:10.1145/3627673.3679847
- [49] Zhang, S. Agentic AI Across Domains: A Comprehensive Review of Capabilities, Applications, and Future Directions. Journal of Computing Innovations and Applications.
- [50] MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation. arXiv preprint. 2026.
- [51] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I. Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017.
- [52] Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 2025.
- [53] Agarwal, A., Meghwani, H., Patel, H. L., Sheng, T., Ravi, S., Roth, D. Aligning LLMs for Multilingual Consistency in Enterprise Applications. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2025. doi:10.18653/v1/2025.emnlp-industry.9
- [54] Hybrid AI for Responsive Multi-Turn Online Conversations with Novel Dynamic Routing and Feedback Adaptation. Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing.
- [55] Maestro: Joint Graph & Config Optimization for Reliable AI Agents. arXiv preprint. 2025.
- [56] Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison. arXiv preprint. 2025.
- [57] DiffuMask: Diffusion Language Model for Token-level Prompt Pruning. arXiv preprint. 2026.