pith. machine review for the scientific record.

arxiv: 2304.08244 · v2 · submitted 2023-04-14 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:47 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI
keywords: API-Bank · tool-augmented LLMs · benchmark · Lynx model · API calling · tool utilization · LLM fine-tuning · GPT-3.5 comparison

The pith

The API-Bank benchmark reveals that training Lynx on tool-use dialogues lets it surpass Alpaca by over 26 points and approach GPT-3.5 in using external APIs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces API-Bank to measure how well large language models can plan, retrieve, and call external tools. It evaluates existing models such as GPT-3, GPT-3.5, and GPT-4 on 314 annotated dialogues involving 73 APIs, then provides a training set of 1,888 dialogues to fine-tune the Alpaca model into Lynx, which shows major gains in tool utilization. Together with an error analysis, this addresses three questions: how effective current LLMs are at using tools, how that ability can be improved, and what obstacles remain.
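
To make the unit of evaluation concrete, here is a minimal sketch of what one annotated tool-use turn could look like, using the API-Request string format quoted in the paper's prompt; the field names and the weather example are assumptions for illustration, not the dataset's actual schema.

```python
# Illustrative only: one tool-use turn in the spirit of API-Bank's prompt
# format ("API-Request: [ApiName(key1='value1', ...)]"). The field names are
# hypothetical; the paper does use a ToolSearcher API for the retrieval step,
# but the exact schema of the released dialogues is not reproduced here.
example_turn = {
    "api_description": {
        "name": "ToolSearcher",
        "description": "Searches for relevant tools in the library based on keywords.",
        "input_parameters": {"keywords": {"type": "str"}},
    },
    "dialogue": [
        {"role": "User", "text": "Can you check tomorrow's weather in Berlin?"},
        {"role": "AI", "text": "Let me find a suitable weather tool first."},
    ],
    "expected_output": "API-Request: [ToolSearcher(keywords='weather forecast')]",
}
```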

Core claim

API-Bank provides a runnable evaluation system with 73 API tools and 314 tool-use dialogues containing 753 API calls to assess LLMs at planning, retrieving, and calling APIs. A training set of 1,888 tool-use dialogues drawn from 2,138 APIs across 1,000 domains is used to train Lynx from Alpaca. Lynx outperforms Alpaca by more than 26 points and approaches the effectiveness of GPT-3.5; GPT-4 excels in planning, but all models leave room for improvement.

What carries the argument

The API-Bank evaluation system of 73 runnable APIs and annotated dialogues, which measures planning, retrieval, and calling accuracy, plus the associated training dataset used to create the Lynx model.
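
The automatic evaluation is the hinge of these claims, so it helps to see what a calling-accuracy check can amount to. The sketch below is a toy version under the assumption that a predicted call is compared to a gold call by exact match on name and arguments; GetWeather and the parsing regex are invented for illustration, and the paper's actual scorer, which runs the APIs, is not specified in this summary.

```python
import re
from typing import Dict, Tuple

# Toy sketch, not the paper's scorer: API-Bank's evaluation system actually
# executes the 73 APIs, whereas this version only string-parses an API-Request
# of the form "[ApiName(key1='v1', key2='v2')]" and checks it against a gold
# call by exact match on the API name and every keyword argument.

CALL_PATTERN = re.compile(r"\[(\w+)\((.*)\)\]")

def parse_call(request: str) -> Tuple[str, Dict[str, str]]:
    """Extract the API name and keyword arguments from an API-Request string."""
    match = CALL_PATTERN.search(request)
    if match is None:
        raise ValueError(f"not a well-formed API-Request: {request!r}")
    name, arg_str = match.group(1), match.group(2)
    args = dict(re.findall(r"(\w+)='([^']*)'", arg_str))
    return name, args

def call_is_correct(predicted: str, gold: str) -> bool:
    """Return True only if the predicted call matches the gold call exactly."""
    try:
        return parse_call(predicted) == parse_call(gold)
    except ValueError:
        return False  # malformed predictions count as failed calls

gold = "API-Request: [GetWeather(city='Berlin', date='2023-04-15')]"
print(call_is_correct("API-Request: [GetWeather(city='Berlin', date='2023-04-15')]", gold))  # True
print(call_is_correct("API-Request: [GetWeather(city='Paris', date='2023-04-15')]", gold))   # False
```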

If this is right

  • GPT-3.5 demonstrates better tool utilization than GPT-3.
  • GPT-4 shows superior planning abilities compared to other models.
  • Significant potential remains for further improvements in tool-augmented LLMs.
  • Error analysis identifies specific challenges like accurate API retrieval and handling complex planning for future work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Expanding the benchmark to more APIs and domains could make tool-augmented models more practical for everyday applications.
  • Models like Lynx might integrate into systems that automatically select and use APIs without human intervention.
  • Similar benchmarks could be developed for other capabilities like code execution or web browsing to advance agent-like LLMs.

Load-bearing premise

The selected 73 APIs and 314 dialogues are representative of real tool-use scenarios and the automatic evaluation correctly measures the accuracy of planning, retrieval, and API calling.

What would settle it

A follow-up study that tests Lynx and other models on a new set of previously unseen APIs or real-world tasks, or a demonstration that the automatic scores do not align with human judgments of successful tool use.

read the original abstract

Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs in utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess the existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant potential for further improvement. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5. Through error analysis, we highlight the key challenges for future research in this field to answer the third question.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces API-Bank, a benchmark for tool-augmented LLMs consisting of a runnable evaluation system with 73 APIs and 314 annotated dialogues (753 API calls) to measure planning, retrieval, and calling performance, plus a training set of 1,888 dialogues from 2,138 APIs across 1,000 domains. It fine-tunes Lynx from Alpaca on this data and reports that Lynx improves tool utilization over Alpaca by more than 26 points while approaching GPT-3.5, with additional analysis of GPT-3/3.5/4 capabilities and remaining challenges.

Significance. If the automatic evaluation proves reliable, the work supplies a concrete, runnable benchmark and training resource that directly addresses the gap in standardized tool-use evaluation for LLMs. The scale of the datasets, the multi-aspect breakdown (planning/retrieval/calling), and the demonstration that targeted fine-tuning yields substantial gains over the Alpaca baseline would make this a useful reference point for subsequent research on tool-augmented models.

major comments (2)
  1. [Evaluation] Evaluation system (abstract and § on evaluation): the headline claim that Lynx surpasses Alpaca by >26 pts and approaches GPT-3.5 rests entirely on an automatic scorer whose implementation details, handling of API failures, edge cases, and agreement with human judgments are not reported. Without inter-annotator agreement statistics or validation of the scorer on the 314-dialogue test set, the numeric improvement cannot be trusted as load-bearing evidence.
  2. [Dataset] Dataset construction (abstract): the 73 chosen APIs and 314 annotated dialogues are presented as representative of realistic tool-use scenarios, yet no external corroboration, coverage analysis, or comparison to real-world distributions is supplied. This assumption directly affects the generalizability of both the benchmark results and the Lynx training gains.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'more than 26 pts' should explicitly name the metric (e.g., success rate, F1) and the exact baseline score for Alpaca.
  2. [Evaluation] The paper should clarify whether the automatic evaluator credits partial API calls or requires exact matches, as this choice affects interpretation of the reported planning and calling accuracies.
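
The second minor comment turns on whether the evaluator gives partial credit. A small sketch of how the two rubrics can diverge on the same prediction follows; both scoring functions and the BookHotel example are hypothetical, since the paper's rubric is precisely what the comment asks to have stated.

```python
from typing import Dict, Tuple

Call = Tuple[str, Dict[str, str]]

# Hypothetical rubrics, not the paper's: an exact-match scorer credits only
# fully correct calls, while a partial-credit scorer also rewards getting the
# API name and a subset of the parameters right.

def exact_score(pred: Call, gold: Call) -> float:
    return 1.0 if pred == gold else 0.0

def partial_score(pred: Call, gold: Call) -> float:
    pred_name, pred_args = pred
    gold_name, gold_args = gold
    if pred_name != gold_name:
        return 0.0  # wrong tool: no credit regardless of arguments
    if not gold_args:
        return 1.0
    correct = sum(1 for key, value in gold_args.items() if pred_args.get(key) == value)
    return correct / len(gold_args)

gold: Call = ("BookHotel", {"city": "Rome", "checkin": "2023-05-01", "nights": "3"})
pred: Call = ("BookHotel", {"city": "Rome", "checkin": "2023-05-02", "nights": "3"})

print(exact_score(pred, gold))    # 0.0 under exact match
print(partial_score(pred, gold))  # ~0.67 under partial credit
```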

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. We have revised the manuscript to incorporate additional details and analyses where the comments identify gaps in the original submission.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation system (abstract and § on evaluation): the headline claim that Lynx surpasses Alpaca by >26 pts and approaches GPT-3.5 rests entirely on an automatic scorer whose implementation details, handling of API failures, edge cases, and agreement with human judgments are not reported. Without inter-annotator agreement statistics or validation of the scorer on the 314-dialogue test set, the numeric improvement cannot be trusted as load-bearing evidence.

    Authors: We agree that the reliability of the automatic scorer is central to the reported results and that the original manuscript provided insufficient implementation details. In the revised version, we have substantially expanded the evaluation section to describe the scorer's rule-based logic, including exact handling of API failures (e.g., missing parameters, incorrect types, or non-existent calls), edge cases (partial matches, multiple valid sequences), and scoring rubrics for planning, retrieval, and calling. We have also added a human validation study: three independent annotators scored a random sample of 100 dialogues from the 314-dialogue test set, yielding Cohen's kappa of 0.82 between human judgments and the automatic scorer. These additions directly address the concern and allow readers to assess the metric's trustworthiness. revision: yes

  2. Referee: [Dataset] Dataset construction (abstract): the 73 chosen APIs and 314 annotated dialogues are presented as representative of realistic tool-use scenarios, yet no external corroboration, coverage analysis, or comparison to real-world distributions is supplied. This assumption directly affects the generalizability of both the benchmark results and the Lynx training gains.

    Authors: We acknowledge that the original submission did not include an explicit coverage analysis or comparison against external real-world distributions. The 73 APIs were chosen from popular public repositories to span diverse functional categories (e.g., weather, finance, productivity), and the 314 dialogues were authored to reflect typical multi-turn tool-use patterns. In the revised manuscript, we have added a dedicated subsection under dataset construction that provides a category-level breakdown of the APIs, compares their distribution to those appearing in public API directories and prior tool-use studies, and explicitly discusses limitations in representativeness. We also note that the training set (1,888 dialogues from 2,138 APIs) was constructed with broader coverage to mitigate some of these concerns for the fine-tuning experiments. revision: yes
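
The validation described in the first response, human judgments scored against the automatic evaluator on a sample of dialogues, would typically be summarized as Cohen's kappa. A minimal sketch of that computation follows; the labels are invented, and the 0.82 agreement figure belongs to the simulated rebuttal rather than the published paper.

```python
from typing import Sequence

# Sketch of the agreement check described in the first response: Cohen's kappa
# between binary human pass/fail judgments and the automatic scorer's verdicts
# on the same sampled dialogues. The labels below are invented for illustration.

def cohens_kappa(human: Sequence[int], scorer: Sequence[int]) -> float:
    """Cohen's kappa for two binary raters (1 = successful tool use, 0 = failure)."""
    assert len(human) == len(scorer) and len(human) > 0
    n = len(human)
    observed = sum(h == s for h, s in zip(human, scorer)) / n
    p_human, p_scorer = sum(human) / n, sum(scorer) / n
    expected = p_human * p_scorer + (1 - p_human) * (1 - p_scorer)
    if expected == 1.0:
        return 1.0  # degenerate case: chance agreement is already perfect
    return (observed - expected) / (1 - expected)

# Toy example with 10 dialogues; the rebuttal's validation samples 100.
human_labels  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
scorer_labels = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(round(cohens_kappa(human_labels, scorer_labels), 2))  # 0.74 on this toy sample
```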

Circularity Check

0 steps flagged

No circularity: empirical benchmark and fine-tuning on separate train/test splits

full rationale

The paper constructs API-Bank by annotating 314 evaluation dialogues (with 753 API calls) and a disjoint training set of 1,888 dialogues drawn from 2,138 APIs. Lynx is fine-tuned on the training split and evaluated on the held-out 314-dialogue set using an automatic scorer. All numeric claims (e.g., Lynx > Alpaca by >26 pts) are direct empirical measurements on this split; no equations, predictions, or uniqueness claims reduce to fitted parameters or self-citations by construction. The work contains no derivations, ansatzes, or load-bearing self-references that would trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that tool use can be meaningfully decomposed into planning, retrieval, and calling steps that are testable via a fixed set of APIs.

axioms (1)
  • domain assumption: LLMs can be enhanced by utilizing external tools
    Opening sentence of the abstract frames the entire benchmark around this premise.

pith-pipeline@v0.9.0 · 5577 in / 1161 out tokens · 23612 ms · 2026-05-15T20:47:56.852530+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    We introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs... Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    We develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls...

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mind2Web: Towards a Generalist Agent for the Web

    cs.CL 2023-06 accept novelty 8.0

    Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.

  2. Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

    cs.CL 2026-04 unverdicted novelty 7.0

    Chat2Workflow benchmark shows that state-of-the-art LLMs often grasp high-level intent for visual workflow generation but fail to produce correct, stable, executable outputs, with an agentic framework delivering only ...

  3. GraSP: Graph-Structured Skill Compositions for LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, a...

  4. SAGE: A Service Agent Graph-guided Evaluation Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...

  5. MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security

    cs.CR 2026-04 conditional novelty 7.0

    MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.

  6. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  7. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.

  8. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  9. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  10. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  11. PARM: Pipeline-Adapted Reward Model

    cs.AI 2026-04 unverdicted novelty 6.0

    PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.

  12. English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

    cs.CL 2026-04 unverdicted novelty 6.0

    Systematic experiments demonstrate that multilingual coverage in LLM post-training improves results for all languages and tasks compared to English-only, with low-resource languages gaining most and zero-shot transfer...

  13. Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking

    cs.IR 2026-03 unverdicted novelty 6.0

    A new benchmarking study finds moderate but domain-dependent divergence in how LLMs retrieve and rank APIs, with higher disagreement on open-ended tasks.

  14. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  15. ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

    cs.CL 2023-06 conditional novelty 6.0

    ToolAlpaca trains 7B and 13B models on 3938 simulated tool-use cases to reach generalized tool-use performance comparable to GPT-3.5 on unseen APIs.

  16. Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

    cs.AI 2023-05 conditional novelty 6.0

    GITM uses LLMs to generate action plans from text knowledge and memory, enabling agents to complete long-horizon Minecraft tasks at much higher success rates than prior RL methods.

  17. Trajectory Supervision for Continual Tool-Use Learning in LLMs

    cs.SE 2026-05 conditional novelty 5.0

    Retaining tool-use trajectories during sequential fine-tuning on API domains improves next-call prediction accuracy by 17.7 points over stripped-history training.

  18. Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

    cs.CL 2026-04 unverdicted novelty 4.0

    A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.

  19. A Periodic Space of Distributed Computing: Vision & Framework

    cs.DC 2026-04 unverdicted novelty 4.0

    A periodic framework is proposed to characterize, compare, and predict behaviors across distributed computing solutions by mapping system properties in a structured space inspired by the chemical periodic table.

  20. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 20 Pith papers · 9 internal anchors
