
Gorilla: Large Language Model Connected with Massive APIs

Joseph E. Gonzalez, Shishir G. Patil, Tianjun Zhang, Xin Wang · 2023 · cs.CL · arXiv 2305.15334

Canonical reference. 89% of citing Pith papers cite this work as background.

109 Pith papers citing it

Background 89% of classified citations


abstract

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu
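The retriever-plus-model setup described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three documentation entries, the function names, the token-overlap scoring, and the exact wording of the prompt suffix are hypothetical and are not taken from APIBench or from the Gorilla code. The sketch only shows why editing the documentation store at test time changes what the model is shown, without any retraining.

```python
import re

# Hypothetical documentation entries for illustration; not taken from APIBench.
API_DOCS = [
    {"api_call": "load_image_classifier(name)",
     "description": "classify images into labels"},
    {"api_call": "load_speech_recognizer(name)",
     "description": "transcribe speech audio into text"},
    {"api_call": "load_translator(src, tgt)",
     "description": "translate text between languages"},
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs):
    """Return the doc whose description shares the most tokens with the query."""
    query_tokens = tokenize(query)
    return max(docs, key=lambda d: len(query_tokens & tokenize(d["description"])))

def build_prompt(query, docs):
    """Append the retrieved documentation to the user request."""
    doc = retrieve(query, docs)
    return (f"{query}\n"
            f"Use this API documentation for reference: "
            f"{doc['api_call']}: {doc['description']}")

query = "I want to transcribe an audio recording of speech"
prompt = build_prompt(query, API_DOCS)

# Simulate a version change: only the documentation store is edited.
updated_docs = [dict(d) for d in API_DOCS]
updated_docs[1]["api_call"] = "load_speech_recognizer_v2(name, language)"
updated_prompt = build_prompt(query, updated_docs)

print(prompt)
print(updated_prompt)
```

A real system would likely swap the token-overlap scorer for a stronger retriever and pass the resulting prompt to the finetuned model; the point of the sketch is only that the prompt tracks the current documentation.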


citation-role summary

background: 27 · baseline: 1

citation-polarity summary

background: 25 · unclear: 2 · baseline: 1


authors

Joseph E. Gonzalez, Shishir G. Patil, Tianjun Zhang, Xin Wang


representative citing papers

Revisable by Design: A Theory of Streaming LLM Agent Execution

cs.LG · 2026-04-25 · unverdicted · novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

cs.CR · 2026-04-09 · unverdicted · novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

cs.SE · 2024-03-25 · conditional · novelty 8.0

RepairAgent autonomously repairs 164 bugs on Defects4J, including 39 not fixed by prior techniques, by treating an LLM as an agent that invokes tools via a finite state machine and dynamic prompts.

Mind2Web: Towards a Generalist Agent for the Web

cs.CL · 2023-06-09 · accept · novelty 8.0

Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

Entity Binding Failures in Tool-Augmented Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

The paper defines entity binding failures as a distinct error category in tool-augmented agents separate from tool selection errors and evaluates entity-aware mechanisms that eliminate such failures in a controlled diagnostic setting.

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

SEATauBench is the first agent benchmark for SEA languages, finding that performance holds for language-only changes but degrades sharply with full domain localization.

A hardware-safety-gated system for LLM-written native ARTIQ control code on a trapped-ion platform

quant-ph · 2026-06-25 · unverdicted · novelty 7.0

A token-based authorization system with simulation and human gates enables safe LLM-written ARTIQ control code execution on trapped-ion platforms while blocking unauthorized hardware access.

CFAgentBench: A Reproducible Environment and Benchmark for Autonomous Construction-Finance Agents

cs.AI · 2026-06-20 · unverdicted · novelty 7.0

CFAgentBench is a new reproducible benchmark for construction-finance AI agents featuring 35 mock apps, 1,014 tasks, and a money-movement guard, with initial tests showing pass^1 of 0.67 dropping to pass^5 of 0.38.

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

cs.MA · 2026-06-18 · unverdicted · novelty 7.0

SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.

Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

cs.LG · 2026-06-15 · accept · novelty 7.0

Formalizes four concurrency anomalies in multi-agent LLM systems and mechanically verifies a hierarchy of sound detectors and preventions realized in Rust runtimes using TLA+ and Verus.

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

cs.SE · 2026-06-03 · unverdicted · novelty 7.0

Structured recovery suggestions in self-reflective APIs increase AI agent success rates by 36-40pp on Anthropic models versus plain English errors, with 1.8-2.2x token efficiency gains, after leakage audit.

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.

Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

PowerCodeBench and a boundary-aware intervention raise LLM accuracy on power-system code generation by 32-56 points across ten open-weight models and four commercial APIs on a 2,000-task benchmark.

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

cs.SE · 2026-05-24 · unverdicted · novelty 7.0

Tool schema compression by 44-50% enables agentic RAG at 8K context where uncompressed schemas fail, with +20.5 pp exact match lift across models and scaling to over 800 tools.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.

To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

cs.LG · 2026-05-16 · conditional · novelty 7.0

LLM agents have an intrinsic over-calling bias diagnosed via SAE activation margins and corrected by adaptive margin-calibrated steering, improving overall decision accuracy.

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

LQM-ContextRoute routes LLM tool calls via latency-quality matching in a contextual bandit, improving F1 by 2.18 pp, accuracy by up to 18 pp, and NDCG by 2.91-3.22 pp over SW-UCB on web-search, StrategyQA, and retriever benchmarks.

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

cs.IR · 2026-05-11 · unverdicted · novelty 7.0

RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.

Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.

citing papers explorer

Showing 25 of 25 citing papers after filters.

Mind2Web: Towards a Generalist Agent for the Web
cs.CL · 2023-06-09 · accept · none · ref 26 · internal anchor
Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
cs.CL · 2023-04-14 · conditional · none · ref 11 · internal anchor
API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages
cs.CL · 2026-06-27 · unverdicted · none · ref 11 · internal anchor
SEATauBench is the first agent benchmark for SEA languages, finding that performance holds for language-only changes but degrades sharply with full domain localization.

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers
cs.CL · 2026-05-30 · unverdicted · none · ref 3 · internal anchor
Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions
cs.CL · 2026-05-22 · unverdicted · none · ref 63 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
cs.CL · 2026-04-28 · accept · none · ref 25 · internal anchor
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

GraSP: Graph-Structured Skill Compositions for LLM Agents
cs.CL · 2026-04-20 · unverdicted · none · ref 3 · internal anchor
GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, and InterCode.

GAIA: a benchmark for General AI Assistants
cs.CL · 2023-11-21 · unverdicted · none · ref 61 · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

PhoneBuddy: Training Open Models for Agentic Phone Use
cs.CL · 2026-06-22 · unverdicted · none · ref 37 · internal anchor
PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents
cs.CL · 2026-06-10 · unverdicted · none · ref 11 · internal anchor
Autopilot enforces verifiable termination via a gated FSM scheduler and hard floor, proving that termination implies goal achievement under gate soundness, floor enforcement, and plan coverage, while cutting fabrication rates to 0.95% vs. 8-25% in baselines on 3150 paired cells including SWE-bench L

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments
cs.CL · 2026-06-02 · unverdicted · none · ref 2 · internal anchor
PROVE trains LLMs on multi-step tool calls using 20 live MCP servers with 343 tools, state-grounded synthesis, and adaptive efficiency rewards, delivering gains of up to 10.2 points on BFCL Multi-Turn and similar on other benchmarks.

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents
cs.CL · 2026-06-01 · unverdicted · none · ref 28 · internal anchor
WRIT is a synthesis pipeline that generates write-read intensive trajectories along axes of write-decision count and per-decision evidence burden, enabling a 4B model to outperform GPT-5.1 on τ²-bench with reduced inference tokens.

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
cs.CL · 2026-05-31 · unverdicted · none · ref 54 · internal anchor
SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.

The Scaling Laws of Skills in LLM Agent Systems
cs.CL · 2026-05-15 · unverdicted · none · ref 6 · internal anchor
Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.

Tool Calling is Linearly Readable and Steerable in Language Models
cs.CL · 2026-05-08 · unverdicted · none · ref 64 · internal anchor
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
cs.CL · 2026-04-27 · unverdicted · none · ref 17 · internal anchor
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL · 2025-11-25 · unverdicted · none · ref 34 · internal anchor
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.

Learning to Ask: When LLM Agents Meet Unclear Instruction
cs.CL · 2024-08-31 · unverdicted · none · ref 12 · internal anchor
Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization
cs.CL · 2026-06-29 · unverdicted · none · ref 170 · internal anchor
A single LLM rewrite of skill descriptions using false positive and negative cases matches manual optimization performance in production, with most other pipeline components adding little value.

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision
cs.CL · 2026-06-15 · unverdicted · none · ref 35 · internal anchor
MemSlides introduces a three-part memory hierarchy (user profile, working, tool) with scoped local revision for multi-turn personalized slide generation.

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
cs.CL · 2026-05-14 · unverdicted · none · ref 18 · internal anchor
Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
cs.CL · 2026-05-08 · conditional · none · ref 34 · 2 links · internal anchor
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning
cs.CL · 2026-05-18 · unverdicted · none · ref 10 · internal anchor
QLoRA fine-tuning on tool-use data enables 4B-parameter models to perform structured planning without tool catalogs in prompts, outperforming informed baselines on AssetOpsBench while reducing input length by 82.6%.

Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models
cs.CL · 2026-04-22 · unverdicted · none · ref 44 · internal anchor
A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.

A Comprehensive Overview of Large Language Models
cs.CL · 2023-07-12 · unverdicted · none · ref 221 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
