Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez · 2023 · cs.CL · arXiv 2305.15334

46 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Revisable by Design: A Theory of Streaming LLM Agent Execution

cs.LG · 2026-04-25 · unverdicted · novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

cs.CR · 2026-04-09 · unverdicted · novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.

RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

cs.MA · 2026-05-05 · unverdicted · novelty 7.0

MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

cs.GR · 2026-04-28 · unverdicted · novelty 7.0

Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

GraSP: Graph-Structured Skill Compositions for LLM Agents

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, and InterCode.

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

cs.CR · 2026-04-05 · unverdicted · novelty 7.0

The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

cs.CL · 2025-11-25 · unverdicted · novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

cs.CR · 2024-10-03 · unverdicted · novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

A single configuration file generates causally coherent synthetic MES data across domains and guarantees zero tool-parameter hallucination when AI tools are ontology-constrained.

EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.

Tool Calling is Linearly Readable and Steerable in Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.

citing papers explorer

Showing 46 of 46 citing papers.

Revisable by Design: A Theory of Streaming LLM Agent Execution cs.LG · 2026-04-25 · unverdicted · none · ref 6 · internal anchor
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain cs.CR · 2026-04-09 · unverdicted · none · ref 38 · internal anchor
Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 2 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 43 · internal anchor
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems cs.AI · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.
RewardHarness: Self-Evolving Agentic Post-Training cs.AI · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents cs.MA · 2026-05-05 · unverdicted · none · ref 34 · internal anchor
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments cs.SE · 2026-05-04 · unverdicted · none · ref 18 · internal anchor
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models cs.CL · 2026-04-28 · accept · none · ref 25 · internal anchor
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation cs.GR · 2026-04-28 · unverdicted · none · ref 23 · internal anchor
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 3 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models cs.CR · 2026-04-22 · unverdicted · none · ref 15 · internal anchor
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation cs.CV · 2026-04-21 · unverdicted · none · ref 25 · internal anchor
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
GraSP: Graph-Structured Skill Compositions for LLM Agents cs.CL · 2026-04-20 · unverdicted · none · ref 3 · internal anchor
GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, and InterCode.
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis cs.LG · 2026-04-16 · unverdicted · none · ref 12 · internal anchor
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents cs.CR · 2026-04-05 · unverdicted · none · ref 22 · internal anchor
The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory cs.CL · 2025-11-25 · unverdicted · none · ref 34 · internal anchor
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents cs.CR · 2024-10-03 · unverdicted · none · ref 128 · internal anchor
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 61 · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 26 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation cs.AI · 2026-05-11 · unverdicted · none · ref 9 · internal anchor
A single configuration file generates causally coherent synthetic MES data across domains and guarantees zero tool-parameter hallucination when AI tools are ontology-constrained.
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems cs.AI · 2026-05-09 · unverdicted · none · ref 21 · internal anchor
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.
Tool Calling is Linearly Readable and Steerable in Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 64 · internal anchor
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation cs.AI · 2026-05-08 · unverdicted · none · ref 15 · internal anchor
EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis cs.CR · 2026-05-01 · unverdicted · none · ref 33 · internal anchor
Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows cs.SE · 2026-04-30 · unverdicted · none · ref 31 · internal anchor
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills cs.CL · 2026-04-27 · unverdicted · none · ref 17 · internal anchor
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
Time Series Augmented Generation for Financial Applications cs.AI · 2026-04-21 · unverdicted · none · ref 3 · internal anchor
TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.
When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis cs.AI · 2026-04-17 · unverdicted · none · ref 21 · internal anchor
LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
Auditable Agents cs.AI · 2026-04-07 · unverdicted · none · ref 14 · internal anchor
No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms to detect, enforce, and recover.
From Data to Theory: Autonomous Large Language Model Agents for Materials Science cs.AI · 2026-04-01 · unverdicted · none · ref 23 · internal anchor
An LLM agent autonomously selects, codes, and validates materials equations from data, recovering known laws reliably but requiring checks for new or specialized cases.
ToolRL: Reward is All Tool Learning Needs cs.LG · 2025-04-16 · conditional · none · ref 22 · internal anchor
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents cs.LG · 2024-10-11 · accept · none · ref 20 · internal anchor
AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 292 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 38 · internal anchor
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay cs.AI · 2026-05-14 · unverdicted · none · ref 9 · internal anchor
The LOOP Skill Engine records one LLM-powered run of a periodic task and converts it into a deterministic replay template that eliminates further LLM usage while maintaining high success rates.
The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems cs.AI · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
Ontology-grounded tool architectures eliminate hallucination of domain identifiers in industrial AI agents by enforcing semantic constraints through a typed relational configuration and three-operation interface.
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability cs.AI · 2026-05-11 · unverdicted · none · ref 16 · internal anchor
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution cs.SE · 2026-04-16 · conditional · none · ref 4 · internal anchor
Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.
LLM-Rosetta: A Hub-and-Spoke Intermediate Representation for Cross-Provider LLM API Translation cs.SE · 2026-04-10 · unverdicted · none · ref 7 · internal anchor
A hub-and-spoke IR with a 9-type content model and 10-type stream schema enables bidirectional, lossless translation between major LLM APIs with sub-100 microsecond overhead.
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering cs.SE · 2026-04-09 · accept · none · ref 117 · internal anchor
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration cs.AI · 2026-05-05 · unverdicted · none · ref 14 · 2 links · internal anchor
Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.
Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work cs.AI · 2026-04-26 · unverdicted · none · ref 80 · internal anchor
Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.
Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models cs.CL · 2026-04-22 · unverdicted · none · ref 44 · internal anchor
A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.
Empirical Comparison of Agent Communication Protocols for Task Orchestration cs.AI · 2026-03-24 · unverdicted · none · ref 29 · 2 links · internal anchor
This work provides an empirical comparison of tool integration, multi-agent delegation, and hybrid architectures for LLM task orchestration, measuring response time, context consumption, cost, error recovery, and implementation complexity.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs cs.CL · 2026-05-08 · unverdicted · none · ref 34 · internal anchor
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
