hub Canonical reference

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao · 2023 · cs.CL · arXiv 2306.05301

Canonical reference. 80% of citing Pith papers cite this work as background.

33 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 33 citing papers arXiv PDF

abstract

Enabling large language models to utilize real-world tools effectively is crucial for achieving embodied intelligence. Existing approaches to tool learning have either primarily relied on extremely large language models, such as GPT-4, to attain generalized tool-use abilities in a zero-shot manner, or utilized supervised learning to train limited scopes of tools on compact models. However, it remains uncertain whether smaller language models can achieve generalized tool-use abilities without tool-specific training. To address this question, this paper introduces ToolAlpaca, a novel framework designed to automatically generate a diverse tool-use corpus and learn generalized tool-use abilities on compact language models with minimal human intervention. Specifically, ToolAlpaca first automatically creates a highly diversified tool-use corpus by building a multi-agent simulation environment. The corpus contains 3938 tool-use instances from more than 400 real-world tool APIs spanning 50 distinct categories. Subsequently, the constructed corpus is employed to fine-tune compact language models, resulting in two models, namely ToolAlpaca-7B and ToolAlpaca-13B, respectively. Finally, we evaluate the ability of these models to utilize previously unseen tools without specific training. Experimental results demonstrate that ToolAlpaca achieves effective generalized tool-use capabilities comparable to those of extremely large language models like GPT-3.5, demonstrating that learning generalized tool-use ability is feasible for compact language models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 dataset 1

citation-polarity summary

background 4 use dataset 1

representative citing papers

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

Revisable by Design: A Theory of Streaming LLM Agent Execution

cs.LG · 2026-04-25 · unverdicted · novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

cs.LG · 2026-05-12 · conditional · novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

Evaluating Tool Cloning in Agentic-AI Ecosystems

cs.SE · 2026-05-10 · conditional · novelty 7.0 · 2 refs

Tool cloning is pervasive in agentic AI ecosystems, with 60% of high-Jaccard and 85% of high-ssdeep MCP repository pairs manually verified as true clones.

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 3 refs

AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning benchmarks under strong/weak initialization.

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

cs.AI · 2025-10-16 · unverdicted · novelty 7.0

ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.

OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

cs.SE · 2026-05-17 · unverdicted · novelty 6.0

FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonnet 4.6 on tool-calling benchmarks.

The Scaling Laws of Skills in LLM Agent Systems

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

cs.CL · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.

Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.

Querying Structured Data Through Natural Language Using Language Models

cs.CL · 2026-04-03 · conditional · novelty 6.0

Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

cs.AI · 2026-04-02 · unverdicted · novelty 6.0 · 2 refs

ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.

Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

cs.LG · 2026-03-25 · unverdicted · novelty 6.0

A constrained-synthesis RL method with graduated rewards for atomic validity and orchestration consistency improves LLM turn accuracy on multi-step tool benchmarks and transfers to new API sets.

Benchmarking LLM Tool-Use in the Wild

cs.HC · 2026-02-13 · unverdicted · novelty 6.0

WildToolBench shows no LLM exceeds 15 percent accuracy on tool-use tasks that reflect real user behaviors like compositional orchestration, implicit intents across turns, and mixed instructions.

AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

cs.AI · 2026-02-05 · unverdicted · novelty 6.0

AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.

Memory in the Age of AI Agents

cs.CL · 2025-12-15 · unverdicted · novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

Learning to Ask: When LLM Agents Meet Unclear Instruction

cs.CL · 2024-08-31 · unverdicted · novelty 6.0

Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.

Cognitive Architectures for Language Agents

cs.AI · 2023-09-05 · accept · novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.

citing papers explorer

Showing 33 of 33 citing papers.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values cs.AI · 2026-05-11 · unverdicted · none · ref 52 · internal anchor
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
Revisable by Design: A Theory of Streaming LLM Agent Execution cs.LG · 2026-04-25 · unverdicted · none · ref 8 · internal anchor
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 54 · internal anchor
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs cs.CL · 2023-04-14 · conditional · none · ref 16 · internal anchor
API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation cs.LG · 2026-05-12 · conditional · none · ref 24 · internal anchor
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Evaluating Tool Cloning in Agentic-AI Ecosystems cs.SE · 2026-05-10 · conditional · none · ref 66 · 2 links · internal anchor
Tool cloning is pervasive in agentic AI ecosystems, with 60% of high-Jaccard and 85% of high-ssdeep MCP repository pairs manually verified as true clones.
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level cs.LG · 2026-05-07 · unverdicted · none · ref 31 · 3 links · internal anchor
AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning benchmarks under strong/weak initialization.
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning cs.AI · 2026-04-10 · unverdicted · none · ref 22 · internal anchor
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling cs.AI · 2025-10-16 · unverdicted · none · ref 37 · internal anchor
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations cs.CL · 2026-05-21 · unverdicted · none · ref 32 · internal anchor
SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles cs.LG · 2026-05-21 · unverdicted · none · ref 49 · internal anchor
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences cs.CR · 2026-05-18 · unverdicted · none · ref 31 · internal anchor
OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs cs.SE · 2026-05-17 · unverdicted · none · ref 13 · internal anchor
FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonnet 4.6 on tool-calling benchmarks.
The Scaling Laws of Skills in LLM Agent Systems cs.CL · 2026-05-15 · unverdicted · none · ref 15 · internal anchor
Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 29 · 2 links · internal anchor
UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.
Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model cs.LG · 2026-04-20 · unverdicted · none · ref 5 · internal anchor
TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.
Querying Structured Data Through Natural Language Using Language Models cs.CL · 2026-04-03 · conditional · none · ref 16 · internal anchor
Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis cs.AI · 2026-04-02 · unverdicted · none · ref 30 · 2 links · internal anchor
ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.
Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards cs.LG · 2026-03-25 · unverdicted · none · ref 9 · internal anchor
A constrained-synthesis RL method with graduated rewards for atomic validity and orchestration consistency improves LLM turn accuracy on multi-step tool benchmarks and transfers to new API sets.
Benchmarking LLM Tool-Use in the Wild cs.HC · 2026-02-13 · unverdicted · none · ref 2 · internal anchor
WildToolBench shows no LLM exceeds 15 percent accuracy on tool-use tasks that reflect real user behaviors like compositional orchestration, implicit intents across turns, and mixed instructions.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction cs.AI · 2026-02-05 · unverdicted · none · ref 19 · internal anchor
AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
Memory in the Age of AI Agents cs.CL · 2025-12-15 · unverdicted · none · ref 264 · internal anchor
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
Learning to Ask: When LLM Agents Meet Unclear Instruction cs.CL · 2024-08-31 · unverdicted · none · ref 25 · internal anchor
Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.
Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 78 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application cs.CL · 2026-06-10 · unverdicted · none · ref 275 · internal anchor
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments cs.CL · 2025-08-12 · unverdicted · none · ref 37 · internal anchor
An automated environment construction pipeline plus verifiable rewards enables RL training that improves LLM tool-use performance across scales without harming general capabilities.
A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions cs.AI · 2025-01-27 · unverdicted · none · ref 152 · internal anchor
A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 205 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 88 · internal anchor
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA cs.IR · 2026-04-07 · unverdicted · none · ref 28 · internal anchor
A pipeline of dataset construction from prior work, AugFC parameter augmentation, and two-step LLM training improves function calling for financial APIs and is running in production.
Bridging Language Models and Financial Analysis q-fin.ST · 2025-03-14 · unverdicted · none · ref 97 · internal anchor
A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures cs.LG · 2026-05-12 · unreviewed · ref 76 · internal anchor
TInR: Exploring Tool-Internalized Reasoning in Large Language Models cs.CL · 2026-04-12 · unreviewed · ref 3 · internal anchor

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer