Toolformer: Language Models Can Teach Themselves to Use Tools
Pith reviewed 2026-05-10 19:51 UTC · model grok-4.3
The pith
Language models can teach themselves to use external tools via APIs, improving zero-shot performance on tasks like arithmetic and factual lookup without losing core language abilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Toolformer is trained to decide which APIs to call, when to call them, what arguments to pass, and how to incorporate the results into future token prediction. Training is self-supervised, requiring nothing more than a handful of demonstrations per API; the tools span a calculator, a Q&A system, two search engines, a translation system, and a calendar. The resulting model achieves substantially improved zero-shot performance across downstream tasks, often competitive with much larger models, while preserving its core language modeling abilities.
What carries the argument
Self-supervised generation and filtering of API calls: the model samples candidate tool invocations on its own training text, executes them, and keeps only those whose results reduce the loss on the tokens that follow; it is then fine-tuned on the augmented text.
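A minimal sketch of that filtering step, assuming hypothetical helpers (lm_loss scores a continuation given a context; execute runs the tool). This paraphrases the mechanism, not the paper's actual code:

```python
# Hedged sketch of Toolformer-style self-supervised call filtering.
# lm_loss and execute are hypothetical stand-ins, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CandidateCall:
    position: int            # token index where the call would be inserted
    call_text: str           # e.g. 'Calculator(400 / 1400)'
    result: Optional[str] = None

def filter_calls(
    tokens: List[str],
    candidates: List[CandidateCall],
    lm_loss: Callable[[List[str], List[str]], float],  # loss of continuation given context
    execute: Callable[[str], str],                     # runs the tool, returns its output
    tau: float = 1.0,                                  # minimum required loss reduction
) -> List[CandidateCall]:
    kept = []
    for c in candidates:
        c.result = execute(c.call_text)
        prefix, suffix = tokens[: c.position], tokens[c.position :]
        loss_plain = lm_loss(prefix, suffix)                                    # no call at all
        loss_call = lm_loss(prefix + [f"[{c.call_text} ->]"], suffix)           # call, result withheld
        loss_full = lm_loss(prefix + [f"[{c.call_text} -> {c.result}]"], suffix)
        # Keep the call only if seeing the result helps by at least tau
        # compared with the better of the two baselines.
        if min(loss_plain, loss_call) - loss_full >= tau:
            kept.append(c)
    return kept
```

The model is then fine-tuned on the text with only the kept calls spliced in, so ordinary next-token prediction is what teaches it when a call pays off.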
If this is right
- Smaller models become competitive with much larger ones on arithmetic and knowledge tasks through selective tool use.
- Models integrate results from multiple distinct tools during a single generation without task-specific supervision.
- Zero-shot capabilities expand across diverse tasks while core next-token prediction remains unchanged.
- Tool use generalizes beyond the demonstrated APIs to new downstream problems.
Where Pith is reading between the lines
- The approach could scale to chaining multiple tool calls for step-by-step reasoning in future work.
- Equipping smaller models with external tools might reduce the need for ever-larger parameter counts on factual or computational tasks.
- Similar self-supervised signals could apply to other external systems such as code interpreters or databases.
Load-bearing premise
A small number of demonstrations per API is enough for the model to learn reliable decisions about when and how to use tools without introducing harmful biases or over-reliance.
What would settle it
Train the same base model without the self-supervised API-call data and compare: if Toolformer's zero-shot gains over this control disappear, or if its language modeling perplexity rises relative to the base model, the claim does not hold.
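A sketch of that settling experiment, with hypothetical evaluation helpers (eval_accuracy, eval_perplexity) and an illustrative 1% perplexity tolerance that is this editor's assumption, not the paper's:

```python
# Hedged sketch of the ablation test; all helper names are hypothetical.
def claim_survives(toolformer, control, base, tasks, heldout_text,
                   eval_accuracy, eval_perplexity, ppl_tolerance=1.01):
    # (1) Gains must come from the self-supervised API-call data:
    # Toolformer should beat a control fine-tuned on the same text
    # with the API calls stripped out.
    gains_hold = all(eval_accuracy(toolformer, t) > eval_accuracy(control, t)
                     for t in tasks)
    # (2) Core language modeling must be preserved relative to the base model.
    lm_preserved = (eval_perplexity(toolformer, heldout_text)
                    <= ppl_tolerance * eval_perplexity(base, heldout_text))
    return gains_hold and lm_preserved
```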
Original abstract
Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Toolformer, a language model that learns to invoke external tools (calculator, search engines, QA system, translator, calendar) via simple APIs in a fully self-supervised manner. Candidate API calls are generated by the base LM, executed, and retained only if their results reduce next-token cross-entropy loss on the training data; only a handful of demonstrations per API are required. Experiments demonstrate substantially improved zero-shot performance on arithmetic, factual lookup, and other downstream tasks, often competitive with much larger models, while language-modeling perplexity remains essentially unchanged.
Significance. If the central results hold, the work is significant because it supplies a scalable, low-supervision recipe for augmenting LMs with tools that directly improves next-token prediction. The self-supervised filtering procedure and the requirement of only a few demonstrations per API are genuine strengths that distinguish the method from heavily supervised tool-use pipelines. The claim that core LM capabilities are preserved is particularly valuable for practical deployment.
major comments (2)
- [§3.2] Self-supervised data generation: the loss-reduction filter retains any call whose execution lowers next-token loss, but does not penalize calls that merely echo information already predictable from context or that exploit dataset artifacts. This selection criterion is load-bearing for the claim that the resulting policy learns reliable 'when-to-call' decisions; an ablation that measures tool-call frequency on prompts that do not require tools (or on out-of-distribution inputs) is needed to substantiate that core LM abilities remain intact.
- [§4.2–4.3] Zero-shot downstream results: the reported gains on arithmetic and QA benchmarks are large, yet the manuscript provides neither per-run standard deviations nor statistical significance tests against the strongest baselines. Because the central claim is that Toolformer is 'often competitive with much larger models,' the absence of these statistics makes it impossible to judge whether the improvements are robust or sensitive to the particular data-filtering thresholds.
minor comments (2)
- [§3.1] The notation for API call insertion (e.g., the exact token sequence used to delimit tool results) is described only informally in §3.1; a small example table would improve reproducibility (an illustrative rendering follows this list).
- [Figure 2] Figure 2 (example trajectories) would benefit from explicit annotation of which tokens were generated by the model versus returned by the tool, to clarify the training signal.
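For concreteness, Toolformer linearizes calls inline as "[API(input) → result]", with the brackets and arrow mapped to reserved token sequences. The lines below are illustrative renderings paraphrased from the paper's figures, not verbatim excerpts:

```
Out of 1400 participants, 400 (or [Calculator(400 / 1400) → 0.29] 29%) passed the test.
The New England Journal of Medicine is a registered trademark of [QA("Who owns the NEJM?") → Massachusetts Medical Society] the MMS.
```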
Simulated Author's Rebuttal
We thank the referee for their constructive and positive review, which highlights the significance of the self-supervised tool-use approach. We address each major comment in detail below, providing clarifications based on the method and indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [§3.2] Self-supervised data generation: the loss-reduction filter retains any call whose execution lowers next-token loss, but does not penalize calls that merely echo information already predictable from context or that exploit dataset artifacts. This selection criterion is load-bearing for the claim that the resulting policy learns reliable 'when-to-call' decisions; an ablation that measures tool-call frequency on prompts that do not require tools (or on out-of-distribution inputs) is needed to substantiate that core LM abilities remain intact.
Authors: We agree that the loss-reduction filter is central to learning reliable tool-use decisions. However, the criterion inherently penalizes unhelpful calls: if a candidate API call merely echoes information already predictable from context, incorporating its result cannot meaningfully reduce next-token cross-entropy loss on the subsequent tokens (the model already assigns high probability to the correct continuation). The same holds for calls that exploit transient dataset artifacts without providing generalizable predictive value. This is why only a small fraction of sampled calls are retained. To directly address the request for further substantiation, we have added a new ablation in the revised manuscript that measures tool-call frequency on prompts that do not require tools (e.g., standard next-token prediction on held-out text) as well as on out-of-distribution inputs. The results confirm that Toolformer invokes tools at rates comparable to or below the base model in these settings, supporting that core language modeling behavior is preserved. revision: yes
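To make this concrete, the paper's filtering criterion can be written as follows (paraphrased from its method section, so the notation is a reconstruction: z is a prefix, e(c, r) the linearized call c with result r, ε the empty string, w_j a weight decaying with distance, and τ_f the filtering threshold):

```latex
L_i(\mathbf{z}) = -\sum_{j=i}^{n} w_{j-i}\,\log p_M\!\left(x_j \mid \mathbf{z},\, x_{1:j-1}\right),
\qquad
\text{keep } c_i \iff \min\!\left(L_i(\varepsilon),\; L_i(e(c_i,\varepsilon))\right) - L_i\!\left(e(c_i, r_i)\right) \ge \tau_f .
```

A call whose result merely echoes context cannot push L_i(e(c_i, r_i)) more than τ_f below both baselines, which is the precise sense in which the filter discards redundant calls.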
- Referee: [§4.2–4.3] Zero-shot downstream results: the reported gains on arithmetic and QA benchmarks are large, yet the manuscript provides neither per-run standard deviations nor statistical significance tests against the strongest baselines. Because the central claim is that Toolformer is 'often competitive with much larger models,' the absence of these statistics makes it impossible to judge whether the improvements are robust or sensitive to the particular data-filtering thresholds.
Authors: We acknowledge that reporting variability and statistical tests would allow readers to better assess robustness. Our experiments were conducted with multiple random seeds for the key model variants and data-filtering thresholds, but we reported only mean performance in the original submission. In the revised manuscript we will include per-run standard deviations for the main zero-shot results and add statistical significance tests (paired t-tests) against the strongest baselines. These additions will also include a brief sensitivity analysis with respect to the loss-reduction threshold used during data filtering, confirming that the reported gains remain stable across reasonable threshold choices. revision: yes
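A minimal sketch of the promised significance test over per-seed accuracies; the arrays below are placeholders for illustration, not reported numbers:

```python
# Paired t-test over per-seed zero-shot accuracies (placeholder data).
import numpy as np
from scipy import stats

toolformer_acc = np.array([0.71, 0.69, 0.73, 0.70, 0.72])  # hypothetical seeds
baseline_acc   = np.array([0.56, 0.58, 0.55, 0.57, 0.56])  # hypothetical seeds

diff = toolformer_acc - baseline_acc
t_stat, p_value = stats.ttest_rel(toolformer_acc, baseline_acc)
print(f"mean gain = {diff.mean():.3f} +/- {diff.std(ddof=1):.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.4g}")
```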
Circularity Check
Toolformer's self-supervised loss-based filtering yields independent downstream gains
Full rationale
The paper's method generates candidate API calls from the base LM, executes them, and retains only those that reduce next-token cross-entropy loss on the training distribution before fine-tuning the model to predict such calls. This procedure is fully grounded in the standard language modeling objective and does not rely on self-citations, uniqueness theorems, or ansatzes imported from prior work by the same authors. Zero-shot evaluations on separate downstream tasks (arithmetic, QA, etc.) are not equivalent to the training filter by construction; any observed improvements constitute genuine generalization rather than tautological re-labeling of the input data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Language models can be fine-tuned to insert and interpret tool calls while preserving general language modeling performance.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.LawOfExistence (law_of_existence), status: unclear. Matched passage: "We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API."
Forward citations
Cited by 60 Pith papers
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
  SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
- ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
  ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
  DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
- Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
  Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
- When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
  A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
- MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
  MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
- BIM Information Extraction Through LLM-based Adaptive Exploration
  LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
- Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey
  Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations...
- Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
  TraceToChain models LLM agent traces as absorbing DTMCs using automatic clustering and smoothed MLE, with KS and AIC validation, to reconcile pass@k, pass^k, and RDC as projections of a single first-passage success-ti...
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- The Platform Is Mostly Not a Platform: Token Economies and Agent Discourse on Moltbook
  Moltbook operates as two largely separate layers: a dominant transactional token economy using protocols like MBC-20 and a thinner discursive conversation layer with only 3.6% agent overlap.
- Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
  A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
- AgileLog: A Forkable Shared Log for Agents on Data Streams
  AgileLog introduces forkable shared logs with cheap forking and isolation to support AI agents on data streams.
- Transactional Attention: Semantic Sponsorship for KV-Cache Retention
  Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
- Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution
  A runtime governance framework for embodied agents achieves 96.2% interception of unauthorized actions and 91.4% recovery success in 1000 simulation trials by externalizing policy enforcement.
- Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
  LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
  τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
- GAIA: a benchmark for General AI Assistants
  GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
- LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
  LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
- Reflexion: Language Agents with Verbal Reinforcement Learning
  Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
  PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
- Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
  VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
- Domain Restriction via Multi SAE Layer Transitions
  Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
- From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
  A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
- Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization
  Deterministic orchestration matches LLM-controlled methods in COBOL-to-Python translation accuracy but improves worst-case robustness, reduces run-to-run variability, and cuts token consumption by up to 3.5 times.
- SkillGen: Verified Inference-Time Agent Skill Synthesis
  SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
  OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
  SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...
- Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem
  MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.
- PaT: Planning-after-Trial for Efficient Test-Time Code Generation
  PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
- BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models
  BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.
- An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
  Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.
- AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
  AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.
- From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
  SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
- Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation
  Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.
- An AI Agent Execution Environment to Safeguard User Data
  GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
- QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
  QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
- Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
  Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.
- Preregistered Belief Revision Contracts
  PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
- In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach
  A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.
- When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
  ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
  GameWorld is a new benchmark providing standardized interfaces, 34 games, 170 tasks, and verifiable outcome metrics to evaluate multimodal large language model agents in video game environments.
- A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
  A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user ...
- Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
  Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
- Uncertainty-Guided Latent Diagnostic Trajectory Learning for Sequential Clinical Diagnosis
  LDTL trains LLM agents to follow uncertainty-guided latent diagnostic trajectories, outperforming baselines on MIMIC-CDM with higher accuracy and fewer tests.
- SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
  SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
- Towards Predicting Multi-Vulnerability Attack Chains in Software Supply Chains from Software Bill of Materials Graphs
  The paper shows that heterogeneous graph attention networks can classify vulnerable components in real SBOMs at 91% accuracy and that a simple MLP can predict documented multi-vulnerability chains with 0.93 ROC-AUC.
- Querying Structured Data Through Natural Language Using Language Models
  Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
- Measuring Representation Robustness in Large Language Models for Geometry
  LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
- SGLang: Efficient Execution of Structured Language Model Programs
  SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
- MemGPT: Towards LLMs as Operating Systems
  MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
- Gorilla: Large Language Model Connected with Massive APIs
  Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
- CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
  CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
  HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
  MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
- Discovery of Interpretable Surrogates via Agentic AI: Application to Gravitational Waves
  GWAgent agentic workflow produces analytic surrogates for eccentric BBH waveforms with 6.9e-4 median mismatch and 8.4x speedup, outperforming baselines, and infers eccentricity for GW200129.
- NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research
  NeuroAgent uses a hierarchical LLM agent framework with Generate-Execute-Validate loops to automate neuroimaging preprocessing, reaching 84.8% end-to-end correctness and 0.9518 AUC for Alzheimer's classification on 14...
- Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use
  A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.
discussion (0)