Recognition: no theorem link
Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems
Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3
The pith
Agent-First Tool APIs let AI agents complete enterprise tasks at an 88 percent end-to-end success rate by replacing human-oriented CRUD interfaces with semantic protocols.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Agent-First Tool API paradigm resolves five architectural mismatches between conventional CRUD APIs and autonomous agent requirements through three mechanisms: a Six-Verb Semantic Protocol that structures interactions as search, resolve, preview, execute, verify, and recover phases; a Normalized Tool Contract that supplies structured metadata such as confidence scores, evidence chains, and suggested next actions; and a dual-layer governance pipeline that combines static capability policies with dynamic risk escalation. Implemented across 85 registered tools in six business domains, the paradigm was tested on 50 real operational tasks and produced an 88 percent end-to-end success rate versus 64 percent for optimized CRUD baselines, along with a 72.7 percent reduction in required human interventions and a 5.8x improvement in autonomous error recovery.
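The paper does not give a concrete schema for the Normalized Tool Contract, but its stated fields (confidence scores, evidence chains, suggested next actions) suggest a response envelope an agent can branch on. A minimal sketch, with field names and the `needs_clarification` helper invented for illustration:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Normalized Tool Contract (NTC) response envelope.
# The paper specifies confidence scores, evidence chains, and suggested next
# actions but not an exact schema; every name below is an assumption.
@dataclass
class NTCResult:
    status: str                                            # e.g. "ok", "ambiguous", "failed"
    data: dict                                             # the tool's payload
    confidence: float                                      # 0.0-1.0 score for the result
    evidence: list[str] = field(default_factory=list)      # why the tool believes this
    next_actions: list[str] = field(default_factory=list)  # suggested verbs to try next

def needs_clarification(result: NTCResult, threshold: float = 0.8) -> bool:
    """An agent can branch on structured metadata instead of parsing prose errors."""
    return result.confidence < threshold or result.status == "ambiguous"
```

The point of the envelope is that the planning loop inspects `confidence` and `next_actions` directly, rather than inferring state from rendered text.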
What carries the argument
The Six-Verb Semantic Protocol, which decomposes every tool interaction into the ordered phases of search, resolve, preview, execute, verify, and recover so agents can plan and recover without human-style assumptions.
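The phase ordering above can be sketched as a plain control loop. The dict-of-callables shape, handler signatures, and the rule that `recover` runs only after a failed `verify` are assumptions for illustration; the paper defines the six verbs but not this exact control flow:

```python
# Illustrative ordering of the Six-Verb Semantic Protocol. Handler names and
# return shapes are assumed, not taken from the paper.
PHASES = ["search", "resolve", "preview", "execute", "verify", "recover"]

def run_task(tool, query):
    """Walk the phases in order; enter 'recover' only when 'verify' fails."""
    candidates = tool["search"](query)      # find candidate targets, no exact IDs needed
    target = tool["resolve"](candidates)    # pin down an exact identifier
    plan = tool["preview"](target)          # dry-run: show the intended effect
    outcome = tool["execute"](plan)         # perform the side effect
    if tool["verify"](outcome):             # confirm the effect actually took hold
        return outcome
    return tool["recover"](outcome)         # compensate or retry on failure
```

The search-then-resolve prefix is what removes the "exact-identifier dependence" mismatch: the agent never has to guess an ID up front.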
If this is right
- Agents complete substantially more end-to-end tasks without requiring human intervention.
- Autonomous error recovery improves by a factor of roughly six, lowering the frequency of failed runs.
- The semantic layer remains compatible with existing transport and discovery standards rather than replacing them.
- Structured decision metadata in tool contracts supports better agent planning and verification steps.
- The same interface pattern applies across multiple business domains within a single production environment.
Where Pith is reading between the lines
- Tool providers outside the tested platform may need to expose similar phased protocols and metadata if agents are to use them reliably at scale.
- The emphasis on preview and verify phases could reduce the risk of agents taking irreversible actions in live systems.
- Adoption might shift API design priorities from human usability toward explicit support for machine decision chains.
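The dual-layer governance pipeline named in the core claim can be pictured as two checks in sequence: a static capability policy deciding which verbs a tool may ever run, and dynamic risk escalation gating risky executes behind a prior preview. A minimal sketch, with policy contents, tool names, and the 0.5 risk threshold all invented:

```python
# Sketch of the dual-layer governance idea; nothing below is the paper's
# actual policy format. Layer 1 = static capability policy, layer 2 =
# dynamic risk escalation requiring a preview before high-risk execution.
ALL_VERBS = {"search", "resolve", "preview", "execute", "verify", "recover"}

STATIC_POLICY = {
    "crm.update_record": ALL_VERBS,             # full capability grant
    "billing.refund": ALL_VERBS - {"execute"},  # execute withheld statically
}

def authorize(tool: str, verb: str, risk: float, previewed: bool) -> bool:
    if verb not in STATIC_POLICY.get(tool, set()):    # layer 1: capability check
        return False
    if verb == "execute" and risk > 0.5 and not previewed:
        return False                                  # layer 2: escalate risky actions
    return True
```

Under this reading, the preview phase is not just decision support: it is the evidence the dynamic layer demands before permitting an irreversible action.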
Load-bearing premise
That performance gains measured on 50 tasks inside one multi-tenant SaaS platform will hold for enterprise AI agent work in other platforms and domains.
What would settle it
A controlled test applying the same Agent-First paradigm to an equivalent set of 50 operational tasks in a second, unrelated enterprise platform that shows no measurable rise in success rate or drop in human interventions.
Original abstract
As AI agents transition from research prototypes to enterprise production systems, the tool interfaces they consume remain rooted in human-oriented CRUD paradigms. This paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics. We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation. The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains. Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve 88% end-to-end task success rate versus 64% for optimized CRUD baselines (+37.5%), while reducing required human interventions by 72.7% and improving autonomous error recovery by 5.8x. We establish that the paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies five architectural mismatches between conventional CRUD-style APIs and the needs of autonomous AI agents in enterprise settings. It proposes the Agent-First Tool API paradigm, which integrates a Six-Verb Semantic Protocol (search, resolve, preview, execute, verify, recover), a Normalized Tool Contract (NTC) supplying structured metadata such as confidence scores and suggested actions, and a dual-layer governance pipeline. The paradigm is implemented in a production multi-tenant SaaS platform supporting 85 tools across 6 domains and evaluated on 50 real operational tasks, reporting an 88% end-to-end success rate (versus 64% for optimized CRUD baselines), a 72.7% reduction in human interventions, and a 5.8x improvement in autonomous error recovery. The work positions the approach as orthogonal to transport-layer standards such as MCP.
Significance. If the reported gains prove robust and generalizable, the paradigm could meaningfully improve tool-use reliability for production AI agents by shifting from human-oriented to agent-oriented interfaces. The production deployment on real tasks and tools provides practical grounding, and the explicit complementarity to existing protocols is a constructive contribution. However, the single-platform scope and limited methodological transparency constrain the strength of claims about broader enterprise applicability.
major comments (3)
- [Evaluation section] The evaluation reports results on 50 operational tasks but provides no characterization of task selection criteria, domain distribution, complexity metrics, or sampling method. This detail is load-bearing for the central claim of a +37.5% success-rate improvement, as the observed deltas could be sensitive to how the tasks were chosen.
- [Baseline Comparison subsection] The construction of the 'optimized CRUD baselines' is not described in sufficient detail (e.g., whether they received equivalent metadata, error-handling hooks, or agent tuning). Without this, it is impossible to determine whether the reported reductions in interventions and error-recovery gains are attributable to the Six-Verb protocol and NTC rather than differences in baseline implementation.
- [Results Analysis] No statistical tests, confidence intervals, variance measures, or controls for confounding factors (agent model, prompt strategy, or platform-specific features) accompany the headline metrics (88% vs 64%, 72.7% intervention reduction, 5.8x recovery). This absence weakens verification of the quantitative claims.
minor comments (2)
- [Abstract] The abstract asserts 'five fundamental architectural mismatches' without enumerating them; an explicit list in the introduction or a dedicated section would improve clarity.
- [Paradigm Description] A summary table or diagram early in the paper that contrasts the Six-Verb phases and NTC fields against conventional CRUD operations would aid reader comprehension of the proposed paradigm.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which identify important areas for improving methodological transparency. We address each major comment below, indicating where we will revise the manuscript to provide additional detail and where limitations inherent to the production deployment constrain further elaboration.
Point-by-point responses
-
Referee: [Evaluation section] The evaluation reports results on 50 operational tasks but provides no characterization of task selection criteria, domain distribution, complexity metrics, or sampling method. This detail is load-bearing for the central claim of a +37.5% success-rate improvement, as the observed deltas could be sensitive to how the tasks were chosen.
Authors: We agree that explicit characterization of the task set is necessary to support the reported gains. In the revised manuscript we will add a dedicated subsection under Evaluation that specifies: task selection criteria (operational tasks drawn from production logs that involve multi-step tool chains, error conditions, or cross-domain dependencies); domain distribution (approximate counts across the six business domains); complexity metrics (mean tool calls per task, presence of recovery scenarios); and sampling method (stratified sampling from a larger pool of logged tasks to balance domain coverage while preserving real-world frequency). We will also note any selection biases introduced by focusing on tasks that reached the agent layer. revision: yes
-
Referee: [Baseline Comparison subsection] The construction of the 'optimized CRUD baselines' is not described in sufficient detail (e.g., whether they received equivalent metadata, error-handling hooks, or agent tuning). Without this, it is impossible to determine whether the reported reductions in interventions and error-recovery gains are attributable to the Six-Verb protocol and NTC rather than differences in baseline implementation.
Authors: We acknowledge the need for greater clarity on baseline construction. The revision will expand the Baseline Comparison subsection to state that both conditions used the identical agent model, prompt template, and orchestration framework. The CRUD baseline received standard OpenAPI documentation plus basic HTTP error codes and retry hooks, but lacked the NTC metadata fields, semantic phase decomposition, and governance pipeline. Agent-level tuning parameters were held constant; the only intentional difference was the tool interface itself. We will explicitly discuss the inherent limits to equivalence given that CRUD APIs cannot natively supply the structured decision-support metadata present in the Agent-First design. revision: yes
-
Referee: [Results Analysis] No statistical tests, confidence intervals, variance measures, or controls for confounding factors (agent model, prompt strategy, or platform-specific features) accompany the headline metrics (88% vs 64%, 72.7% intervention reduction, 5.8x recovery). This absence weakens verification of the quantitative claims.
Authors: We accept that the current presentation lacks statistical framing. We will add to the Results Analysis section binomial proportion confidence intervals (Wilson score) for the 88% and 64% success rates, together with a note that the intervention-reduction and recovery-multiplier figures derive from paired within-task comparisons. Because the study was conducted on a single production platform, full factorial controls for every platform-specific feature are not feasible; we will therefore add an explicit limitations paragraph discussing potential confounding from the shared infrastructure while confirming that agent model and prompt strategy were fixed across conditions. Variance estimates from task-level logs will also be reported where available. revision: partial
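The Wilson intervals the authors commit to adding are easy to check from the reported figures (44/50 and 32/50 successes). A quick sketch; the function below is a standard Wilson score construction, not code from the paper:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# The paper's reported success rates over 50 tasks:
lo_a, hi_a = wilson_interval(44, 50)   # 88% condition: roughly (0.76, 0.94)
lo_b, hi_b = wilson_interval(32, 50)   # 64% condition: roughly (0.50, 0.76)
```

The two intervals nearly touch at n = 50, which is exactly why the referee's request for paired within-task analysis matters.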
Circularity Check
No circularity: empirical validation of proposed API paradigm
Full rationale
The paper proposes an Agent-First Tool API paradigm with Six-Verb protocol, NTC metadata, and governance pipeline, then reports direct comparative measurements (88% vs 64% success, 72.7% fewer interventions, 5.8x recovery) on 50 real tasks using 85 tools in one production SaaS platform. No equations, fitted parameters, or derivation steps appear in the provided text. No self-citations are invoked as load-bearing premises for the core claims. The results are presented as measured outcomes from experiments rather than quantities that reduce to the inputs by construction or renaming. This is a standard empirical architecture paper whose central claims rest on external task execution data, not internal definitional loops.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the five architectural mismatches (exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics) are fundamental and universal for agent-tool interactions.
invented entities (2)
- Six-Verb Semantic Protocol: no independent evidence
- Normalized Tool Contract (NTC): no independent evidence
Reference graph
Works this paper leans on
- [1] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," in Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [2] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, "Gorilla: Large language model connected with massive APIs," arXiv preprint arXiv:2305.15334, 2023.
- [3] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerber, D. Li, Z. Liu, and M. Sun, "ToolLLM: Facilitating large language models to master 16000+ real-world APIs," in Proc. ICLR, 2024.
- [4] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li, "API-Bank: A comprehensive benchmark for tool-augmented LLMs," in Proc. EMNLP, 2023.
- [5] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou, S. Lu, L. Ji, S. Mao, Y. Wang, S. Shu, and others, "TaskMatrix.AI: Completing tasks by connecting foundation models with millions of APIs," arXiv preprint arXiv:2303.16434, 2023.
- [6] OpenAI, "Function calling and other API updates," OpenAI Blog, June 2023. [Online]. Available: https://openai.com/blog/function-calling-and-other-api-updates
- [7] Anthropic, "Tool use (function calling)," Anthropic Documentation, 2024. [Online]. Available: https://docs.anthropic.com/claude/docs/tool-use
- [8] Anthropic, "Model Context Protocol Specification," 2024. [Online]. Available: https://modelcontextprotocol.io/specification [Accessed: Jan. 15, 2025].
- [9] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," in Proc. ICLR, 2023.
- [10] E. Karpas, O. Abend, Y. Belinkov, B. Lenz, O. Liber, N. Ratner, Y. Shoham, H. Bata, Y. Levine, K. Leyton-Brown, D. Muber, and N. Rozen, "MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning," arXiv preprint arXiv:2205.00445, 2022.
- [11] LangChain, "LangChain: Building applications with LLMs through composability," 2023. [Online]. Available: https://github.com/langchain-ai/langchain
- [12] T. Richards, "Auto-GPT: An autonomous GPT-4 experiment," 2023. [Online]. Available: https://github.com/Significant-Gravitas/Auto-GPT
- [13] J. Moura, "CrewAI: Framework for orchestrating role-playing autonomous AI agents," 2024. [Online]. Available: https://github.com/joaomdmoura/crewAI
- [14] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, "MetaGPT: Meta programming for a multi-agent collaborative framework," in Proc. ICLR, 2024.
- [15] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang, "On the tool manipulation capability of open-source large language models," arXiv preprint arXiv:2305.16504, 2023.
- [16] Kong Inc., "Kong Gateway: Cloud-native API gateway," 2023. [Online]. Available: https://konghq.com
- [17] Envoy Project, "Envoy proxy: Cloud-native high-performance edge/middle/service proxy," 2023. [Online]. Available: https://www.envoyproxy.io
- [18] D. Hardt, "The OAuth 2.0 authorization framework," RFC 6749, Internet Engineering Task Force, October 2012.
- [19] R. S. Sandhu, E. J. Coyne, H. L. Feinstein, and C. E. Youman, "Role-based access control models," IEEE Computer, vol. 29, no. 2, pp. 38–47, 1996.
- [20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [21] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, "HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face," in Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [22] Y. Ge, W. Hua, K. Mei, J. Ji, J. Tan, S. Xu, Z. Li, and Y. Zhang, "OpenAGI: When LLM meets domain experts," in Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [23] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X. Zhao, Z. Wei, and J. Wen, "A survey on large language model based autonomous agents," arXiv preprint arXiv:2308.11432, 2023.
- [24] J. Ruan, Y. Chen, B. Zhang, Z. Xu, T. Bao, G. Du, S. Shi, H. Mao, X. Zeng, and R. Zhao, "TPTU: Task planning and tool usage of large language model-based AI agents," arXiv preprint arXiv:2308.03427, 2023.
- [25] Google, "Agent-to-Agent (A2A) Protocol," Technical Report, 2024. [Online]. Available: https://github.com/google/A2A