pith. machine review for the scientific record.

arxiv: 2605.14312 · v1 · submitted 2026-05-14 · 💻 cs.SE

Recognition: no theorem link

Making OpenAPI Documentation Agent-Ready: Detecting Documentation and REST Smells with a Multi-Agent LLM System


Pith reviewed 2026-05-15 02:41 UTC · model grok-4.3

classification 💻 cs.SE
keywords OpenAPI · AI agents · documentation smells · REST APIs · multi-agent LLM · Model Context Protocol · API documentation · semantic readiness

The pith

OpenAPI documentation that is structurally valid within a microservice architecture often lacks the semantic structure AI agents need.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that stable, widely used REST APIs in a microservice architecture can still cause systematic failures in task planning, tool selection, and payload construction when exposed to MCP-based AI agents. To investigate, the authors built Hermes, a multi-agent LLM system that scans OpenAPI specifications for documentation and REST smells at the endpoint level. In a production ecosystem of 16 APIs comprising roughly 600 endpoints, the system found 2,450 smells, with every operation affected. Practitioner reviews aligned with the detections and highlighted trade-offs in fixes, leading the organization to shift from broad adoption to selective remediation and new documentation standards. The work demonstrates that structural correctness alone does not prepare APIs for agent consumption.
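
To make the semantic gap concrete, here is a minimal sketch (not the paper's implementation) of how a single OpenAPI operation becomes an MCP-style tool card with name, description, and inputSchema fields, following the public MCP tool shape; the endpoint and its contents are hypothetical. A structurally valid operation with empty descriptions yields a tool card that gives an agent almost nothing to select or plan with.

```python
def operation_to_tool(path: str, method: str, op: dict) -> dict:
    """Build an MCP-style tool card from one OpenAPI operation."""
    return {
        "name": op.get("operationId", f"{method}_{path}"),
        # Agents select tools largely from this text; a missing summary or
        # description is exactly the kind of smell the paper targets.
        "description": op.get("summary") or op.get("description") or "",
        "inputSchema": {
            "type": "object",
            "properties": {
                p["name"]: {
                    "type": p.get("schema", {}).get("type", "string"),
                    "description": p.get("description", ""),
                }
                for p in op.get("parameters", [])
            },
            "required": [p["name"] for p in op.get("parameters", []) if p.get("required")],
        },
    }

# Hypothetical endpoint: structurally valid, semantically empty.
op = {"operationId": "getOrd1",
      "parameters": [{"name": "id", "required": True, "schema": {"type": "string"}}]}
print(operation_to_tool("/orders/{id}", "get", op))  # description == ""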

Core claim

The central claim is that structural validity of OpenAPI documentation within microservice environments does not guarantee semantic readiness for consumption by MCP-based AI agents. Hermes, a multi-agent LLM system, detected documentation and REST-related smells across 600 production endpoints and produced 2,450 diagnostic reports, revealing deficiencies in all analyzed operations. Practitioner validation confirmed the issues while noting contextual remediation trade-offs, which prompted the organization to revise its adoption strategy toward selective endpoint adaptation, updated documentation standards, and automated assessment integrated into governance.

What carries the argument

Hermes, a multi-agent LLM system that identifies documentation and REST smells at the endpoint level and generates explainable diagnostic reports.
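
Hermes itself is prompt-driven and its detection rules are not reproduced in this summary; the following is a deliberately simplified, rule-based stand-in for one smell category, shown only to fix the unit of analysis: one explainable report per operation.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}

def scan_spec(spec: dict) -> list[dict]:
    """Emit one report per operation exhibiting a lazy-documentation smell."""
    reports = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue
            evidence = []
            if not (op.get("summary") or op.get("description")):
                evidence.append("operation has no summary or description")
            for p in op.get("parameters", []):
                if not p.get("description"):
                    evidence.append(f"parameter '{p['name']}' is undocumented")
            if evidence:
                reports.append({"endpoint": f"{method.upper()} {path}",
                                "smell": "LAZY",  # category named in the paper's appendix
                                "evidence": evidence})
    return reports
```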

If this is right

  • Organizations must assess OpenAPI documentation for agent-specific semantic issues before MCP integration.
  • Automated smell detection can be embedded into ongoing API governance workflows (a gating sketch follows this list).
  • Selective rather than universal endpoint adaptation reduces risk when exposing APIs to agents.
  • Redefined documentation standards become necessary for reliable agent tool use.
  • Systematic artifact evaluation serves as evidence-based input for AI adoption decisions.
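
A minimal sketch of the governance hook named in the list above, assuming a CI step that loads a spec and fails the build when detected smells exceed a budget. scan_spec is the illustrative scanner sketched earlier; the spec path and budget are hypothetical.

```python
import json
import sys

def gate(spec_path: str, max_smells: int = 0) -> int:
    """Return a nonzero exit code when the smell budget is exceeded."""
    with open(spec_path) as f:
        spec = json.load(f)
    reports = scan_spec(spec)  # rule-based stand-in defined earlier
    for r in reports:
        print(f"{r['endpoint']}: {r['smell']} - {'; '.join(r['evidence'])}")
    return 1 if len(reports) > max_smells else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```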

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same smell-detection approach could be applied to non-OpenAPI API descriptions used by agents.
  • Persistent documentation issues may explain adoption friction in other organizations moving to MCP.
  • Quantifying agent performance gains after smell fixes would strengthen the causal link beyond practitioner opinion.
  • The method offers a reusable pattern for turning existing service ecosystems into agent-compatible tool sets.

Load-bearing premise

That the smells detected by the multi-agent system are the main drivers of observed agent failures, and that practitioner agreement is sufficient validation in the absence of direct before-and-after measurements of agent task success.

What would settle it

Measure success rates of MCP-based agents performing the same tasks on the original 600 endpoints versus the same endpoints after targeted remediation of the detected smells.
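
A sketch of that harness under stated assumptions: run_agent_task stands in for whatever MCP agent runner the organization uses, and tasks is an opaque task suite; neither is an artifact of the paper. A positive delta on the same tasks would directly support the causal claim.

```python
from statistics import mean

def success_rate(tasks: list, spec: dict, run_agent_task) -> float:
    """run_agent_task(task, spec) -> bool: did the agent complete the task?"""
    return mean(1.0 if run_agent_task(t, spec) else 0.0 for t in tasks)

def remediation_delta(tasks, spec_before, spec_after, run_agent_task) -> float:
    before = success_rate(tasks, spec_before, run_agent_task)
    after = success_rate(tasks, spec_after, run_agent_task)
    print(f"success before remediation: {before:.1%}")
    print(f"success after remediation:  {after:.1%}")
    return after - before  # positive delta supports the documentation hypothesis
```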

original abstract

The growing adoption of AI agents and the Model Context Protocol (MCP) has motivated organizations to expose existing REST APIs as agent-consumable tools. In our industrial context, this initiative targeted an ecosystem of 16 production APIs comprising approximately 600 endpoints. Although these APIs were stable and widely used within a microservice architecture, early proof-of-concept experiments revealed systematic failures in task planning, tool selection, and payload construction when accessed through MCP-based agents. Rather than attributing these failures to model limitations alone, we conducted an ecosystem-scale empirical assessment of the underlying OpenAPI documentation. We developed Hermes, a multi-agent LLM-based system that detects documentation and REST-related smells at the endpoint level and generates explainable diagnostic reports. The large-scale evaluation identified 2,450 smells across 600 endpoints, with deficiencies present in all analyzed operations. Practitioner validation confirmed high agreement with the detected issues while also revealing contextual trade-offs in remediation decisions. The findings suggested that structural validity within microservice environments does not guarantee semantic readiness for agent-based consumption. Based on this evidence, the organization revised its adoption strategy, prioritizing selective endpoint adaptation, redefining documentation standards, and integrating automated documentation assessment into API governance workflows. This case illustrates how systematic artifact-level evaluation can function as a strategic decision-support mechanism, reducing technological risk and guiding evidence-based AI adoption in industrial software ecosystems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Hermes, a multi-agent LLM-based system that detects documentation and REST-related smells in OpenAPI specifications at the endpoint level. Evaluated on an industrial ecosystem of 16 production APIs comprising ~600 endpoints, the system identified 2,450 smells present in all analyzed operations. Practitioner validation showed high agreement on relevance, with some contextual trade-offs noted. The authors conclude that structural validity within microservice architectures does not guarantee semantic readiness for agent consumption via the Model Context Protocol (MCP), prompting the organization to revise its adoption strategy toward selective endpoint adaptation, updated documentation standards, and automated assessment in governance workflows.

Significance. If the detection approach and causal attributions hold, the work offers a concrete, scalable method for assessing API documentation readiness for AI agents, addressing a practical gap between conventional REST design and emerging agent-based consumption patterns. The industrial scale and direct link to strategy changes provide evidence that systematic artifact evaluation can reduce adoption risk in software ecosystems.

major comments (2)
  1. [Evaluation] Evaluation section: The central claim that the detected smells are the primary driver of observed agent failures in task planning, tool selection, and payload construction lacks support from quantitative before-and-after agent performance data (e.g., task completion rates or planning accuracy) on the same endpoints. Practitioner agreement confirms smell presence and noticeability but does not establish that remediation improves agent behavior or exclude model/protocol factors as dominant.
  2. [Smell detection and definitions] Smell detection and definitions: The manuscript reports 2,450 smells but provides insufficient detail on precise smell definitions, the multi-agent LLM detection rules or prompts, precision/recall metrics, or error analysis. Without these, it is difficult to assess the reliability of the large-scale findings or replicate the system.
minor comments (2)
  1. [Abstract] Abstract and introduction: Clarify how the explainable diagnostic reports are structured and delivered to practitioners, including any examples of report format or remediation suggestions.
  2. [Related work] The manuscript could add a brief comparison to existing static analysis tools for OpenAPI or REST smells to better position the novelty of the multi-agent LLM approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate planned revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The central claim that the detected smells are the primary driver of observed agent failures in task planning, tool selection, and payload construction lacks support from quantitative before-and-after agent performance data (e.g., task completion rates or planning accuracy) on the same endpoints. Practitioner agreement confirms smell presence and noticeability but does not establish that remediation improves agent behavior or exclude model/protocol factors as dominant.

    Authors: We agree that direct quantitative evidence linking smell remediation to improved agent performance would strengthen causal claims. The current work is an observational study of documentation quality across an industrial ecosystem, motivated by early agent failures but not structured as a controlled before-and-after experiment. Practitioner validation establishes that the smells are both present and relevant. In the revision we will temper causal language in the evaluation and conclusions, add an explicit limitations subsection acknowledging the absence of performance metrics, and outline future controlled experiments needed to isolate documentation effects from model or protocol variables. revision: partial

  2. Referee: [Smell detection and definitions] Smell detection and definitions: The manuscript reports 2,450 smells but provides insufficient detail on precise smell definitions, the multi-agent LLM detection rules or prompts, precision/recall metrics, or error analysis. Without these, it is difficult to assess the reliability of the large-scale findings or replicate the system.

    Authors: We appreciate the emphasis on reproducibility. The revised manuscript will include a substantially expanded appendix containing: precise definitions and detection criteria for each smell category; the complete prompts, decision rules, and agent interaction protocol used by the multi-agent LLM system; precision and recall values computed against the practitioner validation set; and a concise error analysis of disagreements and edge cases observed during validation. These additions will allow readers to evaluate the reliability of the 2,450-smell findings and replicate the detection pipeline. revision: yes
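
The promised precision/recall computation is standard; a minimal sketch, assuming validation labels arrive as (endpoint, smell) pairs. The pair format and the example values are illustrative, not the paper's data.

```python
def precision_recall(detected: set, confirmed: set) -> tuple:
    """Score detections against practitioner-confirmed (endpoint, smell) pairs."""
    tp = len(detected & confirmed)  # detected and confirmed by practitioners
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(confirmed) if confirmed else 0.0
    return precision, recall

detected = {("GET /orders/{id}", "LAZY"), ("POST /orders", "LAZY")}
confirmed = {("GET /orders/{id}", "LAZY")}
print(precision_recall(detected, confirmed))  # (0.5, 1.0)
```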

Circularity Check

0 steps flagged

No circularity: empirical evaluation of documentation smells

full rationale

The paper conducts a direct empirical scan of 600 production endpoints using the Hermes multi-agent LLM detector, reports 2,450 detected smells, and obtains independent practitioner agreement on relevance. No equations, fitted parameters, predictions, or derivations are present that reduce to self-definitions, self-citations, or renamed inputs. The central claim (structural validity does not guarantee semantic readiness) rests on observed failures and external validation rather than any load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM-detected smells accurately capture barriers to agent use and that the observed failures stem primarily from documentation rather than model limitations.

axioms (1)
  • domain assumption: LLM-based multi-agent detection reliably surfaces documentation and REST smells that cause agent failures
    Invoked implicitly when interpreting the 2,450 detected smells as the root cause of task planning and payload failures.

pith-pipeline@v0.9.0 · 5560 in / 1168 out tokens · 27911 ms · 2026-05-15T02:41:59.087149+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 2 internal anchors

  1. [1]

    Abdullah AH Alzahrani. 2024. Software Systems Documentation: A Systematic Review. International Journal of Advanced Computer Science & Applications 15, 8 (2024)

  2. [2]

    Anthropic. 2024. Model Context Protocol (MCP). https://modelcontextprotocol.io

  3. [3]

    Jayachandu Bandlamudi, Ritwik Chaudhuri, Neelamadhav Gantayat, Sambit Ghosh, Kushal Mukherjee, Prerna Agarwal, Renuka Sindhgatta, and Sameep Mehta. 2025. A Framework for Testing and Adapting REST APIs as LLM Tools. arXiv preprint arXiv:2504.15546 (2025)

  4. [4]

    Sandra Casas, Diana Cruz, Graciela Vidal, and Marcela Constanzo. 2021. Uses and applications of the OpenAPI/Swagger specification: a systematic mapping of the literature. In 2021 40th International Conference of the Chilean Computer Science Society (SCCC). IEEE, 1–8

  5. [5]

    Michael Coblenz, Wentao Guo, Kamatchi Voozhian, and Jeffrey S. Foster. 2023. A Qualitative Study of REST API Design and Specification Practices. In 2023 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 148–156

  6. [6]

    Leonardo da Rocha Araújo, Guillermo Rodríguez, Santiago Vidal, Claudia Marcos, and Rodrigo Pereira dos Santos. 2022. Empirical Analysis on OpenAPI Topic Exploration and Discovery to Support the Developer Community. Computing and Informatics 40, 6 (2022), 1345–1369. doi:10.31577/cai_2021_6_1345

  7. [7]

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  8. [8]

    Roy Thomas Fielding. 2000. Architectural Styles and the Design of Network-based Software Architectures. Ph.D. Dissertation. University of California, Irvine

  9. [9]

    P Gowda and AN Gowda. 2024. Best Practices in REST API Design for Enhanced Scalability and Security. Journal of Artificial Intelligence, Machine Learning and Data Science 2, 1 (2024), 827–830

  10. [10]

    Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, and Zhendong Mao. 2025. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools. arXiv preprint arXiv:2509.09734 (2025)

  11. [11]

    Junaed Younus Khan, Md. Tawkat Islam Khondaker, Gias Uddin, and Anindya Iqbal. 2021. Automatic Detection of Five API Documentation Smells: Practitioners’ Perspectives. arXiv preprint arXiv:2102.08486 (2021). https://arxiv.org/abs/2102.08486

  12. [12]

    Zhihao Li, Kun Li, Boyang Ma, Minghui Xu, Yue Zhang, and Xiuzhen Cheng. 2025. We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems. arXiv preprint arXiv:2507.06250 (2025)

  14. [14]

    Rayfran Rocha Lima, Irineu Evangelista Cruz de Brito, Jardel da Cunha Nascimento, and Elias Nascimento Nogueira. 2025. Empowering Conversational Systems Through AI Agents Specialized in Reusable APIs Across Software Ecosystems. In 2025 IEEE 37th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE Computer Society, Los Alamitos...

  15. [15]

    Mark Masse. 2011. REST API Design Rulebook: Designing Consistent RESTful Web Service Interfaces. O’Reilly Media, Inc.

  16. [16]

    Meriem Mastouri, Emna Ksontini, and Wael Kessentini. 2025. Making REST APIs Agent-Ready: From OpenAPI to MCP Servers for Tool-Augmented LLMs. arXiv preprint arXiv:2507.16044 (2025)

  17. [17]

    Xinyi Ni, Haonan Jian, Qiuyang Wang, Vedanshi Chetan Shah, and Pengyu Hong. 2025. Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation. arXiv preprint arXiv:2506.19998 (2025)

  18. [18]

    OpenAPI Initiative. 2024. OpenAPI Specification. https://spec.openapis.org/oas/latest.html

  19. [19]

    Annibale Panichella, Sebastiano Panichella, Gordon Fraser, Anand Ashok Sawant, and Vincent J Hellendoorn. 2022. Test smells 20 years later: detectability, validity, and reliability. Empirical Software Engineering 27, 7 (2022), 170

  20. [20]

    Cesare Pautasso, Olaf Zimmermann, and Frank Leymann. 2014. RESTful Web Services: Principles, Patterns, Emerging Technologies. IEEE Software 31, 3 (2014), 54–61

  21. [21]

    Martin P. Robillard. 2017. What makes APIs hard to learn? Answers from developers. IEEE Software 34, 6 (2017), 27–34

  22. [22]

    Per Runeson and Martin Höst. 2009. Guidelines for Conducting and Reporting Case Study Research in Software Engineering. Empirical Software Engineering 14, 2 (2009), 131–164. doi:10.1007/s10664-008-9102-8

  23. [23]

    Theo Theunissen, Uwe van Heesch, and Paris Avgeriou. 2022. A mapping study on documentation in Continuous Software Development. Information and Software Technology 142 (2022), 106733. doi:10.1016/j.infsof.2021.106733

  24. [24]

    Jules White et al. 2023. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023)

internal anchors (2)

  Appendix A, Box A.1: prompt template used by the specialized agents, here the Lazy Documentation Smell agent. The role instruction casts the specialist as an expert in identifying "Lazy" documentation smells in API documentation, and the template fixes a two-part output contract:

  • "Justification and evidence of the smell:" - present evidence of incomplete or vague documentation as bullet points, each in the exact format "- [Complete sentence]".
  • "Suggested actions to address the smell:" - provide concrete, actionable recommendations as a table whose Action column follows the exact format "[LAZY] - [action title]". The mandatory output structure permits only the specified bullet format for evidence and requires returning a valid JSON object.
    "Suggested actions to address the smell:" - Provide concrete, actionable recommendations to improve documentation completeness. - Use a TABLE format with the following column: - Action (must follow the exact format: "[LAZY] - [ action title]") Mandatory Output Structure: - Use only the specified bullet format for evidence. - Return a VALID JSON object con...