arxiv: 2606.20615 · v2 · pith:V35BLM23new · submitted 2026-05-24 · 💻 cs.AI · cs.MA · cs.PL · cs.SE

Specifying AI-SDLC Processes: A Protocol Language for Human-Agent Boundaries

Pith reviewed 2026-06-30 11:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.MA · cs.PL · cs.SE
keywords protocol language · AI-SDLC · human-agent boundaries · structural enforcement · failure rate analysis · separation of duties · operational semantics · validation tokens

The pith

A protocol language for AI-SDLC processes uses structural enforcement via validation tokens and capability boundaries to bound system failure rates at the weighted product of agent and validator rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a domain-specific language for specifying protocols that define human-agent responsibility boundaries, approval gates, and governance constraints in the software development lifecycle. It supplies formal syntax, well-formedness conditions, operational semantics, and enforcement invariants while distinguishing declared policy from enforceable mechanism. This setup allows implementations to limit process non-determinism through primitives such as validation tokens and capability boundaries. One result is that structural enforcement produces failure rates equal to a weighted product of the component rates, whereas behavioral compliance permits cumulative or near-saturating growth. The language also yields the 2+N team pattern as a formalization of separation of duties and treats Kleene closure of orchestration loops plus reflexive adherence validation as emergent design properties.

Core claim

The paper establishes a protocol language for AI-SDLC with formal syntax, well-formedness conditions, operational semantics, and enforcement invariants. By distinguishing policy from mechanism, implementations can bound non-determinism using validation tokens and capability boundaries. This yields three results: structural enforcement bounds failure rates at a weighted product of agent and validator rates while behavioral compliance permits cumulative or near-saturating growth; the 2+N team pattern formalizes Separation of Duties for AI-SDLC; and Kleene closure of orchestration loops and reflexive protocol-adherence validation emerge as design properties.
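
The third result treats the orchestration loop as a Kleene closure, meaning zero or more repetitions of one step sequence. A minimal sketch of that reading follows. The step names, the `escalate` callback, and the `Outcome` values are hypothetical and chosen for illustration; they are not the paper's syntax or semantics.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    DONE = "done"
    HALT = "halt"


# Hypothetical step signature: returns True when the task may proceed.
Step = Callable[[str], bool]


def run_loop(
    tasks: Iterable[str],
    steps: list[Step],
    escalate: Callable[[str], bool],
) -> Outcome:
    """Run (step_1 . step_2 . ... . step_n)* over the task stream."""
    for task in tasks:  # zero tasks means zero iterations, still a valid run
        for step in steps:
            # Assumption: a failed step is escalated to a human, who may
            # let the task continue (True) or stop the run (False).
            if not step(task) and not escalate(task):
                return Outcome.HALT
    return Outcome.DONE
```

Because the loop body is an ordinary sequence, the empty run and the many-task run go through the same code path, which is one way to read the claim that closure is a design property and not a special-case construct.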

What carries the argument

The protocol language with its operational semantics and enforcement invariants, which realize structural enforcement through validation tokens and capability boundaries to bound non-determinism.
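
To make the policy versus mechanism distinction concrete, here is a minimal sketch of how a validation token and a capability boundary could work together. The class names `Validator` and `ExecutionGate`, the HMAC-based token format, and the shared key are all assumptions for illustration; this summary does not describe the paper's actual primitives at that level of detail.

```python
import hashlib
import hmac
import secrets


class Validator:
    """Hypothetical validator that issues a token bound to one artifact."""

    def __init__(self, key: bytes):
        self._key = key

    def approve(self, artifact: bytes) -> bytes:
        # A real validator would inspect the artifact before approving;
        # this sketch approves unconditionally to keep the focus on the token.
        digest = hashlib.sha256(artifact).digest()
        return hmac.new(self._key, digest, "sha256").digest()


class ExecutionGate:
    """Hypothetical capability boundary: execution requires a valid token."""

    def __init__(self, key: bytes):
        self._key = key
        self.executed: list[bytes] = []

    def execute(self, artifact: bytes, token: bytes) -> None:
        digest = hashlib.sha256(artifact).digest()
        expected = hmac.new(self._key, digest, "sha256").digest()
        if not hmac.compare_digest(expected, token):
            raise PermissionError("missing or invalid validation token")
        self.executed.append(artifact)


if __name__ == "__main__":
    key = secrets.token_bytes(32)
    validator, gate = Validator(key), ExecutionGate(key)
    token = validator.approve(b"patch-1")
    gate.execute(b"patch-1", token)      # accepted: token matches artifact
    try:
        gate.execute(b"patch-2", token)  # rejected: token bound to patch-1
    except PermissionError as err:
        print("blocked:", err)
```

The point of the sketch is that an agent cannot skip validation by ignoring an instruction, because the gate refuses any artifact without a matching token. A behavioural approach would instead rely on the agent choosing to call the validator.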

If this is right

  • Structural enforcement bounds system failure rates at a weighted product of agent and validator rates.
  • Behavioral compliance permits cumulative or near-saturating growth in failure rates.
  • The 2+N team pattern formalizes classical Separation of Duties for AI-SDLC.
  • Kleene closure of orchestration loops and reflexive protocol-adherence validation emerge as design properties rather than special-case constructs.
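
The contrast in the first two bullets can be illustrated with a toy model. It is not the paper's derivation, which this summary does not reproduce. The assumptions are: under behavioural compliance each of `n` steps fails independently with probability `p_agent` and any failure goes undetected, while under structural enforcement a failure escapes only when the agent fails and the validator misses it. The `weight` parameter is a placeholder, since the paper's weighting is not given here.

```python
def behavioural_failure(p_agent: float, n_steps: int) -> float:
    """Assumed model: any of n independent unchecked steps can fail."""
    return 1.0 - (1.0 - p_agent) ** n_steps


def structural_bound(p_agent: float, p_validator: float, weight: float = 1.0) -> float:
    """Assumed model: failure needs an agent error AND a validator miss.

    `weight` is a hypothetical stand-in for the paper's weighting term.
    """
    return min(1.0, weight * p_agent * p_validator)


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        b = behavioural_failure(0.1, n)
        s = structural_bound(0.1, 0.2)
        print(f"n={n:2d}  behavioural={b:.3f}  structural bound={s:.3f}")
```

Under these assumptions the behavioural figure climbs toward 1 as the pipeline lengthens, while the structural figure depends only on the two component rates and the weight.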

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The language could extend to governance of agent teams in domains other than software development where similar responsibility boundaries are required.
  • Practical implementations may show whether the stated bounds on non-determinism hold when agents exhibit partial observability or learning during execution.
  • The distinction between policy and mechanism offers a route to auditability that could be tested by generating compliance logs directly from the protocol state.

Load-bearing premise

The proposed language's operational semantics and enforcement invariants can be realized through primitives such as validation tokens and capability boundaries in a way that bounds non-determinism.

What would settle it

An implemented AI-SDLC system using the protocol language in which measured failure rates under structural enforcement exceed the weighted product of agent and validator rates.

Figures

Figures reproduced from arXiv: 2606.20615 by Ylli Prifti.

Figure 1. Human intent flows through orchestrator collaboration into the execution loop. The process begins with intent expression, proceeds through architectural breakdown, and enters a cycle of governance checking, validation, execution, and progression, with human escalation when needed. Validators are invoked at their declared evaluation phase: specification validators run before Execute, artifact val…
Figure 2. Policy versus mechanism. Left: behavioural compliance leaves enforcement to agent…
Figure 3. The 2+N team pattern showing producer and reviewer modes separated by infras…
Figure 4. False-block versus false-pass rate for four disagreement policies (…
Figure 5. Refusal-mode breakdown for good tasks (N = 3, real severities). HALT refusals arise from blocker-severity validator outputs; ESCALATE refusals arise from consensus failure. Stricter policies produce a larger ESCALATE share, revealing the mechanism by which they raise friction on well-formed work.
Figure 6. Validation of the quorum-miss formula ($P_{\text{quorum miss}} = p_v^K + \rho\,(p_v - p_v^K)$, with $p_v = 0.2$). Simulated points (markers) track the analytical curves (lines) for both K = 3 and K = 5 across the full correlation range. At ρ = 1 the quorum provides no benefit over a single validator.
Figure 7. Structural versus behavioural failure probability by pipeline length.
Figure 8. Sensitivity of the $P_{\text{behavioural}} / P_{\text{structural}}$ ratio to validator correlation ρ and recovery-loop failure $p_{\text{recovery}}$. The ratio exceeds unity across the entire plane and is largest when validators are near-independent and recovery is reliable.
Figure 9. Adversarial validator sensitivity (N = 5, p = 0.7). Left: under always-pass adversaries, false-pass rate is policy-dependent, with Unanimous most robust (first meaningful leakage at f = 3) and Any most fragile (f = 1). Right: under always-block adversaries, a single compromised validator (f = 1) drives all policies to 100% false-block: INV3 operating as specified. The asymmetry is the quantified cost of n…
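
The quorum-miss formula in the Figure 6 caption can be checked numerically. The sketch below assumes a simple mixture model for correlation: with probability ρ all K validators share a single outcome, and otherwise they miss independently with probability p_v. That model is an assumption, since the paper's correlation model is not stated in this summary, but it reproduces the caption's expression exactly: ρ·p_v + (1 − ρ)·p_v^K = p_v^K + ρ(p_v − p_v^K). "Quorum miss" is taken to mean that all K validators miss.

```python
import random


def quorum_miss_analytic(p_v: float, k: int, rho: float) -> float:
    """Caption formula: p_v^K + rho * (p_v - p_v^K)."""
    return p_v**k + rho * (p_v - p_v**k)


def quorum_miss_simulated(
    p_v: float, k: int, rho: float, trials: int = 100_000, seed: int = 0
) -> float:
    """Monte Carlo estimate under the assumed mixture model."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        if rng.random() < rho:
            # Correlated case: one shared draw decides all K validators.
            all_miss = rng.random() < p_v
        else:
            # Independent case: every validator must miss on its own draw.
            all_miss = all(rng.random() < p_v for _ in range(k))
        misses += all_miss
    return misses / trials


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        a = quorum_miss_analytic(0.2, 3, rho)
        s = quorum_miss_simulated(0.2, 3, rho)
        print(f"rho={rho:.1f}  analytic={a:.4f}  simulated={s:.4f}")
```

At ρ = 1 the analytic expression reduces to p_v, which matches the caption's remark that a fully correlated quorum is no better than a single validator.
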
Abstract

AI agents now participate as first-class team members across the software development lifecycle, yet no specification language exists for expressing the human-agent responsibility boundaries, approval gates, and governance constraints this collaboration requires. Existing approaches encode process in agent prompts (subject to drift), target adjacent domains (workflow management, business processes), or address only fragments (access control, approval gates). We propose a domain-specific language for specifying AI-SDLC processes as protocols, with formal syntax, well-formedness conditions, operational semantics, and enforcement invariants. The language distinguishes policy (declared intent) from mechanism (structural enforcement), enabling implementations to bound process non-determinism through primitives such as validation tokens and capability boundaries. Three results follow. A failure rate analysis shows that structural enforcement bounds system failure rates at a weighted product of agent and validator rates, while behavioral compliance permits cumulative or near-saturating growth. The 2+N team pattern (two human-in-control roles plus N specialized agent members) formalizes classical Separation of Duties for AI-SDLC. Kleene closure of orchestration loops and reflexive protocol-adherence validation emerge as design properties rather than special-case constructs. We position the contribution against multi-agent frameworks (MetaGPT), workflow specification (FlowAgent, BPMN extensions), and capability-based security (SAGA): the novelty lies in the specific integration, not any single primitive. A working implementation demonstrates feasibility; empirical evaluation is future work.
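
The abstract's definition of the 2+N pattern (two human-in-control roles plus N specialized agent members) suggests a simple well-formedness check. The sketch below is an illustration under stated assumptions: the `Member` type, the `kind` and `role` fields, the requirement that N is at least 1, and the requirement that the two human roles differ are all hypothetical, and the paper's own well-formedness conditions may be stricter or differently shaped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    kind: str  # "human" or "agent" (hypothetical encoding)
    role: str


def check_two_plus_n(team: list[Member]) -> list[str]:
    """Return violations of the assumed 2+N shape; empty means well-formed."""
    errors: list[str] = []
    humans = [m for m in team if m.kind == "human"]
    agents = [m for m in team if m.kind == "agent"]
    if len(humans) != 2:
        errors.append(f"expected 2 human-in-control roles, found {len(humans)}")
    elif humans[0].role == humans[1].role:
        # Assumption: separation of duties needs two distinct human roles.
        errors.append("the two human roles must be distinct")
    if not agents:
        # Assumption: N >= 1.
        errors.append("expected at least one agent member")
    return errors


if __name__ == "__main__":
    team = [
        Member("alice", "human", "owner"),
        Member("bob", "human", "reviewer"),
        Member("coder-1", "agent", "implementation"),
    ]
    print(check_two_plus_n(team))
```

A check of this kind belongs to the policy side. On its own it does not enforce anything at run time, which is why the paper pairs declarations with structural mechanisms.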

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a domain-specific protocol language for AI-SDLC processes that distinguishes policy (declared intent) from mechanism (structural enforcement). It supplies formal syntax, well-formedness conditions, operational semantics, and enforcement invariants using primitives such as validation tokens and capability boundaries. Three results are claimed: a failure-rate analysis in which structural enforcement bounds system failure rates at a weighted product of agent and validator rates (while behavioral compliance permits cumulative growth); formalization of the 2+N team pattern as Separation of Duties; and emergence of design properties including Kleene closure of orchestration loops and reflexive protocol-adherence validation. A feasibility implementation is presented; empirical evaluation is stated as future work.

Significance. If the failure-rate bound can be rigorously derived from the operational semantics and invariants, the work would supply a concrete integration of process specification with capability-based enforcement that is absent from existing multi-agent frameworks and workflow languages. The policy-mechanism distinction and the 2+N pattern are cleanly stated contributions that could serve as reference points for governance of human-agent teams.

major comments (2)
  1. [Abstract] Abstract (failure-rate analysis paragraph): the central quantitative claim states that structural enforcement bounds failure rates at a weighted product of agent and validator rates, yet no probabilistic model, independence assumptions, equations, or derivation linking validation tokens and capability boundaries to failure events is supplied. The manuscript explicitly defers empirical evaluation and provides only a feasibility implementation, leaving the bound as an informal assertion rather than a consequence of the stated syntax and invariants.
  2. [Abstract / operational semantics] Abstract (three results paragraph) and operational-semantics section: the claim that the language primitives eliminate all other failure paths (required for the product bound to hold) is not accompanied by an explicit invariant or proof sketch showing that non-determinism is confined to the agent and validator components.
minor comments (1)
  1. [Related work] The positioning against MetaGPT, FlowAgent, BPMN, and SAGA is clear, but the manuscript would benefit from a short table contrasting the new language's primitives with those frameworks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our formal results. We address each major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract (failure-rate analysis paragraph): the central quantitative claim states that structural enforcement bounds failure rates at a weighted product of agent and validator rates, yet no probabilistic model, independence assumptions, equations, or derivation linking validation tokens and capability boundaries to failure events is supplied. The manuscript explicitly defers empirical evaluation and provides only a feasibility implementation, leaving the bound as an informal assertion rather than a consequence of the stated syntax and invariants.

    Authors: We agree that the failure-rate bound is summarized at a high level in the abstract and requires an explicit probabilistic model and derivation to be presented as a consequence of the syntax and invariants. The manuscript defines the relevant operational semantics and enforcement invariants, but the probabilistic analysis linking them to the weighted product bound is not fully expanded. In the revision we will add the probabilistic model, independence assumptions, equations, and derivation in the failure-rate analysis section, making the bound a direct consequence of the formal elements while retaining the statement that empirical evaluation is future work. revision: yes

  2. Referee: [Abstract / operational semantics] Abstract (three results paragraph) and operational-semantics section: the claim that the language primitives eliminate all other failure paths (required for the product bound to hold) is not accompanied by an explicit invariant or proof sketch showing that non-determinism is confined to the agent and validator components.

    Authors: The operational semantics and well-formedness conditions are intended to confine non-determinism to agent and validator actions via the invariants on validation tokens and capability boundaries. However, we acknowledge that an explicit invariant statement and proof sketch are not provided. We will add both to the operational-semantics section in the revised manuscript to rigorously establish the confinement of failure paths. revision: yes

Circularity Check

0 steps flagged

No circularity: claims presented as consequences of independent semantics


The abstract defines a DSL with syntax, well-formedness, operational semantics and enforcement invariants, then states that three results (including the weighted-product failure bound) follow from those definitions. No equations, fitted parameters, self-citations, or ansatzes are quoted that would make any result equivalent to its inputs by construction. The failure-rate claim is asserted as an analysis outcome rather than a redefinition or statistical fit; the paper explicitly defers empirical evaluation. Because no load-bearing step reduces to a self-referential input, the derivation chain is treated as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence of a formally defined language whose operational semantics and invariants can be implemented to enforce the stated bounds. The language is an invented artifact, and the abstract supplies no independent evidence for it.

axioms (1)
  • [standard math] Standard mathematical definitions of syntax, operational semantics, and invariants suffice to support the claimed enforcement properties.
    Invoked when the abstract states the language has formal syntax, well-formedness conditions, operational semantics, and enforcement invariants.
invented entities (1)
  • AI-SDLC protocol language with policy-mechanism distinction [no independent evidence]
    purpose: To specify and structurally enforce human-agent boundaries and governance constraints
    New language and primitives (validation tokens, capability boundaries) introduced to solve the stated problem; no external falsifiable evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5790 in / 1321 out tokens · 31035 ms · 2026-06-30T11:47:03.953188+00:00 · methodology

