pith. machine review for the scientific record.

arxiv: 2604.14723 · v1 · submitted 2026-04-16 · 💻 cs.SE · cs.AI

Recognition: unknown

Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:11 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords bounded autonomy · enterprise AI · typed action contracts · consumer-side execution · LLM safety · action validation · multi-tenant systems · AI orchestration

The pith

Typed action contracts and consumer-side execution let language models propose but not perform unsafe enterprise operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that direct operation of enterprise software by language models is unsafe because model errors lead to unauthorized actions, malformed requests, and cross-workspace failures. It presents a bounded-autonomy design in which models interpret natural-language intent and propose actions, while the enterprise application defines and enforces all executable behavior through typed contracts, permission checks, scoped context, pre-execution validation, and consumer-side boundaries. Evaluation across 25 scenario trials spanning seven failure families showed the bounded system completing 23 of 25 tasks with no unsafe executions, versus 17 of 25 for the unconstrained version, with several safety rules enforced by code structure rather than model compliance. The architecture keeps the application as the source of truth and yields 13-18 times faster task completion than manual operation.

Core claim

The enterprise application publishes an explicit actions manifest of typed contracts and permission-aware capabilities; language models may generate proposals against this manifest, but all validation, authorization, and execution occur on the consumer side before any side effects. Across 25 scenario trials spanning seven failure families, this produced 23 successful task completions with zero unsafe executions, while the unconstrained configuration succeeded in only 17. Structured validation feedback also guided the model to correct outcomes in fewer turns than the unconstrained case, which hallucinated success. Two wrong-entity mutations escaped the consumer-contributed layers and required separate disambiguation and confirmation to intercept.
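To make the proposal-then-validate flow concrete, here is a minimal sketch of what a typed action contract and its manifest entry might look like. The paper does not publish its contract schema, so every name below (ActionContract, WorkspaceContext, updateDealStage, "crm.update_deal_stage") is hypothetical, written in TypeScript purely for illustration.

```typescript
// Illustrative sketch only: the paper does not publish its contract schema,
// so every name here (ActionContract, WorkspaceContext, updateDealStage,
// "crm.update_deal_stage") is hypothetical.

type Validated<A> = { ok: true; args: A } | { ok: false; reason: string };

interface WorkspaceContext {
  workspaceId: string;
  userId: string;
  grants: Set<string>; // permissions granted to this user in this workspace
}

// A typed action contract: the consumer application declares what an action is,
// which permission it needs, how its arguments are validated, and how it runs.
interface ActionContract<A> {
  name: string;                          // stable identifier published in the manifest
  requiredPermission: string;            // checked on the consumer side before execution
  validate(raw: unknown): Validated<A>;  // runs before any side effect
  execute(args: A, ctx: WorkspaceContext): Promise<string>;
}

// Hypothetical CRM-style mutation.
const updateDealStage: ActionContract<{ dealId: string; stage: "open" | "won" | "lost" }> = {
  name: "crm.update_deal_stage",
  requiredPermission: "deals:write",
  validate(raw) {
    const a = raw as { dealId?: unknown; stage?: unknown };
    if (!a || typeof a.dealId !== "string")
      return { ok: false, reason: "dealId (string) is required" };
    const stage = a.stage;
    if (stage !== "open" && stage !== "won" && stage !== "lost")
      return { ok: false, reason: "stage must be one of open, won, lost" };
    return { ok: true, args: { dealId: a.dealId, stage } };
  },
  async execute(args, ctx) {
    // ...perform the mutation inside the consumer application's own code...
    return `deal ${args.dealId} moved to ${args.stage} in workspace ${ctx.workspaceId}`;
  },
};

// The manifest exposed to the orchestration engine carries only names and
// permissions (plus argument schemas in practice); the model proposes against
// it but can never call execute() itself.
export const actionsManifest = [
  { name: updateDealStage.name, requiredPermission: updateDealStage.requiredPermission },
];
```

The point of the published manifest is that the orchestration layer only ever sees names, argument shapes, and required permissions; execute() stays on the consumer side.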

What carries the argument

Typed action contracts that define executable behaviors, permission-aware capabilities, and validation logic enforced before any consumer-side side effects.
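A minimal sketch of the consumer-side gate this describes, continuing the hypothetical types from the previous example: authorization, argument validation, and workspace-scoped execution all run in consumer code before any side effect, and a rejection comes back as structured feedback the model can act on. The ordering shown is an assumption consistent with the paper's description, not its actual implementation.

```typescript
// Sketch of the consumer-side gate, reusing the hypothetical ActionContract and
// WorkspaceContext types from the previous example.

type ProposedAction = { name: string; rawArgs: unknown };

type GateResult =
  | { status: "executed"; result: string }
  | { status: "rejected"; feedback: string }; // structured feedback returned to the model

async function runProposal(
  proposal: ProposedAction,
  registry: Map<string, ActionContract<any>>,
  ctx: WorkspaceContext,
): Promise<GateResult> {
  const contract = registry.get(proposal.name);
  if (!contract)
    return { status: "rejected", feedback: `unknown action ${proposal.name}; pick one from the manifest` };

  // 1. Authorization is decided by the consumer application, never by the model.
  if (!ctx.grants.has(contract.requiredPermission))
    return { status: "rejected", feedback: `missing permission ${contract.requiredPermission}` };

  // 2. Arguments are checked against the typed contract before any side effect.
  const validated = contract.validate(proposal.rawArgs);
  if (!validated.ok)
    return { status: "rejected", feedback: validated.reason };

  // 3. Execution runs in consumer code, scoped to the caller's workspace context.
  const result = await contract.execute(validated.args, ctx);
  return { status: "executed", result };
}
```

That structured rejection path is what the paper credits for the bounded system reaching correct outcomes in fewer turns, where the unconstrained configuration instead hallucinated success.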

If this is right

  • Safety properties become structural code guarantees that hold regardless of language-model output.
  • Removing the validation layers makes the system both less safe and less effective, as feedback loops disappear.
  • Both AI configurations deliver 13-18 times the speed of manual operation.
  • Disambiguation and human confirmation remain necessary for certain entity-selection errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contract-based boundary could apply to consumer-facing assistants where users define personal limits on what the model may trigger.
  • Organizations would need processes to evolve the action manifest as business rules change without creating drift between contracts and actual code.
  • Adding optional human approval gates at specific contract points could further reduce the two residual failure cases observed (a sketch of such a gate follows this list).
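As an editorial illustration rather than anything the paper implements, an approval gate could be layered onto the same hypothetical contract type from the sketches above; requestApproval here stands in for whatever human-in-the-loop channel an organization already has.

```typescript
// Hypothetical extension, not something the paper specifies: wrap a contract so
// execution waits for a human decision. requestApproval is assumed to exist
// elsewhere (ticketing system, chat prompt, etc.).

declare function requestApproval(summary: string): Promise<boolean>;

function withApprovalGate<A>(contract: ActionContract<A>): ActionContract<A> {
  return {
    ...contract,
    async execute(args, ctx) {
      const approved = await requestApproval(
        `${ctx.userId} asks to run ${contract.name} with ${JSON.stringify(args)}`,
      );
      if (!approved) throw new Error(`approval denied for ${contract.name}`);
      return contract.execute(args, ctx); // original consumer-side execution
    },
  };
}

// e.g. gate only the mutations that hit the residual wrong-entity class:
// registry.set(updateDealStage.name, withApprovalGate(updateDealStage));
```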

Load-bearing premise

Enterprise applications can correctly define, publish, and maintain the typed action contracts, permission rules, and validation logic without introducing new vulnerabilities or excessive maintenance burden.

What would settle it

A deployed instance in which the bounded-autonomy system permits an unsafe execution that its published contracts and validation were intended to block.

Figures

Figures reproduced from arXiv: 2604.14723 by Ghufran Haider, Sarmad Sohail.

Figure 1
Figure 1: Bounded autonomy architecture. BAL encompasses both the orchestration engine (intent interpretation, action selection over granted capabilities) and portable safety layers (D1–D3). The consumer application defines action contracts, publishes its capability manifest to BAL, and retains execution authority over enterprise state through its own safety layers (D4–D6).
read the original abstract

Large language models are increasingly used as natural-language interfaces to enterprise software, but their direct use as system operators remains unsafe. Model errors can propagate into unauthorized actions, malformed requests, cross-workspace execution, and other costly failures. We argue this is primarily an execution architecture problem. We present a bounded-autonomy architecture in which language models may interpret intent and propose actions, but all executable behavior is constrained by typed action contracts, permission-aware capability exposure, scoped context, validation before side effects, consumer-side execution boundaries, and optional human approval. The enterprise application remains the source of truth for business logic and authorization, while the orchestration engine operates over an explicit published actions manifest. We evaluate the architecture in a deployed multi-tenant enterprise application across three conditions: manual operation, unconstrained AI with safety layers disabled, and full bounded autonomy. Across 25 scenario trials spanning seven failure families, the bounded-autonomy system completed 23 of 25 tasks with zero unsafe executions, while the unconstrained configuration completed only 17 of 25. Two wrong-entity mutations escaped all consumer-contributed layers; only disambiguation and confirmation mechanisms intercept this class. Both AI conditions delivered 13-18x speedup over manual operation. Critically, removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success. Several safety properties are structurally enforced by code and intercepted all targeted violations regardless of model output. The result is a practical, deployed architecture for making imperfect language models operationally useful in enterprise systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that using LLMs directly as operators in enterprise software is unsafe due to risks like unauthorized actions and malformed requests. It proposes a bounded-autonomy architecture in which models propose actions but all execution is constrained by typed action contracts, permission-aware capabilities, scoped context, pre-side-effect validation, consumer-side boundaries, and optional human approval, with the enterprise application remaining the source of truth for business logic. In a deployed multi-tenant enterprise application, an evaluation across three conditions (manual, unconstrained AI with safety layers disabled, and full bounded autonomy) reports that the bounded system completed 23 of 25 tasks across seven failure families with zero unsafe executions, versus 17 of 25 for unconstrained AI, while both AI conditions achieved 13-18x speedup over manual; several safety properties are structurally enforced by code and intercepted all targeted violations regardless of model output, though two wrong-entity mutations required additional disambiguation.

Significance. If the results hold, the work provides a practical, deployed demonstration that architectural constraints can make imperfect LLMs operationally useful in enterprise settings without sacrificing safety or utility. The three-condition comparison in a real application, including the finding that removing safety layers reduced usefulness due to hallucinated success, offers concrete evidence favoring bounded autonomy over unconstrained use. The structural enforcement of safety properties via contracts and consumer-side execution is a notable strength for reproducibility and generality.

major comments (2)
  1. [Abstract] The evaluation reports clear comparative results from 25 scenario trials but provides no details on scenario selection, how the seven failure families were defined or generated, prompt engineering, statistical significance, error bars, or the precise implementation of the three conditions. This information is load-bearing for the central claims of zero unsafe executions and general robustness, as the scenarios must be shown to have been constructed independently of the architecture's strengths.
  2. [Abstract] The claim that 'several safety properties are structurally enforced by code and intercepted all targeted violations regardless of model output' is qualified by the acknowledgment that two wrong-entity mutations escaped all consumer-contributed layers and required disambiguation/confirmation. The paper should explicitly delineate the scope of structurally enforced properties versus those needing supplementary mechanisms, and confirm that the 25 scenarios exhaustively cover the violation classes the contracts target.
minor comments (1)
  1. [Abstract] The reported 13-18x speedup for both AI conditions over manual operation would benefit from clarification on whether it accounts for any additional turns or human interventions in the bounded-autonomy case.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the practical significance of the bounded-autonomy architecture. We address each major comment below. The requested clarifications on evaluation methodology and scope of enforcement can be incorporated without altering the core claims or results.

read point-by-point responses
  1. Referee: [Abstract] The evaluation reports clear comparative results from 25 scenario trials but provides no details on scenario selection, how the seven failure families were defined or generated, prompt engineering, statistical significance, error bars, or the precise implementation of the three conditions. This information is load-bearing for the central claims of zero unsafe executions and general robustness, as the scenarios must be shown to have been constructed independently of the architecture's strengths.

    Authors: We agree that expanded methodological detail is warranted to support the claims. In the revised manuscript we will add a new subsection (and corresponding abstract text) that specifies: the scenario selection process, which drew from pre-existing enterprise task logs and failure modes observed prior to architecture development to ensure independence; the taxonomy used to define the seven failure families, based on a prior catalog of LLM-induced API errors; the prompt templates and engineering steps for the AI conditions; the exact differences in implementation across the three conditions; and an explicit statement that this is a controlled case study in a deployed multi-tenant system rather than a statistical experiment, so formal significance tests and error bars are not applicable. We will also clarify how the scenarios were constructed to test the targeted violation classes. revision: yes

  2. Referee: [Abstract] The claim that 'several safety properties are structurally enforced by code and intercepted all targeted violations regardless of model output' is qualified by the acknowledgment that two wrong-entity mutations escaped all consumer-contributed layers and required disambiguation/confirmation. The paper should explicitly delineate the scope of structurally enforced properties versus those needing supplementary mechanisms, and confirm that the 25 scenarios exhaustively cover the violation classes the contracts target.

    Authors: We will revise the abstract and body to delineate the scope more explicitly. The structurally enforced properties (typed contracts, permission-aware capabilities, scoped context, and pre-side-effect validation) cover unauthorized actions, malformed requests, cross-workspace execution, and related classes; these intercepted every instance of the targeted violations across the 25 scenarios. The two wrong-entity mutations represent a distinct semantic-disambiguation class outside the contracts' direct enforcement and are handled by the supplementary confirmation layer. We will state that the 25 scenarios were designed to cover the primary violation families addressed by the contracts but do not claim exhaustive coverage of every possible entity-reference mutation; the escaped cases demonstrate the value of the additional mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation is independent of any derivation chain

full rationale

The paper describes a system architecture for bounded autonomy using typed action contracts and consumer-side execution, then reports results from 25 scenario trials across three configurations (manual, unconstrained AI, bounded autonomy). The headline claims rest entirely on these observed outcomes (23/25 completions with zero unsafe executions, 13-18x speedup, structural enforcement of safety properties) rather than on any first-principles derivation, parameter fitting, or self-referential logic. No equations, predictions, ansatzes, or uniqueness theorems appear, and the text contains no self-citations that bear load on the central claims. The evaluation is presented as a direct comparison in a deployed application; the result stands on observed outcomes rather than external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that enterprise applications can accurately define and enforce typed action contracts and validation without error or excessive restriction; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Enterprise applications can correctly define, publish, and maintain typed action contracts, permission-aware capabilities, and validation logic as the source of truth.
    Invoked in the description of the architecture where the enterprise app remains the source of truth for business logic and authorization.

pith-pipeline@v0.9.0 · 5583 in / 1457 out tokens · 49514 ms · 2026-05-10T11:11:58.716850+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 23 canonical work pages · 8 internal anchors

  1. [1]

    Schick, T., et al. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761. [https://arxiv.org/abs/2302.04761](https://arxiv.org/abs/2302.04761)

  2. [2]

    Yao, S., et al. (2023). ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. [https://arxiv.org/abs/2210.03629](https://arxiv.org/abs/2210.03629)

  3. [3]

    Shen, Y., et al. (2023). HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580. [https://arxiv.org/abs/2303.17580](https://arxiv.org/abs/2303.17580)

  4. [4]

    Patil, S. G., et al. (2023). Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334. [https://arxiv.org/abs/2305.15334](https://arxiv.org/abs/2305.15334)

  5. [5]

    Qin, Y., et al. (2023). ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789. [https://arxiv.org/abs/2307.16789](https://arxiv.org/abs/2307.16789)

  6. [6]

    Anthropic. (2024). Model Context Protocol specification. [https://modelcontextprotocol.io](https://modelcontextprotocol.io)

  7. [7]

    Fan, S., Ding, X., Zhang, L., & Mo, L. (2025). MCPToolBench++: A large-scale AI agent Model Context Protocol MCP tool use benchmark. arXiv preprint arXiv:2508.07575. [https://arxiv.org/abs/2508.07575](https://arxiv.org/abs/2508.07575)

  8. [8]

    Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. [https://arxiv.org/abs/2203.02155](https://arxiv.org/abs/2203.02155)

  9. [9]

    Bai, Y., et al. (2022a). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862. [https://arxiv.org/abs/2204.05862](https://arxiv.org/abs/2204.05862)

  10. [10]

    Bai, Y., et al. (2022b). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073. [https://arxiv.org/abs/2212.08073](https://arxiv.org/abs/2212.08073)

  11. [11]

    Meta. (2025). LlamaFirewall: An open source guardrail system for building secure AI agents. arXiv preprint arXiv:2505.03574. [https://arxiv.org/abs/2505.03574](https://arxiv.org/abs/2505.03574)

  12. [12]

    NVIDIA. (2024). NeMo Guardrails: A toolkit for controllable and safe LLM applications. [https://github.com/NVIDIA/NeMo-Guardrails](https://github.com/NVIDIA/NeMo-Guardrails)

  13. [13]

    Avinash, K., Pareek, N., & Hada, R. (2025). Protect: Towards robust guardrailing stack for trustworthy enterprise LLM systems. arXiv preprint arXiv:2510.13351. [https://arxiv.org/abs/2510.13351](https://arxiv.org/abs/2510.13351)

  14. [14]

    Wang, C. L., et al. (2025). MI9: An integrated runtime governance framework for agentic AI. arXiv preprint arXiv:2508.03858. [https://arxiv.org/abs/2508.03858](https://arxiv.org/abs/2508.03858)

  15. [15]

    Willis, J. M. (2026). The PBSAI governance ecosystem: A multi-agent AI reference architecture for securing enterprise AI estates. arXiv preprint arXiv:2602.11301. [https://arxiv.org/abs/2602.11301](https://arxiv.org/abs/2602.11301)

  16. [16]

    Shi, T., et al. (2025). Progent: Programmable privilege control for LLM agents. arXiv preprint arXiv:2504.11703. [https://arxiv.org/abs/2504.11703](https://arxiv.org/abs/2504.11703)

  17. [17]

    Uchibeke, U. (2026). Before the tool call: Deterministic pre-action authorization for autonomous AI agents. arXiv preprint arXiv:2603.20953. [https://arxiv.org/abs/2603.20953](https://arxiv.org/abs/2603.20953)

  18. [18]

    Syros, G., Suri, A., Ginesin, J., Nita-Rotaru, C., & Oprea, A. (2025). SAGA: A security architecture for governing AI agentic systems. arXiv preprint arXiv:2504.21034. [https://arxiv.org/abs/2504.21034](https://arxiv.org/abs/2504.21034)

  19. [19]

    Liu, X., Yang, X., Li, Z., Li, P., & He, R. (2026). AgentHallu: Benchmarking automated hallucination attribution of LLM-based agents. arXiv preprint arXiv:2601.06818. [https://arxiv.org/abs/2601.06818](https://arxiv.org/abs/2601.06818)

  20. [20]

    Kokane, S., et al. (2024). ToolScan: A benchmark for characterizing errors in tool-use LLMs. arXiv preprint arXiv:2411.13547. [https://arxiv.org/abs/2411.13547](https://arxiv.org/abs/2411.13547)

  21. [21]

    Vuddanti, S. V., Shah, A., Chittiprolu, S. K., Song, T., Dev, S., Zhu, K., & Chaudhary, M. (2026). PALADIN: Self-correcting language model agents to cure tool-failure cases. ICLR 2026. arXiv preprint arXiv:2509.25238. [https://arxiv.org/abs/2509.25238](https://arxiv.org/abs/2509.25238)

  22. [22]

    Abel, D., et al. (2017). Agent-agnostic human-in-the-loop reinforcement learning. arXiv preprint arXiv:1701.04079. [https://arxiv.org/abs/1701.04079](https://arxiv.org/abs/1701.04079)

  23. [23]

    Li, Q., et al. (2022). Efficient learning of safe driving policy via human-AI copilot optimization. arXiv preprint arXiv:2202.10341. [https://arxiv.org/abs/2202.10341](https://arxiv.org/abs/2202.10341)

  24. [24]

    Prospeo. (2026). CRM data entry: How to fix it for good. [https://prospeo.io/s/crm-data-entry](https://prospeo.io/s/crm-data-entry)

  25. [25]

    EverReady. (2026). 13 astonishing statistics for CRM data entry automation. [https://everready.ai/13-statistics-for-crm-data-entry-automation/](https://everready.ai/13-statistics-for-crm-data-entry-automation/)

  26. [26]

    Insightly. (2026). Assess CRM performance with CRM benchmarks. [https://www.insightly.com/blog/crm-benchmarks/](https://www.insightly.com/blog/crm-benchmarks/)

  27. [27]

    Everstage. (2026). Sales productivity statistics: Trends & data. [https://www.everstage.com/sales-productivity/sales-productivity-statistics](https://www.everstage.com/sales-productivity/sales-productivity-statistics)

  28. [28]

    Salesforce AI Research. (2026). Generative AI benchmark for CRM. [https://www.salesforceairesearch.com/crm-benchmark](https://www.salesforceairesearch.com/crm-benchmark)

  29. [29]

    Validity. (2026). CRM data & databases: Types, importance, & management tips. [https://www.validity.com/blog/crm-data/](https://www.validity.com/blog/crm-data/)

  30. [30]

    OpenAI. (2024b). GPT-4o mini model. OpenAI API Documentation. [https://developers.openai.com/api/docs/models/gpt-4o-mini](https://developers.openai.com/api/docs/models/gpt-4o-mini)

  31. [31]

    OpenAI. (2025a). o3-mini model. OpenAI API Documentation. [https://developers.openai.com/api/docs/models/o3-mini](https://developers.openai.com/api/docs/models/o3-mini)

  32. [32]

    OpenAI Developer Community. (2025). o3-mini `parallel_tool_calls` not supported. OpenAI Developer Forum. [https://community.openai.com/t/o3-mini-tool-choice-not-working-during-streaming-also-parallel-tool-calls-just-doesnt-work-with-this-model/1113520](https://community.openai.com/t/o3-mini-tool-choice-not-working-during-streaming-also-parallel-tool-calls-just-doesnt-work-with-this-model/1113520)

  33. [33]

    OpenAI. (2025b). Function calling. OpenAI API Documentation. [https://platform.openai.com/docs/guides/function-calling](https://platform.openai.com/docs/guides/function-calling)

  34. [34]

    Kamath, A., Zhang, S., Xu, C., Ugare, S., Singh, G., & Misailovic, S. (2025). Enforcing temporal constraints for LLM agents. arXiv preprint arXiv:2512.23738. [https://arxiv.org/abs/2512.23738](https://arxiv.org/abs/2512.23738)

  35. [35]

    OpenAI. (2026). Building governed AI agents: A practical guide to agentic scaffolding. OpenAI Cookbook. [https://developers.openai.com/cookbook/examples/partners/agentic_governance_guide/agentic_governance_cookbook](https://developers.openai.com/cookbook/examples/partners/agentic_governance_guide/agentic_governance_cookbook)

  36. [36]

    MongoDB. (2026). The case for bounded autonomy---from single agents to reliable agent teams. MongoDB Engineering Blog. [https://www.mongodb.com/company/blog/technical/the-case-for-bounded-autonomy](https://www.mongodb.com/company/blog/technical/the-case-for-bounded-autonomy)

  37. [37]

    KPMG. (2026). AI at scale: How 2025 set the stage for agent-driven enterprise reinvention in 2026. KPMG AI Quarterly Pulse Survey Q4 2025. [https://kpmg.com/us/en/media/news/q4-ai-pulse.html](https://kpmg.com/us/en/media/news/q4-ai-pulse.html)