
arxiv: 2605.20874 · v1 · pith:7G326TAOnew · submitted 2026-05-20 · 💻 cs.AI · cs.SE

Governance by Construction for Generalist Agents

Pith reviewed 2026-05-21 04:48 UTC · model grok-4.3

keywords: LLM agents · policy as code · governance · enterprise AI · compliance · human-in-the-loop · agent workflows

The pith

A modular policy-as-code layer steers generalist LLM agents through five checkpoints to enforce compliance without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that governance can be built directly into the execution pipeline of a generalist agent by intercepting it at fixed structural points rather than retraining or rebuilding it for each new domain. These interventions occur upstream of planning, inside the reasoning prompt, at tool boundaries, as human approval gates, and on final outputs. A reader would care because this setup promises to make autonomous agents usable in regulated settings such as healthcare while keeping behavior predictable and auditable across compound workflows.

Core claim

The CUGA policy system composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning by intercepting the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter).

What carries the argument

Five structural checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, Output Formatter) that perform policy interventions at successive stages of the agent's execution pipeline.
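
The five checkpoints can be read as successive stages wrapped around an unmodified agent. The following is a minimal sketch under stated assumptions, not CUGA's implementation: the `Policy` fields, `run_governed`, the stub planner, and the tool names are all hypothetical and chosen only to show where each interception would sit.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Policy:
    """Hypothetical policy bundle; field names are illustrative, not CUGA's schema."""
    blocked_intents: set[str] = field(default_factory=set)
    playbook: str = ""
    allowed_tools: set[str] = field(default_factory=set)
    needs_approval: set[str] = field(default_factory=set)
    redacted_fields: set[str] = field(default_factory=set)

def run_governed(intent: str,
                 planner: Callable[[str, str], list],
                 tools: dict,
                 policy: Policy,
                 approve: Callable[[str, dict], bool]):
    """Run one request through five checkpoints; returns (output, audit_log)."""
    audit = []
    # 1. Intent Guard: reject the request before any planning happens.
    if intent in policy.blocked_intents:
        audit.append(("intent_guard", "blocked", intent))
        return None, audit
    audit.append(("intent_guard", "passed", intent))
    # 2. Playbook: inject guidance into the system prompt seen by the planner.
    system_prompt = "You are a generalist agent.\n" + policy.playbook
    audit.append(("playbook", "injected", policy.playbook))
    result = {}
    for tool_name, args in planner(system_prompt, intent):
        # 3. Tool Guide: only tools the policy allows may be called.
        if tool_name not in policy.allowed_tools:
            audit.append(("tool_guide", "denied", tool_name))
            continue
        # 4. Tool Approvals: high-risk tools need an external human decision.
        if tool_name in policy.needs_approval and not approve(tool_name, args):
            audit.append(("tool_approvals", "rejected", tool_name))
            continue
        audit.append(("tool_call", "executed", tool_name))
        result.update(tools[tool_name](**args))
    # 5. Output Formatter: strip protected fields from the final response.
    output = {k: v for k, v in result.items() if k not in policy.redacted_fields}
    audit.append(("output_formatter", "redacted", sorted(set(result) - set(output))))
    return output, audit

# Stand-in for the LLM agent: always proposes a lookup followed by a deletion.
def stub_planner(system_prompt: str, intent: str) -> list:
    return [("lookup_patient", {"patient_id": "p1"}),
            ("delete_record", {"patient_id": "p1"})]

TOOLS = {
    "lookup_patient": lambda patient_id: {"name": "Ada", "ssn": "000-00-0000"},
    "delete_record": lambda patient_id: {"deleted": True},
}

POLICY = Policy(
    blocked_intents={"export_all_records"},
    playbook="Look up the patient before any modification.",
    allowed_tools={"lookup_patient", "delete_record"},
    needs_approval={"delete_record"},
    redacted_fields={"ssn"},
)

# The human reviewer declines every approval request in this run.
output, log = run_governed("update_patient", stub_planner, TOOLS, POLICY,
                           approve=lambda tool, args: False)
```

In this sketch only the Playbook stage depends on the model following instructions; the other four stages are ordinary code outside the reasoning loop, which is the property the paper's argument leans on.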

If this is right

  • Agents can be deployed in new regulated domains by updating only the policy definitions rather than retraining the base model.
  • High-risk actions are automatically routed through human approval without altering the agent's core reasoning loop.
  • Policy adherence and execution traces become continuously auditable because interventions are explicit and logged at each checkpoint.
  • Different enterprise workflows can reuse the same generalist agent by swapping modular policy sets at runtime.
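
The bullets on auditability and runtime policy swapping can be illustrated with a small sketch. It assumes policies are plain data keyed by domain name; `GovernedAgent`, `decide`, the policy set names, and the tool names are hypothetical and are not taken from the paper or the CUGA codebase.

```python
def decide(policy: dict, tool: str) -> str:
    """Return 'deny', 'needs_approval', or 'allow' for a tool under one policy."""
    if tool not in policy["allowed_tools"]:
        return "deny"
    if tool in policy["needs_approval"]:
        return "needs_approval"
    return "allow"

class GovernedAgent:
    """Hypothetical wrapper: one agent, several swappable policy sets, one audit log."""

    def __init__(self, policy_sets: dict, active: str):
        self.policy_sets = policy_sets
        self.active = active
        self.audit_log = []

    def swap_policy(self, name: str) -> None:
        # Swapping changes only the data consulted at the checkpoints.
        if name not in self.policy_sets:
            raise KeyError(name)
        self.audit_log.append({"event": "policy_swap", "from": self.active, "to": name})
        self.active = name

    def request_tool(self, tool: str) -> str:
        # Every decision is logged together with the policy that produced it.
        decision = decide(self.policy_sets[self.active], tool)
        self.audit_log.append({"event": "tool_request", "policy": self.active,
                               "tool": tool, "decision": decision})
        return decision

POLICY_SETS = {
    "healthcare": {"allowed_tools": {"lookup_patient", "delete_record"},
                   "needs_approval": {"delete_record"}},
    "insurance": {"allowed_tools": {"lookup_claim"},
                  "needs_approval": set()},
}

agent = GovernedAgent(POLICY_SETS, active="healthcare")
first = agent.request_tool("delete_record")   # routed to a human under healthcare
agent.swap_policy("insurance")
second = agent.request_tool("delete_record")  # not an allowed tool under insurance
```

Because each log entry records the active policy alongside the decision, a reviewer could reconstruct why the same tool request was treated differently before and after the swap.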

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the checkpoints prove robust, organizations could standardize on a small set of governance primitives across multiple agent platforms.
  • The same interception pattern might extend to non-LLM autonomous systems that perform sequenced tool use.
  • Runtime policy injection could reduce the cost of adapting agents to changing regulations compared with periodic fine-tuning cycles.

Load-bearing premise

External policy interventions at the five checkpoints can reliably steer and constrain generalist LLM agents across arbitrary compound workflows without introducing new failure modes or requiring model changes.

What would settle it

A controlled test in which the agent completes a compound workflow that violates an active policy, such as executing a restricted tool or exposing protected information, while all five checkpoints remain active.
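
One way such a test could be scored is sketched below. It assumes each run yields a list of executed tool names and a final output dictionary; `find_violations` and the example tool and field names are hypothetical. Any run with all checkpoints active and a non-empty violation list would count as a counterexample.

```python
def find_violations(executed_tools, final_output, restricted_tools, protected_fields):
    """List policy violations observed in one completed run."""
    violations = []
    for tool in executed_tools:
        # A tool the active policy forbids was nevertheless executed.
        if tool in restricted_tools:
            violations.append(("restricted_tool_executed", tool))
    for key in final_output:
        # A field the active policy protects reached the final response.
        if key in protected_fields:
            violations.append(("protected_field_exposed", key))
    return violations

RESTRICTED = {"delete_record"}
PROTECTED = {"ssn"}

# A compliant run: only a permitted tool ran, and no protected field leaked.
clean = find_violations(["lookup_patient"], {"name": "Ada"}, RESTRICTED, PROTECTED)

# A run that would settle the question against the paper's premise.
bad = find_violations(["lookup_patient", "delete_record"],
                      {"name": "Ada", "ssn": "000-00-0000"},
                      RESTRICTED, PROTECTED)
```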


Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces CUGA's policy system, a modular policy-as-code layer designed to provide governance for generalist LLM agents. It enforces policy interventions at five structural checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, and Output Formatter) to achieve predictable, auditable, and compliance-aware behavior in compound workflows without requiring model fine-tuning. The approach is illustrated with a qualitative healthcare demo scenario that shows dynamic playbook injection, intent guards, and human-in-the-loop approvals.

Significance. The proposed architecture addresses an important practical challenge in deploying autonomous agents in enterprise environments by embedding governance directly into the agent's execution pipeline. The use of typed governance primitives and the modular composition with existing generalist agents is a notable strength. However, as a primarily descriptive demo without quantitative evaluation, its significance depends on future validation of the reliability of the prompt-based interventions.

major comments (1)
  1. Abstract: The central claim that the system delivers 'predictable, auditable, and compliance-aware behavior' and improves 'policy adherence and execution consistency' is load-bearing but unsupported by any metrics, error analysis, or adherence rates from the healthcare scenario. The description of the five checkpoints relies on the assumption that the generalist LLM will consistently follow the injected constraints, yet no evidence or analysis is provided to address known issues like instruction drift in long workflows.
minor comments (1)
  1. The manuscript would benefit from clearer notation or diagrams illustrating the flow through the five checkpoints to aid reader understanding of the runtime architecture.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for recognizing the practical importance of embedding governance into generalist agent pipelines and for the constructive feedback on empirical support. We agree that the current work is a descriptive demonstration of the CUGA architecture rather than a quantitative study, and we will revise the manuscript to qualify claims and address limitations explicitly.

  1. Referee: Abstract: The central claim that the system delivers 'predictable, auditable, and compliance-aware behavior' and improves 'policy adherence and execution consistency' is load-bearing but unsupported by any metrics, error analysis, or adherence rates from the healthcare scenario. The description of the five checkpoints relies on the assumption that the generalist LLM will consistently follow the injected constraints, yet no evidence or analysis is provided to address known issues like instruction drift in long workflows.

    Authors: We acknowledge that the abstract presents the intended outcomes of the architecture in strong terms without accompanying quantitative metrics or error analysis from the healthcare demo. This manuscript is positioned as an architectural demonstration illustrating how typed policy primitives can be composed at five structural checkpoints; the scenario shows dynamic playbook injection, intent blocking, and human-in-the-loop approvals in a qualitative setting. The multi-stage design is intended to reduce dependence on any single prompt by adding external enforcement layers (Tool Approvals, Output Formatter) that operate outside the LLM reasoning loop. Nevertheless, we agree that explicit discussion of instruction drift and other LLM reliability issues is warranted. We will revise the abstract to frame the benefits as 'designed to enable' rather than 'delivers,' qualify the healthcare example as illustrative, and add a dedicated limitations subsection that discusses prompt-based intervention reliability, potential drift in extended workflows, and the need for future quantitative evaluation.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: system design description without derivations or self-referential claims


The manuscript presents CUGA's policy system as a modular policy-as-code architecture that intercepts generalist LLM agents at five named checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, Output Formatter). No equations, fitted parameters, predictions, or first-principles derivations appear anywhere in the provided text. All claims rest on descriptive system composition and a qualitative healthcare scenario rather than any reduction of outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is therefore self-contained as an engineering artifact; the absence of a derivation chain precludes any circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about LLM agent behavior and introduces no free parameters, new entities, or non-standard axioms beyond the described policy components.

axioms (1)
  • Domain assumption: Generalist LLM agents can be effectively steered and constrained through external runtime interventions at planning, prompting, tool, approval, and output stages.
    This assumption underpins the claim that the five checkpoints deliver governance without model fine-tuning.

pith-pipeline@v0.9.0 · 5818 in / 1182 out tokens · 35908 ms · 2026-05-21T04:48:54.434453+00:00 · methodology

