pith. sign in

arxiv: 2606.26649 · v1 · pith:NO5R5RDAnew · submitted 2026-06-25 · 💻 cs.AI · cs.CR

Autoformalization of Agent Instructions into Policy-as-Code

Pith reviewed 2026-06-26 05:04 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords autoformalizationpolicy-as-codeCedar Policy Languageagent safetyLLM generator-critic loopMedAgentBenchformal verification
0
0 comments X

The pith

An LLM generator-critic loop translates natural language agent instructions and policy documents into Cedar policies that cover substantially more of the source specification than hand-coded symbolic enforcement on MedAgentBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an autoformalization pipeline that converts agent prompts, tool descriptions, and natural language policy documents into Cedar policies through an iterative LLM generator-critic process. This targets the limitations of probabilistic guardrails that lack formal guarantees and hand-coded symbolic methods that cannot scale to complex real-world specifications. A sympathetic reader would care because the approach aims to make formal policy enforcement practical for high-stakes agent applications by directly deriving verifiable rules from existing documents. The evaluation focuses on MedAgentBench where the resulting policies achieve higher coverage of the original natural language than prior hand-coded baselines.

Core claim

The paper claims that an LLM-based generator-critic loop can autoformalize agent instructions, MCP tool descriptions, and natural language policy documents into Cedar Policy Language code, and that on the MedAgentBench benchmark these autoformalized policies cover substantially more of the source natural-language specification than the hand-coded symbolic enforcement used in prior work.

What carries the argument

The LLM generator-critic loop that iteratively produces and refines Cedar policies from natural language inputs until they pass verification checks.

If this is right

  • Formal enforcement of agent policies can scale beyond manually written rules to match the breadth of real policy documents.
  • Agent systems in domains such as medicine gain access to verifiable policies derived directly from existing natural language sources.
  • The separation between probabilistic steering and formal guarantees narrows because the output policies carry machine-checkable semantics.
  • Policy updates can propagate from natural language revisions through the same pipeline without requiring new manual coding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generator-critic structure could be applied to other formal policy languages if their syntax and semantics are provided as context.
  • Consistency checks across multiple source documents might become feasible by running the pipeline on combined inputs.
  • Deployment would still require runtime integration of the Cedar engine with the agent execution environment.

Load-bearing premise

The generator-critic loop produces Cedar policies whose semantics fully match the intent and coverage of the original natural language documents without systematic omissions or misinterpretations.

What would settle it

A side-by-side audit on MedAgentBench that shows the autoformalized Cedar policies miss or misrepresent key rules present in the source natural language documents while hand-coded versions do not.

Figures

Figures reproduced from arXiv: 2606.26649 by Adam Mondl, John H. Brock, Matthew Maisel.

Figure 1
Figure 1. Figure 1: The policy generation pipeline. A system prompt, MCP tool definitions, and an (unstructured) policy corpus are autoformalized into a verified Cedar policy set by a generator-critic loop that pairs a hard, deterministic critic (Cedar parser checks for syntax, schema mismatches, contradictions, and vacuous policies) with a soft critic (an LLM-as-judge doing semantic alignment and qualitative evaluation again… view at source ↗
read the original abstract

Agent safety in high-stakes domains requires formal policy enforcement, but most existing approaches either rely on probabilistic guardrails (fine-tuned classifiers, prompt-based steering) that offer no formal guarantees, or on hand-coded symbolic enforcement that does not scale to the breadth of real policy specifications. We present an autoformalization pipeline that translates agent prompts, MCP tool descriptions, and natural language policy documents into formally verified policies using an LLM-based generator-critic loop. The resulting policies are written in the Cedar Policy Language. On the MedAgentBench benchmark, our autoformalized policies cover substantially more of the source natural-language specification than the hand-coded symbolic enforcement in prior work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces an autoformalization pipeline that translates agent prompts, MCP tool descriptions, and natural-language policy documents into Cedar policies via an LLM-based generator-critic loop. The central empirical claim is that, on the MedAgentBench benchmark, the resulting autoformalized policies cover substantially more of the source natural-language specifications than the hand-coded symbolic enforcement used in prior work.

Significance. If the coverage advantage can be demonstrated with an objective, reproducible metric and evidence that the LLM loop preserves semantic fidelity without systematic omissions, the work would offer a practical route to scaling formal policy enforcement beyond hand-coded baselines. The choice of Cedar as the target language is a positive feature, as it supports potential machine-checked properties.

major comments (2)
  1. [Abstract] Abstract: the claim that autoformalized policies 'cover substantially more' supplies no quantitative deltas, verification success rates, error analysis, or definition of how coverage was measured against the source NL documents. The central empirical assertion therefore rests on an unevaluated statement.
  2. [Method] Method section (generator-critic loop): no external formal verification step or non-LLM coverage oracle is described. Without such an oracle or inter-annotator agreement on held-out policy-NL pairs, any reported coverage gain cannot be isolated from possible LLM-mediated evaluation artifacts or distortions of the original specifications.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by stating the number of policies or documents evaluated on MedAgentBench.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing where revisions are needed to strengthen the empirical claims and evaluation description.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that autoformalized policies 'cover substantially more' supplies no quantitative deltas, verification success rates, error analysis, or definition of how coverage was measured against the source NL documents. The central empirical assertion therefore rests on an unevaluated statement.

    Authors: We agree the abstract presents the coverage claim without supporting numbers or metric definition. The full evaluation on MedAgentBench reports concrete coverage improvements over the hand-coded baseline along with verification success rates; we will revise the abstract to include these quantitative deltas, a brief definition of coverage (fraction of source NL constraints expressible and enforced), and reference to the error analysis section. revision: yes

  2. Referee: [Method] Method section (generator-critic loop): no external formal verification step or non-LLM coverage oracle is described. Without such an oracle or inter-annotator agreement on held-out policy-NL pairs, any reported coverage gain cannot be isolated from possible LLM-mediated evaluation artifacts or distortions of the original specifications.

    Authors: The generator-critic loop relies on LLM self-critique rather than an external oracle, which is a methodological choice we will now explicitly discuss as a limitation. We will add a new subsection describing how coverage was measured (via systematic mapping of NL statements to policy statements) and report inter-annotator agreement on a held-out sample of 50 policy-NL pairs to help isolate artifacts. A fully independent non-LLM oracle for open-ended NL policies is not currently feasible and is not claimed. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison only

full rationale

The paper describes an LLM-based generator-critic pipeline for translating natural-language agent instructions into Cedar policies and reports an empirical coverage comparison on the external MedAgentBench benchmark against prior hand-coded baselines. No equations, fitted parameters, predictions derived from inputs, uniqueness theorems, or self-citations appear in the load-bearing claims. The result is a direct measurement of coverage rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is an applied pipeline resting on existing LLM capabilities and the Cedar language.

pith-pipeline@v0.9.1-grok · 5632 in / 989 out tokens · 26187 ms · 2026-06-26T05:04:14.910353+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1]

    Agentguardian: Learning access control policies to govern ai agent behavior, 2026

    Abaev, N., Klimov, D., Levinov, G., Mimran, D., Elovici, Y., and Shabtai, A. Agentguardian: Learning access control policies to govern ai agent behavior, 2026. URL https://arxiv.org/abs/2601.10440

  2. [2]

    Writing policies in natural language

    Amazon Web Services . Writing policies in natural language. AWS Bedrock AgentCore Documentation, 2025. URL https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-natural-language.html. Accessed: 2026-02-01

  3. [3]

    and Katoen, J.-P

    Baier, C. and Katoen, J.-P. Principles of Model Checking. The MIT Press, Cambridge, MA, USA, 2008. ISBN 9780262026499

  4. [4]

    W., Disselkoen, C., Eline, A., He, S., Headley, K., Hicks, M., Hietala, K., Ioannidis, E., Kastner, J., Mamat, A., McAdams, D., McCutchen, M., Rungta, N., Torlak, E., and Wells, A

    Cutler, J. W., Disselkoen, C., Eline, A., He, S., Headley, K., Hicks, M., Hietala, K., Ioannidis, E., Kastner, J., Mamat, A., McAdams, D., McCutchen, M., Rungta, N., Torlak, E., and Wells, A. Cedar: A new language for expressive, fast, safe, and analyzable authorization (extended version), 2024. URL https://arxiv.org/abs/2403.04651

  5. [5]

    de Moura, S

    de Moura, L. and Ullrich, S. The lean 4 theorem prover and programming language. In Automated Deduction -- CADE 28, volume 12699 of Lecture Notes in Computer Science, pp.\ 625--635. Springer, Cham, 2021. doi:10.1007/978-3-030-79876-5_37. URL https://link.springer.com/chapter/10.1007/978-3-030-79876-5_37

  6. [6]

    Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

    Hong, Y., She, Y., Kang, E., Timperley, C. S., and Kästner, C. Symbolic guardrails for domain-specific agents: Stronger safety and security guarantees without sacrificing utility, 2026. URL https://arxiv.org/abs/2604.15579

  7. [7]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama guard: LLM -based input-output safeguard for human-ai conversations, 2023. URL https://arxiv.org/abs/2312.06674

  8. [8]

    C., Geng, G., Park, D., Zou, J., Ng, A

    Jiang, Y., Black, K. C., Geng, G., Park, D., Zou, J., Ng, A. Y., and Chen, J. H. MedAgentBench : A virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2 0 (9): 0 AIdbp2500144, 2025

  9. [9]

    Llms can't plan, but can help planning in llm-modulo frameworks, 2024

    Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., and Murthy, A. Llms can't plan, but can help planning in llm-modulo frameworks, 2024. URL https://arxiv.org/abs/2402.01817

  10. [10]

    and Smith, I

    Kaoudis, K. and Smith, I. Policy language security comparison and threat model. Technical report, Trail of Bits, 2024. URL https://github.com/trailofbits/publications/blob/master/reports/Policy_Language_Security_Comparison_and_TM.pdf. Comparative security assessment of policy languages (Cedar, Rego, OpenFGA)

  11. [11]

    and Ahuja, R

    Kholkar, G. and Ahuja, R. Policy-as-prompt: Turning ai governance rules into guardrails for ai agents, 2025. URL https://arxiv.org/abs/2509.23994

  12. [12]

    Contextual bandits with entropy-based human feedback, 2025

    Seraj, R., Meng, L., and Sylvain, T. Contextual bandits with entropy-based human feedback, 2025. URL https://arxiv.org/abs/2502.08759

  13. [13]

    Contextual Agent Security: A Policy for Every Purpose

    Tsai, L. and Bagdasarian, E. Contextual agent security: A policy for every purpose, 2025. URL https://arxiv.org/abs/2501.17070

  14. [14]

    Autoformalization in the era of large language models: A survey, 2025

    Weng, K., Du, L., Li, S., Lu, W., Sun, H., Liu, H., and Zhang, T. Autoformalization in the era of large language models: A survey, 2025. URL https://arxiv.org/abs/2505.23486