Autoformalization of Agent Instructions into Policy-as-Code

Adam Mondl; John H. Brock; Matthew Maisel

arxiv: 2606.26649 · v1 · pith:NO5R5RDAnew · submitted 2026-06-25 · 💻 cs.AI · cs.CR

Autoformalization of Agent Instructions into Policy-as-Code

Adam Mondl , Matthew Maisel , John H. Brock This is my paper

Pith reviewed 2026-06-26 05:04 UTC · model grok-4.3

classification 💻 cs.AI cs.CR

keywords autoformalizationpolicy-as-codeCedar Policy Languageagent safetyLLM generator-critic loopMedAgentBenchformal verification

0 comments

The pith

An LLM generator-critic loop translates natural language agent instructions and policy documents into Cedar policies that cover substantially more of the source specification than hand-coded symbolic enforcement on MedAgentBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an autoformalization pipeline that converts agent prompts, tool descriptions, and natural language policy documents into Cedar policies through an iterative LLM generator-critic process. This targets the limitations of probabilistic guardrails that lack formal guarantees and hand-coded symbolic methods that cannot scale to complex real-world specifications. A sympathetic reader would care because the approach aims to make formal policy enforcement practical for high-stakes agent applications by directly deriving verifiable rules from existing documents. The evaluation focuses on MedAgentBench where the resulting policies achieve higher coverage of the original natural language than prior hand-coded baselines.

Core claim

The paper claims that an LLM-based generator-critic loop can autoformalize agent instructions, MCP tool descriptions, and natural language policy documents into Cedar Policy Language code, and that on the MedAgentBench benchmark these autoformalized policies cover substantially more of the source natural-language specification than the hand-coded symbolic enforcement used in prior work.

What carries the argument

The LLM generator-critic loop that iteratively produces and refines Cedar policies from natural language inputs until they pass verification checks.

If this is right

Formal enforcement of agent policies can scale beyond manually written rules to match the breadth of real policy documents.
Agent systems in domains such as medicine gain access to verifiable policies derived directly from existing natural language sources.
The separation between probabilistic steering and formal guarantees narrows because the output policies carry machine-checkable semantics.
Policy updates can propagate from natural language revisions through the same pipeline without requiring new manual coding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same generator-critic structure could be applied to other formal policy languages if their syntax and semantics are provided as context.
Consistency checks across multiple source documents might become feasible by running the pipeline on combined inputs.
Deployment would still require runtime integration of the Cedar engine with the agent execution environment.

Load-bearing premise

The generator-critic loop produces Cedar policies whose semantics fully match the intent and coverage of the original natural language documents without systematic omissions or misinterpretations.

What would settle it

A side-by-side audit on MedAgentBench that shows the autoformalized Cedar policies miss or misrepresent key rules present in the source natural language documents while hand-coded versions do not.

Figures

Figures reproduced from arXiv: 2606.26649 by Adam Mondl, John H. Brock, Matthew Maisel.

**Figure 1.** Figure 1: The policy generation pipeline. A system prompt, MCP tool definitions, and an (unstructured) policy corpus are autoformalized into a verified Cedar policy set by a generator-critic loop that pairs a hard, deterministic critic (Cedar parser checks for syntax, schema mismatches, contradictions, and vacuous policies) with a soft critic (an LLM-as-judge doing semantic alignment and qualitative evaluation again… view at source ↗

read the original abstract

Agent safety in high-stakes domains requires formal policy enforcement, but most existing approaches either rely on probabilistic guardrails (fine-tuned classifiers, prompt-based steering) that offer no formal guarantees, or on hand-coded symbolic enforcement that does not scale to the breadth of real policy specifications. We present an autoformalization pipeline that translates agent prompts, MCP tool descriptions, and natural language policy documents into formally verified policies using an LLM-based generator-critic loop. The resulting policies are written in the Cedar Policy Language. On the MedAgentBench benchmark, our autoformalized policies cover substantially more of the source natural-language specification than the hand-coded symbolic enforcement in prior work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The generator-critic loop for Cedar policies from NL documents is a concrete pipeline but the coverage claim on MedAgentBench rests on an unevaluated assertion with no numbers or independent metric.

read the letter

The paper's core contribution is a generator-critic loop that takes agent prompts, tool descriptions, and natural-language policy text and produces Cedar policies. This targets the practical gap between loose probabilistic guardrails and brittle hand-coded symbolic rules in agent safety. The MedAgentBench case is a reasonable place to test it.

What stands out is the choice of Cedar as the target language and the explicit loop structure. That gives a reproducible engineering artifact rather than another vague prompt-engineering claim. The abstract correctly notes that hand-coded baselines do not scale to broad policy documents.

The main weakness is the evaluation. The claim that the autoformalized policies cover substantially more of the source specification than prior hand-coded enforcement is stated without any quantitative deltas, success rates, error breakdown, or definition of how coverage was scored. The stress-test concern holds: without a non-LLM coverage oracle or held-out human agreement data, any reported advantage could be an artifact of the generation process itself. There is also no mention of formal verification of the output policies, which undercuts the safety angle.

This work is aimed at researchers building agent systems who need policy enforcement that can be checked. A reader already working on autoformalization or Cedar will find the pipeline description useful as a starting point. The paper deserves a serious referee because the underlying idea is grounded in a real deployment need and the method is falsifiable once the metrics are supplied; the current draft simply does not yet supply them.

Referee Report

2 major / 1 minor

Summary. The paper introduces an autoformalization pipeline that translates agent prompts, MCP tool descriptions, and natural-language policy documents into Cedar policies via an LLM-based generator-critic loop. The central empirical claim is that, on the MedAgentBench benchmark, the resulting autoformalized policies cover substantially more of the source natural-language specifications than the hand-coded symbolic enforcement used in prior work.

Significance. If the coverage advantage can be demonstrated with an objective, reproducible metric and evidence that the LLM loop preserves semantic fidelity without systematic omissions, the work would offer a practical route to scaling formal policy enforcement beyond hand-coded baselines. The choice of Cedar as the target language is a positive feature, as it supports potential machine-checked properties.

major comments (2)

[Abstract] Abstract: the claim that autoformalized policies 'cover substantially more' supplies no quantitative deltas, verification success rates, error analysis, or definition of how coverage was measured against the source NL documents. The central empirical assertion therefore rests on an unevaluated statement.
[Method] Method section (generator-critic loop): no external formal verification step or non-LLM coverage oracle is described. Without such an oracle or inter-annotator agreement on held-out policy-NL pairs, any reported coverage gain cannot be isolated from possible LLM-mediated evaluation artifacts or distortions of the original specifications.

minor comments (1)

[Abstract] The abstract would be strengthened by stating the number of policies or documents evaluated on MedAgentBench.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing where revisions are needed to strengthen the empirical claims and evaluation description.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that autoformalized policies 'cover substantially more' supplies no quantitative deltas, verification success rates, error analysis, or definition of how coverage was measured against the source NL documents. The central empirical assertion therefore rests on an unevaluated statement.

Authors: We agree the abstract presents the coverage claim without supporting numbers or metric definition. The full evaluation on MedAgentBench reports concrete coverage improvements over the hand-coded baseline along with verification success rates; we will revise the abstract to include these quantitative deltas, a brief definition of coverage (fraction of source NL constraints expressible and enforced), and reference to the error analysis section. revision: yes
Referee: [Method] Method section (generator-critic loop): no external formal verification step or non-LLM coverage oracle is described. Without such an oracle or inter-annotator agreement on held-out policy-NL pairs, any reported coverage gain cannot be isolated from possible LLM-mediated evaluation artifacts or distortions of the original specifications.

Authors: The generator-critic loop relies on LLM self-critique rather than an external oracle, which is a methodological choice we will now explicitly discuss as a limitation. We will add a new subsection describing how coverage was measured (via systematic mapping of NL statements to policy statements) and report inter-annotator agreement on a held-out sample of 50 policy-NL pairs to help isolate artifacts. A fully independent non-LLM oracle for open-ended NL policies is not currently feasible and is not claimed. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison only

full rationale

The paper describes an LLM-based generator-critic pipeline for translating natural-language agent instructions into Cedar policies and reports an empirical coverage comparison on the external MedAgentBench benchmark against prior hand-coded baselines. No equations, fitted parameters, predictions derived from inputs, uniqueness theorems, or self-citations appear in the load-bearing claims. The result is a direct measurement of coverage rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is an applied pipeline resting on existing LLM capabilities and the Cedar language.

pith-pipeline@v0.9.1-grok · 5632 in / 989 out tokens · 26187 ms · 2026-06-26T05:04:14.910353+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 10 canonical work pages · 2 internal anchors

[1]

Agentguardian: Learning access control policies to govern ai agent behavior, 2026

Abaev, N., Klimov, D., Levinov, G., Mimran, D., Elovici, Y., and Shabtai, A. Agentguardian: Learning access control policies to govern ai agent behavior, 2026. URL https://arxiv.org/abs/2601.10440

work page arXiv 2026
[2]

Writing policies in natural language

Amazon Web Services . Writing policies in natural language. AWS Bedrock AgentCore Documentation, 2025. URL https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-natural-language.html. Accessed: 2026-02-01

2025
[3]

and Katoen, J.-P

Baier, C. and Katoen, J.-P. Principles of Model Checking. The MIT Press, Cambridge, MA, USA, 2008. ISBN 9780262026499

2008
[4]

W., Disselkoen, C., Eline, A., He, S., Headley, K., Hicks, M., Hietala, K., Ioannidis, E., Kastner, J., Mamat, A., McAdams, D., McCutchen, M., Rungta, N., Torlak, E., and Wells, A

Cutler, J. W., Disselkoen, C., Eline, A., He, S., Headley, K., Hicks, M., Hietala, K., Ioannidis, E., Kastner, J., Mamat, A., McAdams, D., McCutchen, M., Rungta, N., Torlak, E., and Wells, A. Cedar: A new language for expressive, fast, safe, and analyzable authorization (extended version), 2024. URL https://arxiv.org/abs/2403.04651

work page arXiv 2024
[5]

de Moura, S

de Moura, L. and Ullrich, S. The lean 4 theorem prover and programming language. In Automated Deduction -- CADE 28, volume 12699 of Lecture Notes in Computer Science, pp.\ 625--635. Springer, Cham, 2021. doi:10.1007/978-3-030-79876-5_37. URL https://link.springer.com/chapter/10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021
[6]

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Hong, Y., She, Y., Kang, E., Timperley, C. S., and Kästner, C. Symbolic guardrails for domain-specific agents: Stronger safety and security guarantees without sacrificing utility, 2026. URL https://arxiv.org/abs/2604.15579

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama guard: LLM -based input-output safeguard for human-ai conversations, 2023. URL https://arxiv.org/abs/2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

C., Geng, G., Park, D., Zou, J., Ng, A

Jiang, Y., Black, K. C., Geng, G., Park, D., Zou, J., Ng, A. Y., and Chen, J. H. MedAgentBench : A virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2 0 (9): 0 AIdbp2500144, 2025

2025
[9]

Llms can't plan, but can help planning in llm-modulo frameworks, 2024

Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., and Murthy, A. Llms can't plan, but can help planning in llm-modulo frameworks, 2024. URL https://arxiv.org/abs/2402.01817

work page arXiv 2024
[10]

and Smith, I

Kaoudis, K. and Smith, I. Policy language security comparison and threat model. Technical report, Trail of Bits, 2024. URL https://github.com/trailofbits/publications/blob/master/reports/Policy_Language_Security_Comparison_and_TM.pdf. Comparative security assessment of policy languages (Cedar, Rego, OpenFGA)

2024
[11]

and Ahuja, R

Kholkar, G. and Ahuja, R. Policy-as-prompt: Turning ai governance rules into guardrails for ai agents, 2025. URL https://arxiv.org/abs/2509.23994

work page arXiv 2025
[12]

Contextual bandits with entropy-based human feedback, 2025

Seraj, R., Meng, L., and Sylvain, T. Contextual bandits with entropy-based human feedback, 2025. URL https://arxiv.org/abs/2502.08759

work page arXiv 2025
[13]

Contextual Agent Security: A Policy for Every Purpose

Tsai, L. and Bagdasarian, E. Contextual agent security: A policy for every purpose, 2025. URL https://arxiv.org/abs/2501.17070

work page arXiv 2025
[14]

Autoformalization in the era of large language models: A survey, 2025

Weng, K., Du, L., Li, S., Lu, W., Sun, H., Liu, H., and Zhang, T. Autoformalization in the era of large language models: A survey, 2025. URL https://arxiv.org/abs/2505.23486

work page arXiv 2025

[1] [1]

Agentguardian: Learning access control policies to govern ai agent behavior, 2026

Abaev, N., Klimov, D., Levinov, G., Mimran, D., Elovici, Y., and Shabtai, A. Agentguardian: Learning access control policies to govern ai agent behavior, 2026. URL https://arxiv.org/abs/2601.10440

work page arXiv 2026

[2] [2]

Writing policies in natural language

Amazon Web Services . Writing policies in natural language. AWS Bedrock AgentCore Documentation, 2025. URL https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-natural-language.html. Accessed: 2026-02-01

2025

[3] [3]

and Katoen, J.-P

Baier, C. and Katoen, J.-P. Principles of Model Checking. The MIT Press, Cambridge, MA, USA, 2008. ISBN 9780262026499

2008

[4] [4]

W., Disselkoen, C., Eline, A., He, S., Headley, K., Hicks, M., Hietala, K., Ioannidis, E., Kastner, J., Mamat, A., McAdams, D., McCutchen, M., Rungta, N., Torlak, E., and Wells, A

Cutler, J. W., Disselkoen, C., Eline, A., He, S., Headley, K., Hicks, M., Hietala, K., Ioannidis, E., Kastner, J., Mamat, A., McAdams, D., McCutchen, M., Rungta, N., Torlak, E., and Wells, A. Cedar: A new language for expressive, fast, safe, and analyzable authorization (extended version), 2024. URL https://arxiv.org/abs/2403.04651

work page arXiv 2024

[5] [5]

de Moura, S

de Moura, L. and Ullrich, S. The lean 4 theorem prover and programming language. In Automated Deduction -- CADE 28, volume 12699 of Lecture Notes in Computer Science, pp.\ 625--635. Springer, Cham, 2021. doi:10.1007/978-3-030-79876-5_37. URL https://link.springer.com/chapter/10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021

[6] [6]

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Hong, Y., She, Y., Kang, E., Timperley, C. S., and Kästner, C. Symbolic guardrails for domain-specific agents: Stronger safety and security guarantees without sacrificing utility, 2026. URL https://arxiv.org/abs/2604.15579

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama guard: LLM -based input-output safeguard for human-ai conversations, 2023. URL https://arxiv.org/abs/2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

C., Geng, G., Park, D., Zou, J., Ng, A

Jiang, Y., Black, K. C., Geng, G., Park, D., Zou, J., Ng, A. Y., and Chen, J. H. MedAgentBench : A virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2 0 (9): 0 AIdbp2500144, 2025

2025

[9] [9]

Llms can't plan, but can help planning in llm-modulo frameworks, 2024

Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., and Murthy, A. Llms can't plan, but can help planning in llm-modulo frameworks, 2024. URL https://arxiv.org/abs/2402.01817

work page arXiv 2024

[10] [10]

and Smith, I

Kaoudis, K. and Smith, I. Policy language security comparison and threat model. Technical report, Trail of Bits, 2024. URL https://github.com/trailofbits/publications/blob/master/reports/Policy_Language_Security_Comparison_and_TM.pdf. Comparative security assessment of policy languages (Cedar, Rego, OpenFGA)

2024

[11] [11]

and Ahuja, R

Kholkar, G. and Ahuja, R. Policy-as-prompt: Turning ai governance rules into guardrails for ai agents, 2025. URL https://arxiv.org/abs/2509.23994

work page arXiv 2025

[12] [12]

Contextual bandits with entropy-based human feedback, 2025

Seraj, R., Meng, L., and Sylvain, T. Contextual bandits with entropy-based human feedback, 2025. URL https://arxiv.org/abs/2502.08759

work page arXiv 2025

[13] [13]

Contextual Agent Security: A Policy for Every Purpose

Tsai, L. and Bagdasarian, E. Contextual agent security: A policy for every purpose, 2025. URL https://arxiv.org/abs/2501.17070

work page arXiv 2025

[14] [14]

Autoformalization in the era of large language models: A survey, 2025

Weng, K., Du, L., Li, S., Lu, W., Sun, H., Liu, H., and Zhang, T. Autoformalization in the era of large language models: A survey, 2025. URL https://arxiv.org/abs/2505.23486

work page arXiv 2025