Autoformalization of Agent Instructions into Policy-as-Code
Pith reviewed 2026-06-26 05:04 UTC · model grok-4.3
The pith
An LLM generator-critic loop translates natural language agent instructions and policy documents into Cedar policies that cover substantially more of the source specification than hand-coded symbolic enforcement on MedAgentBench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that an LLM-based generator-critic loop can autoformalize agent instructions, MCP tool descriptions, and natural language policy documents into Cedar Policy Language code, and that on the MedAgentBench benchmark these autoformalized policies cover substantially more of the source natural-language specification than the hand-coded symbolic enforcement used in prior work.
What carries the argument
The LLM generator-critic loop that iteratively produces and refines Cedar policies from natural language inputs until they pass verification checks.
If this is right
- Formal enforcement of agent policies can scale beyond manually written rules to match the breadth of real policy documents.
- Agent systems in domains such as medicine gain access to verifiable policies derived directly from existing natural language sources.
- The separation between probabilistic steering and formal guarantees narrows because the output policies carry machine-checkable semantics.
- Policy updates can propagate from natural language revisions through the same pipeline without requiring new manual coding.
Where Pith is reading between the lines
- The same generator-critic structure could be applied to other formal policy languages if their syntax and semantics are provided as context.
- Consistency checks across multiple source documents might become feasible by running the pipeline on combined inputs.
- Deployment would still require runtime integration of the Cedar engine with the agent execution environment.
Load-bearing premise
The generator-critic loop produces Cedar policies whose semantics fully match the intent and coverage of the original natural language documents without systematic omissions or misinterpretations.
What would settle it
A side-by-side audit on MedAgentBench that shows the autoformalized Cedar policies miss or misrepresent key rules present in the source natural language documents while hand-coded versions do not.
Figures
read the original abstract
Agent safety in high-stakes domains requires formal policy enforcement, but most existing approaches either rely on probabilistic guardrails (fine-tuned classifiers, prompt-based steering) that offer no formal guarantees, or on hand-coded symbolic enforcement that does not scale to the breadth of real policy specifications. We present an autoformalization pipeline that translates agent prompts, MCP tool descriptions, and natural language policy documents into formally verified policies using an LLM-based generator-critic loop. The resulting policies are written in the Cedar Policy Language. On the MedAgentBench benchmark, our autoformalized policies cover substantially more of the source natural-language specification than the hand-coded symbolic enforcement in prior work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an autoformalization pipeline that translates agent prompts, MCP tool descriptions, and natural-language policy documents into Cedar policies via an LLM-based generator-critic loop. The central empirical claim is that, on the MedAgentBench benchmark, the resulting autoformalized policies cover substantially more of the source natural-language specifications than the hand-coded symbolic enforcement used in prior work.
Significance. If the coverage advantage can be demonstrated with an objective, reproducible metric and evidence that the LLM loop preserves semantic fidelity without systematic omissions, the work would offer a practical route to scaling formal policy enforcement beyond hand-coded baselines. The choice of Cedar as the target language is a positive feature, as it supports potential machine-checked properties.
major comments (2)
- [Abstract] Abstract: the claim that autoformalized policies 'cover substantially more' supplies no quantitative deltas, verification success rates, error analysis, or definition of how coverage was measured against the source NL documents. The central empirical assertion therefore rests on an unevaluated statement.
- [Method] Method section (generator-critic loop): no external formal verification step or non-LLM coverage oracle is described. Without such an oracle or inter-annotator agreement on held-out policy-NL pairs, any reported coverage gain cannot be isolated from possible LLM-mediated evaluation artifacts or distortions of the original specifications.
minor comments (1)
- [Abstract] The abstract would be strengthened by stating the number of policies or documents evaluated on MedAgentBench.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, agreeing where revisions are needed to strengthen the empirical claims and evaluation description.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that autoformalized policies 'cover substantially more' supplies no quantitative deltas, verification success rates, error analysis, or definition of how coverage was measured against the source NL documents. The central empirical assertion therefore rests on an unevaluated statement.
Authors: We agree the abstract presents the coverage claim without supporting numbers or metric definition. The full evaluation on MedAgentBench reports concrete coverage improvements over the hand-coded baseline along with verification success rates; we will revise the abstract to include these quantitative deltas, a brief definition of coverage (fraction of source NL constraints expressible and enforced), and reference to the error analysis section. revision: yes
-
Referee: [Method] Method section (generator-critic loop): no external formal verification step or non-LLM coverage oracle is described. Without such an oracle or inter-annotator agreement on held-out policy-NL pairs, any reported coverage gain cannot be isolated from possible LLM-mediated evaluation artifacts or distortions of the original specifications.
Authors: The generator-critic loop relies on LLM self-critique rather than an external oracle, which is a methodological choice we will now explicitly discuss as a limitation. We will add a new subsection describing how coverage was measured (via systematic mapping of NL statements to policy statements) and report inter-annotator agreement on a held-out sample of 50 policy-NL pairs to help isolate artifacts. A fully independent non-LLM oracle for open-ended NL policies is not currently feasible and is not claimed. revision: partial
Circularity Check
No circularity: empirical benchmark comparison only
full rationale
The paper describes an LLM-based generator-critic pipeline for translating natural-language agent instructions into Cedar policies and reports an empirical coverage comparison on the external MedAgentBench benchmark against prior hand-coded baselines. No equations, fitted parameters, predictions derived from inputs, uniqueness theorems, or self-citations appear in the load-bearing claims. The result is a direct measurement of coverage rather than any derivation that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Agentguardian: Learning access control policies to govern ai agent behavior, 2026
Abaev, N., Klimov, D., Levinov, G., Mimran, D., Elovici, Y., and Shabtai, A. Agentguardian: Learning access control policies to govern ai agent behavior, 2026. URL https://arxiv.org/abs/2601.10440
arXiv 2026
-
[2]
Writing policies in natural language
Amazon Web Services . Writing policies in natural language. AWS Bedrock AgentCore Documentation, 2025. URL https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-natural-language.html. Accessed: 2026-02-01
2025
-
[3]
and Katoen, J.-P
Baier, C. and Katoen, J.-P. Principles of Model Checking. The MIT Press, Cambridge, MA, USA, 2008. ISBN 9780262026499
2008
-
[4]
Cutler, J. W., Disselkoen, C., Eline, A., He, S., Headley, K., Hicks, M., Hietala, K., Ioannidis, E., Kastner, J., Mamat, A., McAdams, D., McCutchen, M., Rungta, N., Torlak, E., and Wells, A. Cedar: A new language for expressive, fast, safe, and analyzable authorization (extended version), 2024. URL https://arxiv.org/abs/2403.04651
arXiv 2024
-
[5]
de Moura, L. and Ullrich, S. The lean 4 theorem prover and programming language. In Automated Deduction -- CADE 28, volume 12699 of Lecture Notes in Computer Science, pp.\ 625--635. Springer, Cham, 2021. doi:10.1007/978-3-030-79876-5_37. URL https://link.springer.com/chapter/10.1007/978-3-030-79876-5_37
-
[6]
Hong, Y., She, Y., Kang, E., Timperley, C. S., and Kästner, C. Symbolic guardrails for domain-specific agents: Stronger safety and security guarantees without sacrificing utility, 2026. URL https://arxiv.org/abs/2604.15579
Pith/arXiv arXiv 2026
-
[7]
Llama guard: LLM -based input-output safeguard for human-ai conversations, 2023
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama guard: LLM -based input-output safeguard for human-ai conversations, 2023. URL https://arxiv.org/abs/2312.06674
Pith/arXiv arXiv 2023
-
[8]
C., Geng, G., Park, D., Zou, J., Ng, A
Jiang, Y., Black, K. C., Geng, G., Park, D., Zou, J., Ng, A. Y., and Chen, J. H. MedAgentBench : A virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2 0 (9): 0 AIdbp2500144, 2025
2025
-
[9]
Llms can't plan, but can help planning in llm-modulo frameworks, 2024
Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., and Murthy, A. Llms can't plan, but can help planning in llm-modulo frameworks, 2024. URL https://arxiv.org/abs/2402.01817
arXiv 2024
-
[10]
and Smith, I
Kaoudis, K. and Smith, I. Policy language security comparison and threat model. Technical report, Trail of Bits, 2024. URL https://github.com/trailofbits/publications/blob/master/reports/Policy_Language_Security_Comparison_and_TM.pdf. Comparative security assessment of policy languages (Cedar, Rego, OpenFGA)
2024
-
[11]
Kholkar, G. and Ahuja, R. Policy-as-prompt: Turning ai governance rules into guardrails for ai agents, 2025. URL https://arxiv.org/abs/2509.23994
arXiv 2025
-
[12]
Contextual bandits with entropy-based human feedback, 2025
Seraj, R., Meng, L., and Sylvain, T. Contextual bandits with entropy-based human feedback, 2025. URL https://arxiv.org/abs/2502.08759
arXiv 2025
-
[13]
Tsai, L. and Bagdasarian, E. Contextual agent security: A policy for every purpose, 2025. URL https://arxiv.org/abs/2501.17070
arXiv 2025
-
[14]
Autoformalization in the era of large language models: A survey, 2025
Weng, K., Du, L., Li, S., Lu, W., Sun, H., Liu, H., and Zhang, T. Autoformalization in the era of large language models: A survey, 2025. URL https://arxiv.org/abs/2505.23486
arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.