
arXiv: 2606.32004 · v1 · pith:ASBLNU64new · submitted 2026-06-30 · 💻 cs.AI · cs.LG · cs.LO · cs.SC

PolicyGuard: From Organizational Policies to Neuro-Symbolic Compliance Review Engines

Pith reviewed 2026-07-01 05:02 UTC · model grok-4.3

classification 💻 cs.AI cs.LG cs.LO cs.SC
keywords neuro-symbolic framework, policy compliance, document review, logic rules, large language models, NDA compliance, symbolic evaluation

The pith

PolicyGuard converts organizational policies into typed logic rules and local extraction questions so LLMs and symbolic evaluation can check document compliance explicitly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PolicyGuard as a way to handle policy-grounded document review by turning policies into an executable engine of relational logic rules paired with atom-level questions. Large language models answer only the local questions using retrieved document passages while a symbolic component applies the rules to decide compliance. This separation replaces opaque end-to-end prompting with steps that can be inspected, revised, and tested independently. The approach is demonstrated on company-specific NDA contract reviews where clauses are checked against negotiation policies.

Core claim

PolicyGuard converts organizational policy guidance into an executable review engine consisting of typed relational logic rules and atom-level extraction questions. During review, LLMs answer these local questions using retrieved document evidence, and a symbolic evaluator applies the formal rules to detect non-compliance. By separating policy formalization, local document interpretation, and symbolic compliance evaluation, the framework makes document review more explicit, maintainable, and systematically testable.

What carries the argument

The executable review engine built from typed relational logic rules and atom-level extraction questions, with LLMs limited to answering the questions and a symbolic evaluator applying the rules.
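The division of labour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Atom` and `Rule` classes, the three example atoms, the two NDA rules, and the canned answers standing in for the LLM and retrieval step are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

AtomValue = Union[bool, int]


@dataclass(frozen=True)
class Atom:
    name: str        # hypothetical atom identifier
    question: str    # local extraction question an LLM would answer
    type_: type      # expected answer type (bool or int in this sketch)


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    reads: Tuple[str, ...]                            # atoms the rule depends on
    violated: Callable[[Dict[str, AtomValue]], bool]  # True means non-compliant


# Illustrative policy content; none of it is taken from the paper.
ATOMS = [
    Atom("has_fixed_term", "Does the NDA state a fixed confidentiality term?", bool),
    Atom("term_years", "How many years does the confidentiality term last?", int),
    Atom("is_mutual", "Are the confidentiality obligations mutual?", bool),
]

RULES = [
    Rule("R1", "Confidentiality term must be fixed and at most 3 years",
         ("has_fixed_term", "term_years"),
         lambda a: (not a["has_fixed_term"]) or a["term_years"] > 3),
    Rule("R2", "Confidentiality obligations must be mutual",
         ("is_mutual",),
         lambda a: not a["is_mutual"]),
]


def ground_atoms(atoms: List[Atom],
                 answer_fn: Callable[[Atom], AtomValue]) -> Dict[str, AtomValue]:
    """Ask one local question per atom and enforce its declared type."""
    values: Dict[str, AtomValue] = {}
    for atom in atoms:
        value = answer_fn(atom)
        # bool is a subclass of int in Python, so compare exact types.
        if type(value) is not atom.type_:
            raise TypeError(f"{atom.name}: expected {atom.type_.__name__}")
        values[atom.name] = value
    return values


def evaluate(rules: List[Rule], values: Dict[str, AtomValue]) -> List[dict]:
    """Return one report entry per violated rule, with the atoms that support it."""
    return [
        {"rule": r.rule_id, "why": r.description,
         "evidence": {name: values[name] for name in r.reads}}
        for r in rules if r.violated(values)
    ]


# Stand-in for the LLM plus retrieval step: canned answers for one contract.
canned = {"has_fixed_term": True, "term_years": 5, "is_mutual": True}
values = ground_atoms(ATOMS, lambda atom: canned[atom.name])
report = evaluate(RULES, values)
```

In this sketch the language model never sees the rules. It only fills in `values`, and every entry in `report` names the rule that fired and the atom values behind it.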

If this is right

  • Compliance decisions become inspectable by tracing which rules fired and which LLM answers supported them.
  • Policy changes require only editing the logic rules rather than retraining or rewriting prompts.
  • Individual components can be tested and debugged separately instead of only evaluating final outputs.
  • The same engine structure can be reused across different organizational policies by swapping the rule set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be applied to other high-stakes document checks such as regulatory filings or internal audit reports.
  • Explicit rules might allow organizations to maintain version control over their compliance logic in the same way they version code.
  • Combining the engine with human oversight loops could let reviewers focus only on disputed LLM answers rather than reading entire documents.

Load-bearing premise

Large language models can reliably and correctly answer the atom-level extraction questions when given retrieved document evidence.

What would settle it

A set of NDA documents where the LLM returns incorrect answers to one or more extraction questions, causing the symbolic evaluator to reach a compliance verdict that differs from expert human review of the same clauses.
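One way to run such a check is to compare engine verdicts with expert verdicts and, for each disagreement, list the atoms whose answer would change the verdict if it were wrong. The sketch below assumes boolean atoms only; the policy in `is_compliant`, the atom names, and the two contracts are hypothetical and serve only to show the bookkeeping.

```python
from typing import Callable, Dict, List, Tuple

Atoms = Dict[str, bool]


def is_compliant(atoms: Atoms) -> bool:
    """Hypothetical symbolic policy: mutual obligations and no perpetual term."""
    return atoms["is_mutual"] and not atoms["perpetual_term"]


def verdict_flipping_atoms(atoms: Atoms,
                           evaluator: Callable[[Atoms], bool]) -> List[str]:
    """Atoms whose single flipped answer would change the engine's verdict."""
    base = evaluator(atoms)
    flipping = []
    for name in atoms:
        perturbed = dict(atoms)
        perturbed[name] = not perturbed[name]
        if evaluator(perturbed) != base:
            flipping.append(name)
    return flipping


def disagreements(cases: List[Tuple[str, Atoms, bool]],
                  evaluator: Callable[[Atoms], bool]) -> List[Tuple[str, List[str]]]:
    """For each contract where engine and expert differ, list suspect atoms."""
    suspects = []
    for contract_id, llm_atoms, expert_compliant in cases:
        if evaluator(llm_atoms) != expert_compliant:
            suspects.append((contract_id,
                             verdict_flipping_atoms(llm_atoms, evaluator)))
    return suspects


# (contract id, LLM answers, expert verdict); all values are made up.
cases = [
    ("contract-A",
     {"is_mutual": True, "perpetual_term": False, "has_signature_block": True},
     False),
    ("contract-B",
     {"is_mutual": False, "perpetual_term": False, "has_signature_block": True},
     False),
]

suspects = disagreements(cases, is_compliant)
```

A disagreement whose suspect atoms turn out to be answered correctly points at the rules rather than the extraction step, so the same harness helps separate the two failure sources.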

Figures

Figures reproduced from arXiv: 2606.32004 by Amar Prakash Azad, Ayush Singh, Sameer Malik.

Figure 1: Overview of PolicyGuard. Given organization-specific policy guidance and a target document, PolicyGuard constructs a review engine that formalizes policy conditions as typed rules and atom-level extraction questions. At review time, LLMs ground the required atoms in document evidence, while a symbolic evaluator applies the executable rules to produce an auditable compliance report.
Figure 2: Overview of PolicyGuard. The figure shows the two-stage workflow: (A) offline construction of a policy-specific review engine from company policy guidance; and (B) online NDA review, where retrieved contract clauses are grounded into atom-level truth values and evaluated by the symbolic rule engine to produce an auditable risk report.
Figure 3: Decision reliability over 10 repeated runs at temperature = 0. (Left) pass^k curves: zero-shot baselines drop 7–9 pp from k = 1 to k = 10, while PolicyGuard drops ∼1 pp. (Right) pass^1 vs. pass^10 shows PolicyGuard's substantially higher consistency for enterprise non-compliance review.
Figure 4: Per-contract reliability under ten repeated inference runs at temperature zero. pass^1 measures single-run correctness, while pass^10 requires all ten runs for a policy–contract pair to be correct. Zero-shot baselines show larger reliability drops across contracts, whereas PolicyGuard remains nearly stable, with the largest drop on NDA-3.
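The pass^k quantity in the Figure 3 and Figure 4 captions can be made concrete with a short sketch. It assumes the combinatorial estimator used in the τ-bench benchmark, where a pair with c correct runs out of n scores C(c, k) / C(n, k); the captions do not state which estimator the paper uses, and the function names and run outcomes below are hypothetical.

```python
from math import comb
from typing import Dict, List


def pass_hat_k(run_outcomes: List[bool], k: int) -> float:
    """Chance that k runs drawn without replacement from the n recorded runs are all correct."""
    n = len(run_outcomes)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of runs")
    correct = sum(run_outcomes)
    # comb(correct, k) is 0 when fewer than k runs were correct.
    return comb(correct, k) / comb(n, k)


def mean_pass_hat_k(outcomes_by_pair: Dict[str, List[bool]], k: int) -> float:
    """Average pass^k over policy-contract pairs."""
    scores = [pass_hat_k(runs, k) for runs in outcomes_by_pair.values()]
    return sum(scores) / len(scores)


# Made-up outcomes for two policy-contract pairs, ten runs each.
outcomes = {
    "pair-1": [True] * 10,            # correct in every run
    "pair-2": [True] * 9 + [False],   # one incorrect run
}

pass_1 = mean_pass_hat_k(outcomes, k=1)
pass_10 = mean_pass_hat_k(outcomes, k=10)
```

With ten recorded runs, k = 10 reduces to the all-runs-correct criterion in the Figure 4 caption: a single incorrect run sends that pair's pass^10 to zero while its pass^1 stays at 0.9.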
Original abstract

Policy-grounded document review requires determining whether a target document complies with organization-specific policies, guidelines, or playbooks. While large language models can assist with policy interpretation and document analysis, end-to-end prompting leaves the applied policy logic implicit, making compliance decisions difficult to inspect, update, and test. We present PolicyGuard, a neuro-symbolic framework for policy-grounded document compliance review. PolicyGuard converts organizational policy guidance into an executable review engine consisting of typed relational logic rules and atom-level extraction questions. During review, LLMs answer these local questions using retrieved document evidence, and a symbolic evaluator applies the formal rules to detect non-compliance. We instantiate and evaluate PolicyGuard on company-specific NDA compliance review, where contract clauses must be checked against organization-specific negotiation policies. By separating policy formalization, local document interpretation, and symbolic compliance evaluation, PolicyGuard makes document review more explicit, maintainable, and systematically testable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents PolicyGuard, a neuro-symbolic framework that converts organizational policy guidance into an executable review engine of typed relational logic rules and atom-level extraction questions. LLMs answer the local questions using retrieved document evidence, after which a symbolic evaluator applies the formal rules to detect non-compliance. The framework is instantiated on company-specific NDA compliance review, with the central claim that separating policy formalization, local document interpretation, and symbolic compliance evaluation makes document review more explicit, maintainable, and systematically testable.

Significance. If the empirical premise holds, the explicit separation of concerns would address the opacity of end-to-end LLM prompting for policy-grounded review, enabling more inspectable, updatable, and testable compliance engines in organizational settings. The neuro-symbolic design is a clear strength in principle for domains requiring auditability.

major comments (1)
  1. [Abstract] Abstract and high-level description: the claim that PolicyGuard makes review 'systematically testable' depends on reliable LLM answers to atom-level extraction questions, yet the manuscript supplies no quantitative results on extraction accuracy, inter-annotator agreement with human experts, or error analysis on the NDA task, leaving the load-bearing assumption unverified.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this important empirical gap. The point is well-taken and directly affects the strength of our central claim.

Point-by-point responses
  1. Referee: [Abstract] Abstract and high-level description: the claim that PolicyGuard makes review 'systematically testable' depends on reliable LLM answers to atom-level extraction questions, yet the manuscript supplies no quantitative results on extraction accuracy, inter-annotator agreement with human experts, or error analysis on the NDA task, leaving the load-bearing assumption unverified.

    Authors: We agree that the manuscript does not currently report quantitative metrics on the accuracy of the atom-level LLM extractions, inter-annotator agreement, or a dedicated error analysis for the NDA task. While the end-to-end compliance results provide indirect support, they do not isolate extraction reliability, which is indeed required to substantiate the claim that the framework enables systematic testing. In the revised manuscript we will add a new subsection under Evaluation that reports (i) precision, recall, and F1 of the LLM answers against human gold annotations on a held-out set of NDA clauses, (ii) inter-annotator agreement statistics, and (iii) a qualitative error analysis categorizing failure modes. This will directly verify the assumption underlying the testability claim.

    Revision: yes
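The metrics promised in items (i) and (ii) are straightforward to compute once atom-level gold labels exist. The sketch below is one possible setup, assuming boolean atom answers with True as the positive class and Cohen's kappa as the agreement statistic; the rebuttal does not say which agreement measure will be reported, and all labels shown are invented.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[bool], pred: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for boolean atom answers (True is the positive class)."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def cohens_kappa(a: List[bool], b: List[bool]) -> float:
    """Chance-corrected agreement between two annotators on boolean labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a, rate_b = sum(a) / n, sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


# Invented labels for eight atom-level questions.
annotator_1 = [True, True, True, False, False, False, True, False]   # treated as gold
annotator_2 = [True, True, True, False, False, False, False, False]
llm_answers = [True, True, False, False, False, True, True, False]

precision, recall, f1 = precision_recall_f1(annotator_1, llm_answers)
kappa = cohens_kappa(annotator_1, annotator_2)
```

Reporting these per atom rather than pooled would show which extraction questions are unreliable, which is the level at which the framework claims to be testable.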

Circularity Check

0 steps flagged

No circularity; framework description contains no derivations or self-referential reductions

full rationale

The paper describes a neuro-symbolic framework separating policy formalization, local LLM-based interpretation via atom-level questions, and symbolic rule evaluation. No equations, parameters, or derivations are present in the provided text. The central claim that the separation makes review 'more explicit, maintainable, and systematically testable' is a descriptive assertion about system design, not a mathematical result that reduces to its inputs by construction. LLM reliability on extraction questions is an unverified empirical premise (as noted in the referee report), but this is a correctness risk rather than circularity. No self-citations, ansatzes, or renamings of known results are invoked in a load-bearing way. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the untested domain assumption that LLMs will correctly answer the local extraction questions and that organizational policies can be losslessly encoded as typed relational logic rules.

axioms (2)
  • domain assumption: LLMs can reliably answer atom-level extraction questions using retrieved document evidence.
    This is required for the local interpretation stage to produce usable inputs to the symbolic evaluator.
  • domain assumption: Organizational policies can be converted into typed relational logic rules without significant loss of intended meaning.
    This is required for the policy formalization stage to produce an executable engine that matches human intent.
invented entities (1)
  • PolicyGuard review engine (no independent evidence)
    purpose: Executable combination of logic rules and extraction questions for compliance checking
    New system proposed by the authors; no independent evidence supplied in the abstract.


