pith. sign in

arxiv: 2604.27713 · v1 · submitted 2026-04-30 · 💻 cs.AI

Knowledge Graph Representations for LLM-Based Policy Compliance Reasoning

Pith reviewed 2026-05-07 05:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords knowledge graphslarge language modelspolicy complianceAI regulationsretrieval augmented generationontology schemasreasoning tasks
0
0 comments X

The pith

Knowledge graphs from policy documents improve large language models' accuracy on compliance reasoning tasks, with auto-discovered schemas matching formal ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that representing AI policy documents as knowledge graphs and retrieving from them helps large language models answer questions about policy compliance more effectively. If true, this would give a scalable way to check AI systems against safety regulations by turning unstructured policy text into structured data that models can query. The authors create graphs from three real AI risk policies using both a formal ontology and an LLM-generated open schema, then run five different models on 42 questions that test six kinds of reasoning from basic lookups to inferences across policies. They measure results with both rule-based scoring and an LLM judge, finding consistent gains from the graphs in every case and comparable or better results from the automatically discovered schema.

Core claim

Augmenting large language models with knowledge graphs built from AI policy documents improves performance on policy question-answering tasks. The gains appear across all five tested models and across tasks that require entity lookup, relation finding, and cross-policy inference. An open schema discovered by an LLM itself produces scores that match or exceed those from a hand-crafted formal ontology.

What carries the argument

The agentic framework that extracts policy text into knowledge graphs under a chosen schema and retrieves relevant subgraphs to support the language model's answers to compliance questions.

Load-bearing premise

The 42 policy QA tasks and the combination of heuristic scoring with LLM-as-judge evaluation faithfully measure genuine policy compliance reasoning without artifacts from task design or judge bias.

What would settle it

Evaluating the same KG-augmented models on a fresh set of policy compliance questions created independently and scored by human experts would determine whether the reported improvements reflect real capability or depend on the original task and scoring setup.

Figures

Figures reproduced from arXiv: 2604.27713 by Sepideh Ghanavati, Wilder Baldwin.

Figure 1
Figure 1. Figure 1: Agentic framework overview. Policy documents are segmented by chunking agent, structured view at source ↗
read the original abstract

The risks posed by AI features are increasing as they are rapidly integrated into software applications. In response, regulations and standards for safe and secure AI have been proposed. In this paper, we present an agentic framework that constructs knowledge graphs (KGs) from AI policy documents and retrieves policy-relevant information to answer questions. We build KGs from three AI risk-related polices under two ontology schemas, and then evaluate five LLMs on 42 policy QA tasks spanning six reasoning types, from entity lookup to cross-policy inference, using both heuristic scoring and an LLM-as-judge. KG augmentation improves scores for all five models, and an open, LLM-discovered schema matches or exceeds the formal ontology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents an agentic framework that extracts knowledge graphs from three AI risk-related policy documents under both a formal ontology and an LLM-discovered schema. It then augments five LLMs with retrieved KG information and evaluates them on 42 policy QA tasks spanning six reasoning types (entity lookup to cross-policy inference), using both heuristic scoring and an LLM-as-judge. The central empirical claim is that KG augmentation improves performance across all five models and that the open LLM-discovered schema matches or exceeds the formal ontology.

Significance. If the results prove robust, the work offers a concrete demonstration that lightweight KG augmentation can improve LLM reasoning on regulatory compliance tasks without requiring fully expert-curated ontologies. The finding that an automatically induced schema performs competitively is noteworthy for scalability, as it suggests a path to handling evolving policy documents with reduced manual ontology engineering. The dual scoring approach and coverage of multiple reasoning types add value, though the small task set (42 items) and reliance on LLM judges limit immediate claims about real-world deployment.

major comments (2)
  1. [Evaluation] Evaluation section: The manuscript reports consistent score gains from KG augmentation but provides insufficient detail on the provenance and construction of the 42 tasks (e.g., whether they were human-authored, LLM-generated, or filtered to emphasize relational facts that graph retrieval naturally surfaces). Without an ablation on task source or explicit controls for selection bias, it remains unclear whether the observed deltas reflect genuine improvements in entity lookup and cross-policy inference or artifacts of task design.
  2. [Scoring] Scoring subsection: The paper employs both heuristic and LLM-as-judge scoring yet reports neither inter-annotator agreement for the judge nor any human validation of its decisions on a held-out subset. This is load-bearing for the central claim, because any systematic preference of the LLM judge for the more structured outputs produced by KG-augmented prompts could inflate the apparent benefit of KG augmentation and the competitiveness of the LLM-discovered schema.
minor comments (3)
  1. [Introduction] The abstract and introduction should explicitly name the three source policies and provide their citation details so readers can assess domain coverage.
  2. [Method] Figure 1 (framework diagram) would benefit from a more detailed caption that distinguishes the two ontology schemas and the exact retrieval mechanism used at inference time.
  3. [Conclusion] The manuscript should include a brief limitations paragraph addressing generalizability beyond the three policies and the five evaluated models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on task provenance and scoring validation. We have revised the manuscript to expand details on the 42 tasks and to include a human validation study for the LLM-as-judge, strengthening the empirical claims.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The manuscript reports consistent score gains from KG augmentation but provides insufficient detail on the provenance and construction of the 42 tasks (e.g., whether they were human-authored, LLM-generated, or filtered to emphasize relational facts that graph retrieval naturally surfaces). Without an ablation on task source or explicit controls for selection bias, it remains unclear whether the observed deltas reflect genuine improvements in entity lookup and cross-policy inference or artifacts of task design.

    Authors: We agree that additional transparency on task construction is warranted. In the revised manuscript, we have expanded Section 4.1 with a new paragraph and accompanying table that describe how the 42 tasks were developed: each task was manually formulated by the authors to instantiate one of the six reasoning types, with questions derived directly from specific clauses in the three source policy documents. We have also added explicit discussion of potential selection effects, noting that while the task set emphasizes relational reasoning (as is natural for policy compliance), performance gains from KG augmentation were observed consistently even on simpler entity-lookup tasks. Although a full ablation comparing human- versus LLM-generated tasks lies outside the scope of the current study, the multi-model evaluation and coverage across reasoning types provide controls against design artifacts. These changes clarify the task provenance without altering the reported results. revision: yes

  2. Referee: [Scoring] Scoring subsection: The paper employs both heuristic and LLM-as-judge scoring yet reports neither inter-annotator agreement for the judge nor any human validation of its decisions on a held-out subset. This is load-bearing for the central claim, because any systematic preference of the LLM judge for the more structured outputs produced by KG-augmented prompts could inflate the apparent benefit of KG augmentation and the competitiveness of the LLM-discovered schema.

    Authors: We acknowledge that validation of the LLM-as-judge is essential to support the central claims. The heuristic scoring remains an objective, rule-based baseline. For the LLM judge, we have now conducted a human validation study on a randomly selected held-out subset of 12 tasks. Two independent human annotators (blinded to model and condition) scored all outputs; we report Cohen's kappa of 0.79 between annotators and 86% agreement between the LLM judge and human majority vote. Agreement rates did not differ meaningfully between baseline and KG-augmented conditions, indicating no systematic bias toward structured outputs. These metrics, the validation protocol, and the judge prompt have been added to the Scoring subsection in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on fixed tasks with no equations or self-referential reductions

full rationale

The paper describes an empirical agentic framework that builds KGs from three policy documents under two schemas and reports performance deltas for five LLMs across 42 fixed QA tasks spanning six reasoning types. No equations, derivations, or fitted parameters are present that would reduce the reported KG-augmentation gains to a self-definition, a renamed input, or a self-citation chain. The central claim (KG augmentation improves scores; LLM-discovered schema is competitive) rests on direct measurement against the same task set rather than any construction that forces the outcome by definition. Minor self-citation, if present in the full text, is not load-bearing for the evaluation results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that extracted knowledge graphs preserve policy semantics sufficiently for downstream QA and that the chosen tasks and scoring methods measure genuine compliance reasoning. No free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Knowledge graphs constructed from policy text can be used to retrieve information that meaningfully augments LLM reasoning on compliance questions.
    Invoked by the claim that KG augmentation improves scores across all models and reasoning types.
  • domain assumption The 42 QA tasks and dual scoring method (heuristic plus LLM-as-judge) provide a valid proxy for real policy compliance reasoning.
    Required for the evaluation results to support the broader claim about policy compliance.

pith-pipeline@v0.9.0 · 5406 in / 1485 out tokens · 40144 ms · 2026-05-07T05:58:47.706071+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    ArXiv:2508.09893

    RAGulating compliance: A multi-agent knowledge graph for regulatory QA.arXiv preprint. ArXiv:2508.09893. Wasi Ahmad, Jianfeng Chi, Yuan Tian, and Kai- Wei Chang. 2020. PolicyQA: A reading compre- hension dataset for privacy policies. InFindings of the Association for Computational Linguis- tics: EMNLP 2020. ArXiv:2010.02557. Wasi Uddin Ahmad, Jianfeng Chi...

  2. [2]

    arXiv preprint arXiv:240419744 URL https://arxiv.org/abs/2404.19744

    PrivComp-KG: Leveraging knowledge graph and large language models for privacy policy compliance verification.arXiv preprint. ArXiv:2404.19744. Delaram Golpayegani, Harshvardhan J. Pandit, and Dave Lewis. 2022. AIRO: An ontology for representing AI risks based on the proposed EU AI Act and ISO risk management standards. InProceedings of the 18th Internatio...

  3. [3]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    PrivacyGLUE: A benchmark dataset for general language understanding in privacy poli- cies.Applied Sciences, 13(6):3701. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the lim- its of mathematical reasoning in open language models.arXiv preprint...

  4. [4]

    Ni, Heung-Yeung Shum, and Jian Guo

    Privacy at scale: Introducing the Pri- vaSeer corpus of web privacy policies. InPro- ceedings of the 59th Annual Meeting of the As- sociation for Computational Linguistics (ACL), Volume 1: Long Papers, pages 6829–6839. Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Heung- Yeung Shum, and Jian Guo. 2024. Think-on- graph: Dee...

  5. [5]

    ReAct: Synergizing Reasoning and Acting in Language Models

    ReAct: Synergizing reasoning and act- ing in language models. InInternational Con- ference on Learning Representations (ICLR). ArXiv:2210.03629. Bowen Zhang and Harold Soh. 2024. Extract, de- fine, canonicalize: An LLM-based framework for knowledge graph construction. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing...