arxiv: 2606.25622 · v1 · pith:XUMG72KJnew · submitted 2026-06-24 · 💻 cs.CR · cs.AI

Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz

Pith reviewed 2026-06-25 20:39 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords multi-agent systems · IT-Grundschutz · automated audits · HybridRAG · compliance automation · LLM evaluation · risk management

The pith

A multi-agent system with hybrid retrieval automates semantic steps in German IT-Grundschutz audits but struggles in deterministic logical phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a multi-agent system combined with HybridRAG to partially automate the resource-intensive German IT-Grundschutz certification required under NIS-2. It adds a Hypothesis-Verification Loop that cross-checks agent inferences against a knowledge graph and a Decoupled Reasoning Pipeline that keeps semantic extraction separate from deterministic inheritance rules. Tested end-to-end on the BSI RecPlast GmbH reference case, the system delivers high precision and recall in structural analysis and modeling yet shows clear shortfalls in protection needs assessment and the final IT-GS check where probabilistic outputs clash with required deterministic logic.

Core claim

The multi-agent system with the Hypothesis-Verification Loop and Decoupled Reasoning Pipeline achieves high efficacy in semantic tasks of the IT-GS process by automating information extraction, but the probabilistic nature of LLMs limits its ability to meet the deterministic requirements in logical reasoning phases such as PNA and IT-GS Check.

What carries the argument

The Hypothesis-Verification Loop in Structural Analysis that cross-references agent-inferred dependencies against the Knowledge Graph, together with the Decoupled Reasoning Pipeline that separates semantic extraction from deterministic protection-need inheritance.
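
The cross-referencing step can be pictured with a minimal sketch. The paper's implementation is not shown on this page, so everything below is an assumption for illustration: the function name `hypothesis_verification_loop`, the representation of the knowledge graph as a set of `(source, target)` edges, the feedback of rejected hypotheses to the agent, and the stub agent with its toy data.

```python
def hypothesis_verification_loop(propose, graph_edges, max_rounds=3):
    """Accept only agent-proposed dependencies that exist in the knowledge graph.

    propose: callable taking the list of previously rejected hypotheses and
             returning a new list of (source, target) hypotheses (hypothetical API).
    graph_edges: set of (source, target) pairs extracted from the knowledge graph.
    """
    accepted = set()
    rejected = []
    for _ in range(max_rounds):
        hypotheses = propose(rejected)
        # Hypotheses without a matching graph edge go back to the agent.
        rejected = [h for h in hypotheses if h not in graph_edges]
        accepted |= {h for h in hypotheses if h in graph_edges}
        if not rejected:
            break
    return accepted, rejected


def stub_agent(rejected):
    """Stand-in for an LLM agent; drops its unsupported guess after feedback."""
    if not rejected:
        return [("ERP", "Server-1"), ("ERP", "Server-9")]
    return [("ERP", "Server-1")]


# Toy graph, not taken from the RecPlast case study.
graph_edges = {("ERP", "Server-1"), ("Mail", "Server-2")}
accepted, rejected = hypothesis_verification_loop(stub_agent, graph_edges)
```

In this sketch the graph acts as the sole arbiter: a dependency the agent cannot ground in an existing edge never reaches the accepted set, which is one plausible way to read the claim about reduced hallucinations.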

If this is right

  • Manual effort drops sharply in structural analysis and modeling through automated extraction of dependencies and assets.
  • Quantitative metrics show the architecture meets expert-level precision and recall in semantic tasks but falls short in phases that demand strict logical inheritance.
  • The two added mechanisms (Hypothesis-Verification Loop and Decoupled Reasoning Pipeline) are presented as necessary to enforce compliance rigor inside an otherwise probabilistic agent system.
  • Current LLMs cannot yet replace human auditors for the full deterministic compliance verification required by IT-GS.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The semantic-versus-logic split observed here could be tested on other deterministic audit frameworks such as ISO 27001 to check whether the same pattern appears.
  • Hybrid systems that route logical inheritance steps to symbolic solvers rather than LLMs may close the performance gap left open by the current design.
  • Broader validation across multiple company sizes and sectors would be needed before claiming the reported effort reductions apply generally.

Load-bearing premise

The single BSI RecPlast GmbH case study supplies a representative reference dataset sufficient to measure performance across every phase of the IT-GS process.

What would settle it

Running the identical architecture on two or more additional independent BSI-provided IT-GS case studies and checking whether the performance gap between semantic phases and logical phases remains consistent.

Figures

Figures reproduced from arXiv: 2606.25622 by Lea Roxanne Muth, Marian Margraf.

Figure 1: Process of the IT-Grundschutz Certification.

Abstract

The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standards such as the German IT-Grundschutz (IT-GS) of the Federal Office for Information Security. However, IT-GS certification is resource-intensive and requires a high level of manual effort for documentation, validation, and revision, making scalable implementation difficult and expensive. Building upon our previous conceptual framework, this paper presents the technical implementation and empirical evaluation of a Multi-Agent System (MAS) architecture combined with Hybrid Retrieval Augmented Generation (HybridRAG) for the partial automation of IT-GS certification. We introduce two novel technical contributions to the MAS architecture to enforce compliance rigor: a Hypothesis-Verification Loop in the Structural Analysis (SA) phase that cross-references agent-inferred dependencies against the Knowledge Graph to reduce hallucinations, and a Decoupled Reasoning Pipeline that separates agent-driven semantic extraction from the deterministic protection need inheritance. We utilize the BSI's "RecPlast GmbH" case study as a human expert-generated reference data set for end-to-end evaluation of the architecture and to quantify Precision, Recall, and F1-scores. The performance of the system is investigated across the phases of SA, Protection Needs Assessment (PNA), Modeling, and IT-GS Check. The empirical results reveal noticeable differences across the steps of IT-GS. While the MAS demonstrates high efficacy in semantic tasks (SA and Modeling), significantly reducing manual effort through automated information extraction, quantitative results reveal limitations in logical reasoning phases (PNA and IT-GS Check) as the probabilistic nature of current LLMs struggles to meet the deterministic rigor required by IT-GS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the technical implementation of a Multi-Agent System (MAS) with HybridRAG for partial automation of German IT-Grundschutz (IT-GS) certification under NIS-2. It introduces a Hypothesis-Verification Loop for the Structural Analysis (SA) phase and a Decoupled Reasoning Pipeline separating semantic extraction from deterministic protection need inheritance. The system is evaluated end-to-end against the BSI-provided 'RecPlast GmbH' expert-generated reference dataset, reporting precision, recall, and F1 scores across SA, Protection Needs Assessment (PNA), Modeling, and IT-GS Check phases. Results indicate high performance on semantic tasks but limitations on logical reasoning phases, attributed to the probabilistic nature of LLMs versus the deterministic requirements of IT-GS.

Significance. If the phase-specific performance differences generalize, the work supplies quantitative evidence on where LLM-based agents can reduce manual effort in compliance documentation (semantic extraction) versus where they fall short (logical reasoning under deterministic standards). The use of an external BSI case study as ground truth supplies independent grounding for the metrics and builds directly on the authors' prior conceptual framework.

major comments (2)
  1. Evaluation section (results across phases, single 'RecPlast GmbH' reference): The central claim that 'quantitative results reveal limitations in logical reasoning phases (PNA and IT-GS Check) as the probabilistic nature of current LLMs struggles to meet the deterministic rigor required by IT-GS' rests on performance numbers from only one case study. No error bars, multiple independent runs, sensitivity analysis, or cross-scenario replication are reported, so the observed gaps cannot be distinguished from case-specific factors such as documentation style, knowledge-graph coverage, or annotation conventions.
  2. Methods section describing the Decoupled Reasoning Pipeline: The separation of agent-driven semantic extraction from deterministic protection-need inheritance is presented as a novel contribution, yet the manuscript provides no pseudocode, formal specification, or ablation showing that the deterministic component is strictly enforced rather than still relying on LLM output. This detail is load-bearing for the claim that the architecture enforces compliance rigor.
minor comments (2)
  1. Define all phase abbreviations (SA, PNA, IT-GS) at first use in the main text rather than assuming reader familiarity from the abstract.
  2. Add a summary table of precision/recall/F1 scores by phase to improve readability of the quantitative results.
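
The first major comment asks for run-to-run variability. As a hedged illustration of what that reporting could look like, the sketch below aggregates F1 scores from repeated runs into a mean and a sample standard deviation per phase. The helper name `summarize_runs` is hypothetical, and the numbers are placeholders chosen for arithmetic convenience; they are not results from the paper, which reports no repeated runs.

```python
import statistics


def summarize_runs(f1_by_phase):
    """Return {phase: (mean F1, sample standard deviation)} over repeated runs."""
    summary = {}
    for phase, scores in f1_by_phase.items():
        mean = statistics.mean(scores)
        # A standard deviation needs at least two runs; report 0.0 otherwise.
        spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
        summary[phase] = (mean, spread)
    return summary


# Placeholder values only, NOT measurements from the paper.
placeholder_runs = {
    "SA": [0.9, 0.9, 0.9],
    "PNA": [0.4, 0.5, 0.6],
}
summary = summarize_runs(placeholder_runs)
for phase, (mean, spread) in summary.items():
    print(f"{phase}: F1 = {mean:.2f} +/- {spread:.2f}")
```

A per-phase spread of this kind would let a reader judge whether a gap between semantic and logical phases exceeds the sampling noise of the LLM itself.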

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation and methods. We address each major comment below and indicate the planned revisions.

  1. Referee: Evaluation section (results across phases, single 'RecPlast GmbH' reference): The central claim that 'quantitative results reveal limitations in logical reasoning phases (PNA and IT-GS Check) as the probabilistic nature of current LLMs struggles to meet the deterministic rigor required by IT-GS' rests on performance numbers from only one case study. No error bars, multiple independent runs, sensitivity analysis, or cross-scenario replication are reported, so the observed gaps cannot be distinguished from case-specific factors such as documentation style, knowledge-graph coverage, or annotation conventions.

    Authors: We agree that the reported phase differences rest on a single BSI-provided case study and that the lack of error bars, multiple runs, or cross-scenario replication prevents strong claims of generalizability. The 'RecPlast GmbH' reference remains valuable as an independent expert-generated dataset, but we will revise the Evaluation section to qualify the central claim, add an explicit limitations paragraph on threats to validity (including case-specific factors), and outline future multi-case validation. No new statistical analyses or runs will be added, as they were not performed.
    Revision: partial

  2. Referee: Methods section describing the Decoupled Reasoning Pipeline: The separation of agent-driven semantic extraction from deterministic protection-need inheritance is presented as a novel contribution, yet the manuscript provides no pseudocode, formal specification, or ablation showing that the deterministic component is strictly enforced rather than still relying on LLM output. This detail is load-bearing for the claim that the architecture enforces compliance rigor.

    Authors: The Decoupled Reasoning Pipeline is implemented such that semantic extraction uses agents while protection-need inheritance applies deterministic, rule-based logic directly to the extracted entities, bypassing LLM generation entirely. To strengthen this, we will add pseudocode and a formal specification of the pipeline in the revised Methods section, explicitly documenting the separation and confirming the deterministic step does not invoke the LLM. An ablation isolating the component is noted as desirable but may exceed space constraints.
    Revision: yes
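
Until the promised pseudocode appears, a minimal sketch can show what a strictly deterministic inheritance step might look like. It assumes the maximum principle from the IT-Grundschutz methodology (an IT system inherits, per protection goal, the highest need of the applications it supports) and deliberately ignores cumulation and distribution effects. The names `Need`, `GOALS`, and `inherit_needs`, and the toy data, are hypothetical and not taken from the paper.

```python
from enum import IntEnum


class Need(IntEnum):
    """Protection need categories, ordered so that max() picks the highest."""
    NORMAL = 1
    HIGH = 2
    VERY_HIGH = 3


GOALS = ("confidentiality", "integrity", "availability")


def inherit_needs(app_needs, runs_on):
    """Derive system-level protection needs without any LLM call.

    app_needs: {application: {goal: Need}}, e.g. produced by semantic extraction.
    runs_on:   {system: [applications]}; each system needs at least one application.
    """
    return {
        system: {goal: max(app_needs[app][goal] for app in apps) for goal in GOALS}
        for system, apps in runs_on.items()
    }


# Toy input, not taken from the RecPlast case study.
app_needs = {
    "ERP": {"confidentiality": Need.HIGH, "integrity": Need.NORMAL, "availability": Need.NORMAL},
    "Mail": {"confidentiality": Need.NORMAL, "integrity": Need.NORMAL, "availability": Need.VERY_HIGH},
}
runs_on = {"Server-1": ["ERP", "Mail"]}
system_needs = inherit_needs(app_needs, runs_on)
```

Because the function is pure and contains no model call, identical inputs always yield identical outputs; any remaining error in this phase would then have to originate in the upstream extraction, which is the distinction the referee asks the authors to demonstrate.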

Circularity Check

0 steps flagged

No significant circularity; evaluation grounded in external BSI case study

The paper's central empirical claims rest on direct computation of precision, recall, and F1-scores against the external BSI-provided 'RecPlast GmbH' human-expert reference dataset across SA, PNA, Modeling, and IT-GS Check phases. The reference to a prior conceptual framework supplies architectural context but does not define or force the reported performance numbers. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes reduce any result to its own inputs by construction. The derivation chain is therefore self-contained against an independent external benchmark.
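
For readers unfamiliar with the metrics, the computation against a reference set can be sketched as follows. The matching criterion is an assumption: items count as correct only on exact match with the expert reference, whereas the paper may use a different rule. The function name and the toy item labels are hypothetical.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 under exact-match comparison."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy example: 2 of 4 predicted items are correct, 2 of 3 reference items are found.
precision, recall, f1 = precision_recall_f1({"A", "B", "C", "D"}, {"A", "B", "E"})
```

Here precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.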

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that the single RecPlast case study is representative ground truth and that F1 scores on semantic versus logical phases generalize beyond this instance.

axioms (1)
  • Domain assumption: The BSI RecPlast GmbH case study serves as valid and representative ground truth for end-to-end evaluation of the MAS across SA, PNA, Modeling, and IT-GS Check phases.
    Used as the reference dataset to compute precision, recall, and F1 scores.
