Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz
Pith reviewed 2026-06-25 20:39 UTC · model grok-4.3
The pith
A multi-agent system with hybrid retrieval automates semantic steps in German IT-Grundschutz audits but struggles in deterministic logical phases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The multi-agent system with the Hypothesis-Verification Loop and Decoupled Reasoning Pipeline achieves high efficacy in semantic tasks of the IT-GS process by automating information extraction, but the probabilistic nature of LLMs limits its ability to meet the deterministic requirements in logical reasoning phases such as PNA and IT-GS Check.
What carries the argument
The Hypothesis-Verification Loop in Structural Analysis that cross-references agent-inferred dependencies against the Knowledge Graph, together with the Decoupled Reasoning Pipeline that separates semantic extraction from deterministic protection-need inheritance.
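The verification step described above can be illustrated with a minimal sketch. It assumes that agent-inferred dependencies and Knowledge Graph edges can both be reduced to (source, target) pairs; the names `Dependency`, `verify_hypotheses`, and the sample identifiers are hypothetical and not taken from the paper, whose actual loop may be considerably richer (for example, feeding rejected hypotheses back to the agent).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """A dependency hypothesis proposed by an agent (hypothetical structure)."""
    source: str  # e.g. an application
    target: str  # e.g. the IT system it runs on


def verify_hypotheses(hypotheses, kg_edges):
    """Split hypotheses into those backed by a Knowledge Graph edge and the rest."""
    verified, rejected = [], []
    for h in hypotheses:
        # A hypothesis counts as verified only if the exact edge exists in the graph.
        if (h.source, h.target) in kg_edges:
            verified.append(h)
        else:
            rejected.append(h)
    return verified, rejected


# Illustrative data only; these identifiers are not from the RecPlast case study.
kg_edges = {("A1", "S1"), ("A2", "S1")}
hypotheses = [Dependency("A1", "S1"), Dependency("A3", "S2")]

verified, rejected = verify_hypotheses(hypotheses, kg_edges)
print("verified:", verified)
print("rejected:", rejected)
```

In a loop of this kind, the rejected list would presumably be returned to the agent for revision or discarded as a likely hallucination; which of the two the paper does is not stated here.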
If this is right
- Manual effort drops sharply in structural analysis and modeling through automated extraction of dependencies and assets.
- Quantitative metrics show the architecture meets expert-level precision and recall in semantic tasks but falls short in phases that demand strict logical inheritance.
- The two added mechanisms (Hypothesis-Verification Loop and Decoupled Reasoning Pipeline) are presented as necessary to enforce compliance rigor inside an otherwise probabilistic agent system.
- Current LLMs cannot yet replace human auditors for the full deterministic compliance verification required by IT-GS.
Where Pith is reading between the lines
- The semantic-versus-logic split observed here could be tested on other deterministic audit frameworks such as ISO 27001 to check whether the same pattern appears.
- Hybrid systems that route logical inheritance steps to symbolic solvers rather than LLMs may close the performance gap left open by the current design.
- Broader validation across multiple company sizes and sectors would be needed before claiming the reported effort reductions apply generally.
Load-bearing premise
The single BSI RecPlast GmbH case study supplies a representative reference dataset sufficient to measure performance across every phase of the IT-GS process.
What would settle it
Running the identical architecture on two or more additional independent BSI-provided IT-GS case studies and checking whether the performance gap between semantic phases and logical phases remains consistent.
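One way such a consistency check could be operationalized is sketched below, under stated assumptions: the gap is defined as mean F1 over the semantic phases minus mean F1 over the logical phases, and a threshold `min_gap` decides whether the gap counts as present. Both the definition and the threshold are assumptions of this sketch, and all numeric values are placeholders, not results from the paper.

```python
from statistics import mean

SEMANTIC_PHASES = ("SA", "Modeling")
LOGICAL_PHASES = ("PNA", "IT-GS Check")


def phase_gap(f1_by_phase):
    """Mean F1 of the semantic phases minus mean F1 of the logical phases."""
    semantic = mean(f1_by_phase[p] for p in SEMANTIC_PHASES)
    logical = mean(f1_by_phase[p] for p in LOGICAL_PHASES)
    return semantic - logical


def gap_is_consistent(cases, min_gap=0.1):
    """True if every case study shows a gap of at least min_gap (assumed threshold)."""
    return all(phase_gap(f1) >= min_gap for f1 in cases.values())


# Placeholder values for illustration only; NOT measurements from any case study.
cases = {
    "case_a": {"SA": 0.9, "Modeling": 0.8, "PNA": 0.5, "IT-GS Check": 0.4},
    "case_b": {"SA": 0.7, "Modeling": 0.7, "PNA": 0.7, "IT-GS Check": 0.7},
}

for name, f1 in cases.items():
    print(name, round(phase_gap(f1), 3))
print("consistent:", gap_is_consistent(cases))
```

With only two or three case studies, a simple threshold like this is a rough screen rather than a statistical test; repeated runs per case would be needed to attach uncertainty to each gap.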
Original abstract
The NIS-2 Directive mandates robust Risk Management from thousands of small and medium enterprises. To ensure compliance, companies rely on established standards such as the German IT-Grundschutz (IT-GS) of the Federal Office for Information Security. However, IT-GS certification is resource-intensive and requires a high level of manual effort for documentation, validation, and revision, making scalable implementation difficult and expensive. Building upon our previous conceptual framework, this paper presents the technical implementation and empirical evaluation of a Multi-Agent System (MAS) architecture combined with Hybrid Retrieval Augmented Generation (HybridRAG) for the partial automation of IT-GS certification. We introduce two novel technical contributions to the MAS architecture to enforce the compliance rigor. The Hypothesis-Verification Loop in the Structural Analysis (SA) phase that cross-references agent-inferred dependencies against the Knowledge Graph to reduce hallucinations, and a Decoupled Reasoning Pipeline that separates agent-driven semantic extraction from the deterministic protection need inheritance. We utilize the BSI's "RecPlast GmbH" case study as a human expert-generated reference data set for end-to-end evaluation of the architecture and to quantify Precision, Recall, and F1-scores. The performance of the system is investigated across the phases of SA, Protection Needs Assessment (PNA), Modeling, and IT-GS Check. The empirical results reveal noticeable differences throughout the different steps of IT-GS. While the MAS demonstrates high efficacy in semantic tasks (SA and Modeling), significantly reducing manual effort through automated information extraction, quantitative results reveal limitations in logical reasoning phases (PNA and IT-GS Check) as the probabilistic nature of current LLMs struggles to meet the deterministic rigor required by IT-GS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the technical implementation of a Multi-Agent System (MAS) with HybridRAG for partial automation of German IT-Grundschutz (IT-GS) certification under NIS-2. It introduces a Hypothesis-Verification Loop for the Structural Analysis (SA) phase and a Decoupled Reasoning Pipeline separating semantic extraction from deterministic protection need inheritance. The system is evaluated end-to-end against the BSI-provided 'RecPlast GmbH' expert-generated reference dataset, reporting precision, recall, and F1 scores across SA, Protection Needs Assessment (PNA), Modeling, and IT-GS Check phases. Results indicate high performance on semantic tasks but limitations on logical reasoning phases, attributed to the probabilistic nature of LLMs versus the deterministic requirements of IT-GS.
Significance. If the phase-specific performance differences generalize, the work supplies quantitative evidence on where LLM-based agents can reduce manual effort in compliance documentation (semantic extraction) versus where they fall short (logical reasoning under deterministic standards). The use of an external BSI case study as ground truth supplies independent grounding for the metrics and builds directly on the authors' prior conceptual framework.
Major comments (2)
- Evaluation section (results across phases, single 'RecPlast GmbH' reference): The central claim that 'quantitative results reveal limitations in logical reasoning phases (PNA and IT-GS Check) as the probabilistic nature of current LLMs struggles to meet the deterministic rigor required by IT-GS' rests on performance numbers from only one case study. No error bars, multiple independent runs, sensitivity analysis, or cross-scenario replication are reported, so the observed gaps cannot be distinguished from case-specific factors such as documentation style, knowledge-graph coverage, or annotation conventions.
- Methods section describing the Decoupled Reasoning Pipeline: The separation of agent-driven semantic extraction from deterministic protection-need inheritance is presented as a novel contribution, yet the manuscript provides no pseudocode, formal specification, or ablation showing that the deterministic component is strictly enforced rather than still relying on LLM output. This detail is load-bearing for the claim that the architecture enforces compliance rigor.
Minor comments (2)
- Define all phase abbreviations (SA, PNA, IT-GS) at first use in the main text rather than assuming reader familiarity from the abstract.
- Add a summary table of precision/recall/F1 scores by phase to improve readability of the quantitative results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation and methods. We address each major comment below and indicate the planned revisions.
- Referee: Evaluation section (results across phases, single 'RecPlast GmbH' reference): The central claim that 'quantitative results reveal limitations in logical reasoning phases (PNA and IT-GS Check) as the probabilistic nature of current LLMs struggles to meet the deterministic rigor required by IT-GS' rests on performance numbers from only one case study. No error bars, multiple independent runs, sensitivity analysis, or cross-scenario replication are reported, so the observed gaps cannot be distinguished from case-specific factors such as documentation style, knowledge-graph coverage, or annotation conventions.
  Authors: We agree that the reported phase differences rest on a single BSI-provided case study and that the lack of error bars, multiple runs, or cross-scenario replication prevents strong claims of generalizability. The 'RecPlast GmbH' reference remains valuable as an independent expert-generated dataset, but we will revise the Evaluation section to qualify the central claim, add an explicit limitations paragraph on threats to validity (including case-specific factors), and outline future multi-case validation. No new statistical analyses or runs will be added, as they were not performed.
  Revision: partial
- Referee: Methods section describing the Decoupled Reasoning Pipeline: The separation of agent-driven semantic extraction from deterministic protection-need inheritance is presented as a novel contribution, yet the manuscript provides no pseudocode, formal specification, or ablation showing that the deterministic component is strictly enforced rather than still relying on LLM output. This detail is load-bearing for the claim that the architecture enforces compliance rigor.
  Authors: The Decoupled Reasoning Pipeline is implemented such that semantic extraction uses agents while protection-need inheritance applies deterministic, rule-based logic directly to the extracted entities, bypassing LLM generation entirely. To strengthen this, we will add pseudocode and a formal specification of the pipeline in the revised Methods section, explicitly documenting the separation and confirming the deterministic step does not invoke the LLM. An ablation isolating the component is noted as desirable but may exceed space constraints.
  Revision: yes
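To make concrete what a deterministic inheritance step that does not invoke the LLM could look like, here is a minimal sketch. It assumes the maximum principle described in the BSI IT-Grundschutz methodology (an asset inherits the highest protection need of anything that depends on it) and deliberately ignores cumulation and distribution effects. The names `Need`, `inherit_needs`, and the sample graph are hypothetical; the paper's actual rules are not given in the text above.

```python
from enum import IntEnum


class Need(IntEnum):
    """Protection need categories, ordered so that max() is meaningful."""
    NORMAL = 1
    HIGH = 2
    VERY_HIGH = 3


def inherit_needs(process_needs, depends_on):
    """Propagate protection needs down the dependency graph (maximum principle).

    process_needs: needs assigned to top-level objects, e.g. business processes.
    depends_on: maps each object to the objects it depends on. In a decoupled
    design this mapping would come from the semantic extraction stage; this
    function itself contains no model call (assumption of this sketch).
    """
    needs = dict(process_needs)
    pending = list(process_needs)
    while pending:
        node = pending.pop()
        for child in depends_on.get(node, ()):
            # Raise the child's need if the parent's need is higher.
            if needs.get(child, 0) < needs[node]:
                needs[child] = needs[node]
                pending.append(child)  # re-propagate from the updated child
    return needs


# Illustrative graph only; identifiers are not from the RecPlast case study.
process_needs = {"P1": Need.HIGH, "P2": Need.NORMAL}
depends_on = {"P1": ["A1"], "P2": ["A1", "A2"], "A1": ["S1"], "A2": ["S1"]}

result = inherit_needs(process_needs, depends_on)
for name in sorted(result):
    print(name, result[name].name)
```

Because needs only ever increase and the scale is bounded, the worklist terminates even on cyclic graphs, and the same inputs always produce the same outputs. That repeatability is the property the referee asks the authors to demonstrate.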
Circularity Check
No significant circularity; evaluation grounded in external BSI case study
The paper's central empirical claims rest on direct computation of precision, recall, and F1-scores against the external BSI-provided 'RecPlast GmbH' human-expert reference dataset across SA, PNA, Modeling, and IT-GS Check phases. The reference to a prior conceptual framework supplies architectural context but does not define or force the reported performance numbers. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes reduce any result to its own inputs by construction. The derivation chain is therefore self-contained against an independent external benchmark.
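For readers unfamiliar with the metrics, the following sketch shows how precision, recall, and F1 can be computed when system output and expert reference are both treated as sets of items. Treating each phase's output as a set with exact-match comparison is an assumption of this sketch; the paper's matching rules are not described above, and the identifiers are made up.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 with exact-match comparison."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative identifiers only; not taken from the reference dataset.
predicted = {"S1", "S2", "S3", "S9"}
reference = {"S1", "S2", "S3", "S4", "S5"}

p, r, f1 = precision_recall_f1(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here three of four predicted items are correct (precision 0.75) and three of five reference items are found (recall 0.60).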
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The BSI RecPlast GmbH case study serves as valid and representative ground truth for end-to-end evaluation of the MAS across SA, PNA, Modeling, and IT-GS Check phases.