Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
Pith reviewed 2026-05-22 18:22 UTC · model grok-4.3
The pith
Parallel compliance architecture with LLMs improves correctness in railway OT cybersecurity verification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Parallel Compliance Architecture (PCA), which supplies regulatory excerpts to the LLM alongside the query, significantly improves both correctness and reasoning quality over the Baseline Compliance Architecture (BCA) when LLMs answer OTCS compliance queries for railway systems.
What carries the argument
The Parallel Compliance Architecture (PCA), a multi-stage retrieval method that supplies extra context drawn directly from regulatory standards to the LLM prompt.
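The difference between the two architectures can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes the baseline retrieves only from project documentation (the text here does not confirm this), it swaps in a toy word-overlap retriever, and every name (`Passage`, `retrieve`, `bca_prompt`, `pca_prompt`) and every corpus string is a hypothetical placeholder rather than text from IEC 62443 or IEC 63452.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str  # illustrative label: "project-doc" or "standard"
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {tok.strip(".,;:()?").lower() for tok in text.split()} - {""}


def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Toy retriever (assumption): rank passages by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p.text)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Assemble a prompt with source-tagged context followed by the query."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    return f"Context:\n{context}\n\nCompliance query: {query}\nAnswer with reasoning."


def bca_prompt(query: str, project_docs: list[Passage], k: int = 2) -> str:
    """Baseline sketch: context comes from project documentation only."""
    return build_prompt(query, retrieve(query, project_docs, k))


def pca_prompt(query: str, project_docs: list[Passage],
               standards: list[Passage], k: int = 2) -> str:
    """Parallel sketch: a second retrieval over the standards adds regulatory context."""
    passages = retrieve(query, project_docs, k) + retrieve(query, standards, k)
    return build_prompt(query, passages)


# Hypothetical placeholder data; none of it is quoted from a real standard.
QUERY = "Does remote maintenance access meet the authentication requirement?"
PROJECT_DOCS = [
    Passage("project-doc", "Placeholder: remote maintenance access uses a shared account."),
    Passage("project-doc", "Placeholder: the signalling network is separated from office IT."),
]
STANDARDS = [
    Passage("standard", "Placeholder requirement: remote access shall use individual user authentication."),
    Passage("standard", "Placeholder requirement: network zones shall be documented."),
]

if __name__ == "__main__":
    print(bca_prompt(QUERY, PROJECT_DOCS, k=1))
    print("---")
    print(pca_prompt(QUERY, PROJECT_DOCS, STANDARDS, k=1))
```

With `k=1`, the baseline prompt carries only the project-documentation passage, while the parallel prompt also carries the placeholder requirement the answer has to be checked against. A real system would replace `retrieve` with an embedding or hybrid retriever.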
If this is right
- Retrieval-augmented LLM approaches raise efficiency and accuracy of compliance assessments in regulated industries.
- Defined metrics for correctness, reasoning, and hallucination provide a repeatable way to evaluate LLM outputs on technical standards.
- The method offers a practical aid for sectors facing cybersecurity expertise shortages.
Where Pith is reading between the lines
- The same parallel retrieval pattern could be tested on compliance tasks in energy grids or water systems that use similar IEC standards.
- Pairing the architecture with live updates to standards documents would reduce the need for manual re-indexing.
- A follow-up study could measure how often experts accept or override the model's final compliance judgments in practice.
Load-bearing premise
The selected compliance queries and regulatory excerpts are representative of real operational technology challenges, and automated metrics for correctness and hallucination match what a domain expert would judge.
What would settle it
Domain experts manually scoring the same set of queries find no measurable gain in correctness or reasoning quality when the parallel regulatory context is added.
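One way to make this test concrete is a paired comparison of expert scores for the same queries under both architectures. The sketch below computes a percentile bootstrap confidence interval for the mean per-query difference; an interval that includes zero would correspond to "no measurable gain". The function name, the 1-to-5 scale, and the score lists are all assumptions for illustration; the numbers are invented placeholders, not results from the paper.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, treatment, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query score difference.

    Returns (observed_mean_difference, lower_bound, upper_bound).
    """
    if len(baseline) != len(treatment) or not baseline:
        raise ValueError("need two equal-length, non-empty score lists")
    # Differences are taken per query, so each query acts as its own control.
    diffs = [t - b for b, t in zip(baseline, treatment)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-query differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Invented placeholder expert scores (1 to 5) for eight queries.
BCA_SCORES = [3, 2, 4, 3, 3, 2, 4, 3]
PCA_SCORES = [4, 3, 4, 4, 3, 3, 5, 4]

if __name__ == "__main__":
    observed, lower, upper = paired_bootstrap_ci(BCA_SCORES, PCA_SCORES)
    print(f"mean difference {observed:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The pairing matters: resampling differences rather than the two score lists independently removes query-to-query difficulty from the comparison.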
Original abstract
Operational Technology Cybersecurity (OTCS) continues to be a dominant challenge for critical infrastructure such as railways. As these systems become increasingly vulnerable to malicious attacks due to digitalization, effective documentation and compliance processes are essential to protect these safety-critical systems. This paper proposes a novel system that leverages Large Language Models (LLMs) and multi-stage retrieval to enhance the compliance verification process against standards like IEC 62443 and the rail-specific IEC 63452. We first evaluate a Baseline Compliance Architecture (BCA) for answering OTCS compliance queries, then develop an extended approach called Parallel Compliance Architecture (PCA) that incorporates additional context from regulatory standards. Through empirical evaluation comparing OpenAI-gpt-4o and Claude-3.5-haiku models in these architectures, we demonstrate that the PCA significantly improves both correctness and reasoning quality in compliance verification. Our research establishes metrics for response correctness, logical reasoning, and hallucination detection, highlighting the strengths and limitations of using LLMs for compliance verification in railway cybersecurity. The results suggest that retrieval-augmented approaches can significantly improve the efficiency and accuracy of compliance assessments, particularly valuable in an industry facing a shortage of cybersecurity expertise.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a Parallel Compliance Architecture (PCA) that extends a Baseline Compliance Architecture (BCA) with multi-stage retrieval from regulatory standards (IEC 62443, IEC 63452) to answer Operational Technology Cybersecurity (OTCS) queries for railway systems. It evaluates the two architectures using GPT-4o and Claude-3.5-haiku, reports that PCA yields higher correctness and reasoning quality, and defines automated metrics for correctness, logical reasoning, and hallucination detection.
Significance. If the empirical gains are robust, the work could help address the shortage of OT cybersecurity expertise in critical infrastructure by improving the efficiency of compliance checks against rail-specific standards. The case-study framing and explicit comparison of retrieval-augmented versus baseline LLM prompting are practical strengths.
major comments (2)
- [Evaluation / Results] Evaluation section (and associated results tables): the central claim that PCA 'significantly improves both correctness and reasoning quality' rests on automated metrics whose correlation with domain-expert judgments on regulatory compliance is not demonstrated. Without inter-rater agreement, rubric details, or expert validation on the same query set, the reported improvements cannot be confirmed as evidence of better compliance verification.
- [Methods] Query selection and dataset construction (Methods): the manuscript provides no description of how the compliance queries were chosen, whether they cover the full distribution of operational railway OT questions (e.g., safety-function allocation, residual-risk statements), or how regulatory excerpts were sampled. This directly affects the generalizability of the PCA improvement claim.
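The agreement check requested in the first major comment could be sketched as Cohen's kappa between the automated metric's verdicts and an expert's verdicts on the same queries. This is a minimal, self-contained illustration: the function name, the label vocabulary, and both label lists are hypothetical placeholders, not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented placeholder verdicts for eight queries.
AUTO_LABELS = ["correct", "correct", "incorrect", "correct",
               "incorrect", "incorrect", "correct", "correct"]
EXPERT_LABELS = ["correct", "incorrect", "incorrect", "correct",
                 "incorrect", "correct", "correct", "correct"]

if __name__ == "__main__":
    print(f"kappa = {cohens_kappa(AUTO_LABELS, EXPERT_LABELS):.3f}")
```

For these placeholder labels the raters agree on 6 of 8 items and chance agreement is 34/64, so kappa is 7/15. With more than two raters or ordinal scores, a weighted kappa or Krippendorff's alpha would be the more usual choice.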
minor comments (2)
- [Architecture] Clarify the exact retrieval stages and prompt templates used in PCA versus BCA; a diagram or pseudocode would aid reproducibility.
- [Results] The abstract states 'we demonstrate that the PCA significantly improves…' but the results section should report effect sizes, confidence intervals, or statistical tests rather than qualitative descriptors alone.
Simulated Author's Rebuttal
We are grateful to the referee for their thorough review and constructive feedback on our paper. We have addressed each of the major comments in detail below. We will make revisions to the manuscript to incorporate clarifications and additional details as outlined in our responses.
- Referee: Evaluation section (and associated results tables): the central claim that PCA 'significantly improves both correctness and reasoning quality' rests on automated metrics whose correlation with domain-expert judgments on regulatory compliance is not demonstrated. Without inter-rater agreement, rubric details, or expert validation on the same query set, the reported improvements cannot be confirmed as evidence of better compliance verification.
  Authors: We agree that demonstrating correlation between our automated metrics and domain-expert judgments would provide stronger evidence for the improvements. The current manuscript defines the metrics for correctness, logical reasoning, and hallucination detection based on logical and factual criteria suitable for compliance queries. However, we did not perform expert validation or report inter-rater agreement in this study. In the revised version, we will expand the Evaluation section to provide full rubric details and add a dedicated limitations paragraph acknowledging the absence of expert validation and outlining plans for future work in this direction. This will temper the claims appropriately while retaining the value of the comparative results.
  Revision: yes
- Referee: Query selection and dataset construction (Methods): the manuscript provides no description of how the compliance queries were chosen, whether they cover the full distribution of operational railway OT questions (e.g., safety-function allocation, residual-risk statements), or how regulatory excerpts were sampled. This directly affects the generalizability of the PCA improvement claim.
  Authors: We appreciate this observation. The queries were curated to reflect typical OT cybersecurity compliance inquiries in railway contexts, informed by the standards IEC 62443 and IEC 63452, with an emphasis on practical operational scenarios. Regulatory excerpts were sampled from key sections relevant to the queries. To improve transparency and address generalizability concerns, we will revise the Methods section to include a detailed description of the query selection criteria, the range of topics covered (including safety-function allocation and risk-related statements), and the sampling approach for regulatory documents. We believe this addition will strengthen the manuscript without altering the core findings.
  Revision: yes
Circularity Check
No significant circularity: empirical evaluation of retrieval architectures on external standards
The paper conducts an empirical comparison of the Baseline Compliance Architecture (BCA) and the Parallel Compliance Architecture (PCA) using GPT-4o and Claude-3.5-haiku on OTCS queries drawn from IEC 62443 and IEC 63452. It defines automated metrics for correctness, reasoning quality, and hallucination, then reports that PCA yields higher scores. No equations or derivations are present that reduce a claimed result to its own inputs by construction. No load-bearing self-citations, fitted parameters renamed as predictions, or ansatzes smuggled in via prior work appear in the provided text. The evaluation uses held-out queries against external regulatory excerpts, so the central claim is tested against external benchmarks rather than being tautological. This matches the expected honest non-finding for an applied empirical case study.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We first evaluate a Baseline Compliance Architecture (BCA) ... then develop an extended approach called Parallel Compliance Architecture (PCA) that incorporates additional context from regulatory standards."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The retrieval component p_η(y|x) ... hybrid query mode ... α = 0.75"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Document Retrieval Augmented Fine-Tuning (DRAFT) for safety-critical software assessments: DRAFT fine-tunes LLMs with a dual-retrieval architecture and semi-automated datasets containing distractors to achieve 7% higher correctness in safety compliance assessments.