Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy

Dan Basher; Howard Parkinson; Mohammadreza Sheikhfathollahi; Regan Bolton; Simon Parkinson

arxiv: 2504.14044 · v1 · submitted 2025-04-18 · 💻 cs.AI · cs.CR

Regan Bolton, Mohammadreza Sheikhfathollahi, Simon Parkinson, Dan Basher, Howard Parkinson

Pith reviewed 2026-05-22 18:22 UTC · model grok-4.3

keywords Large Language Models · Cybersecurity Compliance · Operational Technology · Railway Systems · Retrieval-Augmented Generation · IEC 62443 · Compliance Verification

The pith

Parallel compliance architecture with LLMs improves correctness in railway OT cybersecurity verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a multi-stage retrieval system that uses large language models to verify compliance with operational technology cybersecurity standards in railways. It first tests a baseline architecture for answering queries against standards like IEC 62443 and IEC 63452, then introduces a parallel compliance architecture that supplies additional regulatory context. Empirical tests with GPT-4o and Claude-3.5-haiku show the parallel version raises both answer correctness and reasoning quality. The work also defines metrics to track correctness, logical reasoning, and hallucination. The approach addresses the shortage of cybersecurity experts while critical infrastructure faces growing digital threats.

Core claim

The Parallel Compliance Architecture (PCA), which adds regulatory excerpts to the prompt alongside the query, significantly improves both correctness and reasoning quality over the Baseline Compliance Architecture (BCA) when LLMs answer OTCS compliance queries for railway systems.

What carries the argument

The Parallel Compliance Architecture (PCA), a multi-stage retrieval method that supplies extra context drawn directly from regulatory standards to the LLM prompt.
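A minimal sketch of how the two architectures might differ when the prompt is assembled. It assumes a toy word-overlap retriever in place of whatever retriever the paper actually uses. The function names (`retrieve`, `build_bca_prompt`, `build_pca_prompt`), the sample documents, and the abbreviated template wording are all hypothetical; only the placeholder names `user_docs_str`, `context_str`, and `query_str` are taken from the prompt templates shown in the figures.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set used by the toy retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, corpus: list[str], top_k: int = 1) -> list[str]:
    """Toy retriever (an assumption): rank passages by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:top_k]

# Abbreviated, hypothetical templates; not the authors' full prompts.
BCA_TEMPLATE = (
    "**User Documentation**\n{user_docs_str}\n\n"
    "Answer the Question from the material above.\n"
    "**Question:** {query_str}"
)
PCA_TEMPLATE = (
    "**User Documentation**\n{user_docs_str}\n\n"
    "**Contextual Information**\n{context_str}\n\n"
    "Answer the Question from the material above.\n"
    "**Question:** {query_str}"
)

def build_bca_prompt(query: str, user_docs: list[str]) -> str:
    """Baseline: one retrieval stage over the user documentation only."""
    docs = "\n".join(retrieve(query, user_docs))
    return BCA_TEMPLATE.format(user_docs_str=docs, query_str=query)

def build_pca_prompt(query: str, user_docs: list[str], standards: list[str]) -> str:
    """Parallel: a second retrieval over regulatory text supplies extra context."""
    docs = "\n".join(retrieve(query, user_docs))
    context = "\n".join(retrieve(query, standards))
    return PCA_TEMPLATE.format(
        user_docs_str=docs, context_str=context, query_str=query
    )

# Made-up sample data; the "standards" entries are placeholders, not quotations.
USER_DOCS = [
    "Operators log in to the signalling workstation with individual accounts.",
    "Rolling stock is cleaned weekly at the depot.",
]
STANDARDS = [
    "Placeholder excerpt: individual accounts are required for operators.",
    "Placeholder excerpt: backups are tested periodically.",
]
QUERY = "Do operators use individual accounts?"

bca_prompt = build_bca_prompt(QUERY, USER_DOCS)
pca_prompt = build_pca_prompt(QUERY, USER_DOCS, STANDARDS)
```

The only structural difference between the two prompts in this sketch is the extra `context_str` block, which is the point of comparison between BCA and PCA described above.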

If this is right

Retrieval-augmented LLM approaches raise efficiency and accuracy of compliance assessments in regulated industries.
Defined metrics for correctness, reasoning, and hallucination provide a repeatable way to evaluate LLM outputs on technical standards.
The method offers a practical aid for sectors facing cybersecurity expertise shortages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same parallel retrieval pattern could be tested on compliance tasks in energy grids or water systems that use similar IEC standards.
Pairing the architecture with live updates to standards documents would reduce the need for manual re-indexing.
A follow-up study could measure how often experts accept or override the model's final compliance judgments in practice.

Load-bearing premise

The selected compliance queries and regulatory excerpts are representative of real operational technology challenges, and automated metrics for correctness and hallucination match what a domain expert would judge.

What would settle it

Domain experts manually scoring the same set of queries find no measurable gain in correctness or reasoning quality when the parallel regulatory context is added.

Figures

Figures reproduced from arXiv: 2504.14044 by Dan Basher, Howard Parkinson, Mohammadreza Sheikhfathollahi, Regan Bolton, Simon Parkinson.

**Figure 2.** System prompt for the BCA. Extracted prompt text (truncated):

You will be provided with some documentation.
=====================
**User Documentation**
=====================
{user_docs_str}
==================================================================
Based **solely** on the **User Documentation**, please answer the following **Question**.
**Question:** {query_str}
**Important Guidelines:**
- **Do NOT** use any prior knowledge or exter…

**Figure 3.** User prompt for the BCA. (Adjacent diagram labels: Input, Component, Prompt Template, Context Retriever, Document Retriever, LLM, Output.)

**Figure 4.** Flowchart of the parallel system architecture.

**Figure 5.** System prompt for the PCA. Extracted prompt text (truncated):

You will be provided with some documentation and supporting context:
=====================
**User Documentation**
=====================
{user_docs_str}
==================================================================
-------------------
**Contextual Information**
-------------------
{context_str}
------------------------------------------------------------------
Based **solely**…

**Figure 6.** User prompt for the PCA.

**Figure 7.** Evaluation of BCA Hallucination by LLM: Factual

**Figure 8.** Results of the human evaluation for BCA: correctness

**Figure 11.** Results of the human evaluation on reasoning for

Original abstract

Operational Technology Cybersecurity (OTCS) continues to be a dominant challenge for critical infrastructure such as railways. As these systems become increasingly vulnerable to malicious attacks due to digitalization, effective documentation and compliance processes are essential to protect these safety-critical systems. This paper proposes a novel system that leverages Large Language Models (LLMs) and multi-stage retrieval to enhance the compliance verification process against standards like IEC 62443 and the rail-specific IEC 63452. We first evaluate a Baseline Compliance Architecture (BCA) for answering OTCS compliance queries, then develop an extended approach called Parallel Compliance Architecture (PCA) that incorporates additional context from regulatory standards. Through empirical evaluation comparing OpenAI-gpt-4o and Claude-3.5-haiku models in these architectures, we demonstrate that the PCA significantly improves both correctness and reasoning quality in compliance verification. Our research establishes metrics for response correctness, logical reasoning, and hallucination detection, highlighting the strengths and limitations of using LLMs for compliance verification in railway cybersecurity. The results suggest that retrieval-augmented approaches can significantly improve the efficiency and accuracy of compliance assessments, particularly valuable in an industry facing a shortage of cybersecurity expertise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This applies multi-stage retrieval to railway OT compliance queries and reports gains from the parallel setup on automated metrics, but the results depend on those metrics tracking real expert judgment.

Hi colleague, this paper's main point is that a parallel compliance architecture with extra regulatory context lifts LLM answers to OT cybersecurity queries over a baseline, at least according to their scores for correctness, reasoning quality, and hallucination. They test this on GPT-4o and Claude-3.5-haiku using standards like IEC 62443 and the rail-specific IEC 63452.

The domain choice fits a real need in critical infrastructure where expertise is short, and they give a head-to-head comparison rather than just describing one system. Defining those three metrics is a practical step for anyone benchmarking LLM outputs on regulatory text. The architectures are described clearly enough to replicate the basic idea.

The soft spot is the evaluation. The claimed improvements rest on automated metrics with no demonstrated correlation to how a railway OT expert would score the same responses, especially on points like safety function allocation or residual risk. The abstract gives little on query selection, evaluation rubrics, or statistical tests, so it is hard to tell how robust the gains are or whether they would hold on broader operational questions. If the metrics diverge from expert judgment, the results stay suggestive.

This would interest readers building LLM tools for compliance checks in transport or energy sectors. They could borrow the multi-stage pattern, but anyone needing proof of practical utility would want more validation work. I would send it to peer review because the setup is concrete and the metric-validity issue is something referees could address directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Parallel Compliance Architecture (PCA) that extends a Baseline Compliance Architecture (BCA) with multi-stage retrieval from regulatory standards (IEC 62443, IEC 63452) to answer Operational Technology Cybersecurity (OTCS) queries for railway systems. It evaluates the two architectures using GPT-4o and Claude-3.5-haiku, reports that PCA yields higher correctness and reasoning quality, and defines automated metrics for correctness, logical reasoning, and hallucination detection.

Significance. If the empirical gains are robust, the work could help address the shortage of OT cybersecurity expertise in critical infrastructure by improving the efficiency of compliance checks against rail-specific standards. The case-study framing and explicit comparison of retrieval-augmented versus baseline LLM prompting are practical strengths.

major comments (2)

[Evaluation / Results] Evaluation section (and associated results tables): the central claim that PCA 'significantly improves both correctness and reasoning quality' rests on automated metrics whose correlation with domain-expert judgments on regulatory compliance is not demonstrated. Without inter-rater agreement, rubric details, or expert validation on the same query set, the reported improvements cannot be confirmed as evidence of better compliance verification.
[Methods] Query selection and dataset construction (Methods): the manuscript provides no description of how the compliance queries were chosen, whether they cover the full distribution of operational railway OT questions (e.g., safety-function allocation, residual-risk statements), or how regulatory excerpts were sampled. This directly affects the generalizability of the PCA improvement claim.
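The inter-rater agreement requested in the first major comment could be reported with a chance-corrected statistic such as Cohen's kappa. The sketch below is one way to compute it between automated-metric labels and expert labels for the same responses. The choice of kappa, the function name `cohens_kappa`, and the label lists are assumptions for illustration; none of them come from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Made-up labels for six responses (illustrative only, not data from the paper).
automated = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
expert = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]

kappa = cohens_kappa(automated, expert)
```

With these made-up labels the two raters agree on 4 of 6 items, but half of that agreement is expected by chance, so kappa comes out at about 0.33. That gap between raw agreement and kappa is why a chance-corrected figure is more informative here than a simple match rate.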

minor comments (2)

[Architecture] Clarify the exact retrieval stages and prompt templates used in PCA versus BCA; a diagram or pseudocode would aid reproducibility.
[Results] The abstract states 'we demonstrate that the PCA significantly improves…' but the results section should report effect sizes, confidence intervals, or statistical tests rather than qualitative descriptors alone.
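One way to produce the effect size and confidence interval requested in the second minor comment is a paired bootstrap over per-query score differences. The sketch below assumes each query was scored under both architectures with a binary correctness label. The function name `paired_bootstrap_ci`, the resample count, and the scores are hypothetical, and the percentile bootstrap is only one of several reasonable interval methods.

```python
import random
from statistics import mean

def paired_bootstrap_ci(
    baseline: list[float],
    treatment: list[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Mean paired difference (treatment - baseline) with a percentile bootstrap CI."""
    if len(baseline) != len(treatment) or not baseline:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample the per-query differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    low_idx = int((1.0 - confidence) / 2.0 * n_resamples)
    high_idx = int((1.0 + confidence) / 2.0 * n_resamples) - 1
    return mean(diffs), resampled_means[low_idx], resampled_means[high_idx]

# Made-up per-query correctness scores (1 = correct), not results from the paper.
bca_scores = [0, 1, 0, 0, 1, 0, 1, 0]
pca_scores = [1, 1, 0, 1, 1, 1, 1, 0]

effect, ci_low, ci_high = paired_bootstrap_ci(bca_scores, pca_scores)
```

Pairing by query matters because both architectures answer the same questions; resampling the differences keeps query difficulty from inflating the apparent variance. If the resulting interval includes zero, "significantly improves" would need softening.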

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and constructive feedback on our paper. We have addressed each of the major comments in detail below. We will make revisions to the manuscript to incorporate clarifications and additional details as outlined in our responses.

Referee: Evaluation section (and associated results tables): the central claim that PCA 'significantly improves both correctness and reasoning quality' rests on automated metrics whose correlation with domain-expert judgments on regulatory compliance is not demonstrated. Without inter-rater agreement, rubric details, or expert validation on the same query set, the reported improvements cannot be confirmed as evidence of better compliance verification.

Authors: We agree that demonstrating correlation between our automated metrics and domain-expert judgments would provide stronger evidence for the improvements. The current manuscript defines the metrics for correctness, logical reasoning, and hallucination detection based on logical and factual criteria suitable for compliance queries. However, we did not perform expert validation or report inter-rater agreement in this study. In the revised version, we will expand the Evaluation section to provide full rubric details and add a dedicated limitations paragraph acknowledging the absence of expert validation and outlining plans for future work in this direction. This will temper the claims appropriately while retaining the value of the comparative results. revision: yes
Referee: Query selection and dataset construction (Methods): the manuscript provides no description of how the compliance queries were chosen, whether they cover the full distribution of operational railway OT questions (e.g., safety-function allocation, residual-risk statements), or how regulatory excerpts were sampled. This directly affects the generalizability of the PCA improvement claim.

Authors: We appreciate this observation. The queries were curated to reflect typical OT cybersecurity compliance inquiries in railway contexts, informed by the standards IEC 62443 and IEC 63452, with an emphasis on practical operational scenarios. Regulatory excerpts were sampled from key sections relevant to the queries. To improve transparency and address generalizability concerns, we will revise the Methods section to include a detailed description of the query selection criteria, the range of topics covered (including safety-function allocation and risk-related statements), and the sampling approach for regulatory documents. We believe this addition will strengthen the manuscript without altering the core findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical evaluation of retrieval architectures on external standards

The paper conducts an empirical comparison of Baseline Compliance Architecture (BCA) versus Parallel Compliance Architecture (PCA) using GPT-4o and Claude-3.5-haiku on OTCS queries drawn from IEC 62443 and IEC 63452. It defines automated metrics for correctness, reasoning quality, and hallucination, then reports that PCA yields higher scores. No equations or derivations are present that reduce a claimed result to its own inputs by construction. No load-bearing self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided text. The evaluation uses held-out queries against external regulatory excerpts, rendering the central claim self-contained against benchmarks rather than tautological. This matches the expected honest non-finding for an applied empirical case study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach implicitly assumes standard LLM capabilities and retrieval effectiveness from prior literature.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We first evaluate a Baseline Compliance Architecture (BCA) ... then develop an extended approach called Parallel Compliance Architecture (PCA) that incorporates additional context from regulatory standards.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: The retrieval component pη(y|x) ... hybrid query mode ... α = 0.75

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Document Retrieval Augmented Fine-Tuning (DRAFT) for safety-critical software assessments
cs.SE 2025-05 unverdicted novelty 5.0

DRAFT fine-tunes LLMs with a dual-retrieval architecture and semi-automated datasets containing distractors to achieve 7% higher correctness in safety compliance assessments.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 1 Pith paper · 3 internal anchors


[1] [1]

A review on cybersecurity in railways,

R. Kour, A. Patwardhan, A. Thaduri, and R. Karim, “A review on cybersecurity in railways,” Proceedings of the Institution of Mechanical Engineers, Part F: Journal of Rail and Rapid Transit , vol. 237, no. 1, pp. 3–20, 2023

work page 2023

[2] [2]

Change management in digitalised operation and maintenance of rail- way,

V . J ¨agare, R. Karim, P. S ¨oderholm, P.-O. Larsson-Kr ˚aik, and U. Juntti, “Change management in digitalised operation and maintenance of rail- way,” in International Heavy Haul Association (IHHA) STS 2019, 10- 14th June 2019, Narvik, Norway. , 2019, pp. 904–911

work page 2019

[3] [3]

Cyber-physical security risk assessment for train control and monitoring systems,

M. Rekik, C. Gransart, and M. Berbineau, “Cyber-physical security risk assessment for train control and monitoring systems,” in 2018 IEEE Conference on Communications and Network Security (CNS) . IEEE, 2018, pp. 1–9

work page 2018

[4] [4]

Aligning cyber-physical system safety and security,

G. Sabaliauskaite and A. P. Mathur, “Aligning cyber-physical system safety and security,” in Complex Systems Design & Management Asia: Designing Smart Cities: Proceedings of the First Asia-Pacific Confer- ence on Complex Systems Design & Management, CSD&M Asia 2014 . Springer, 2015, pp. 41–53

work page 2014

[5] [5]

emaintenance in railways: Issues and challenges in cybersecurity,

R. Kour, M. Aljumaili, R. Karim, and P. Tretten, “emaintenance in railways: Issues and challenges in cybersecurity,” Proceedings of the Institution of Mechanical Engineers, Part F: Journal of Rail and Rapid Transit, vol. 233, no. 10, pp. 1012–1022, 2019

work page 2019

[6] [6]

Iec62443 suite of standards,

“Iec62443 suite of standards,” 2024. [Online]. Available: https://www.isa.org/standards-and-publications/isa-standards/ isa-iec-62443-series-of-standards

work page 2024

[7] [7]

BS EN IEC 63452 Ed.1.0 Railway applications - Cybersecurity,

BSI Group, “BS EN IEC 63452 Ed.1.0 Railway applications - Cybersecurity,” 2024. [Online]. Available: https://standardsdevelopment. bsigroup.com/projects/2022-01003/section

work page 2024

[8] [8]

Best practices for cybersecurity compliance monitoring,

A. T. Tunggal, “Best practices for cybersecurity compliance monitoring,” 2024, updated April 21, 2024. [Online]. Available: https://www.upguard. com/blog/compliance-monitoring 10

work page 2024

[9] J. M. Stewart, E. Tittel, and M. Chapple, CISSP: Certified information systems security professional study guide. John Wiley & Sons, 2011.

[10] R. Amor and J. Dimyadi, “The promise of automated compliance checking,” Developments in the Built Environment, vol. 5, p. 100039, 2021.

[11] S. Malsane, J. Matthews, S. Lockley, P. E. Love, and D. Greenwood, “Development of an object model for automated compliance checking,” Automation in Construction, vol. 49, pp. 51–58, 2015.

[12] J. Zhang and N. M. El-Gohary, “Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking,” Journal of Computing in Civil Engineering, vol. 30, no. 2, p. 04015014, 2016.

[13] S. Hore, “An introduction to large language models (LLMs),”

[14] [Online]. Available: https://www.analyticsvidhya.com/blog/2023/03/an-introduction-to-large-language-models-llms/

[15] S. Jose, K. T. Nguyen, K. Medjaher, R. Zemouri, M. Lévesque, and A. Tahan, “Advancing multimodal diagnostics: Integrating industrial textual data and domain knowledge with large language models,” Expert Systems with Applications, vol. 255, p. 124603, 2024.

[16] C. Oh, M. Park, S. Lim, and K. Song, “Language model-guided student performance prediction with multimodal auxiliary information,” Expert Systems with Applications, vol. 250, p. 123960, 2024.

[17] M. Sivakumar, A. B. Belle, J. Shan, and K. K. Shahandashti, “Prompting GPT-4 to support automatic safety case generation,” Expert Systems with Applications, vol. 255, p. 124653, 2024.

[18] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

[19] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.

[20] Y. Zhang, S. Mao, T. Ge, X. Wang, A. de Wynter, Y. Xia, W. Wu, T. Song, M. Lan, and F. Wei, “LLM as a mastermind: A survey of strategic reasoning with large language models,” arXiv preprint arXiv:2404.01230, 2024.

[21] B. Perak, S. Beliga, and A. Meštrović, “Incorporating dialect understanding into LLM using RAG and prompt engineering techniques for causal commonsense reasoning,” in Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), 2024, pp. 220–229.

[22] F. Bianchini, M. Calamo, F. De Luzi, M. Macrì, and M. Mecella, “Enhancing complex linguistic tasks resolution through fine-tuning LLMs, RAG and knowledge graphs (short paper),” in International Conference on Advanced Information Systems Engineering. Springer, 2024, pp. 147–155.

[23] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.

[24] Google Cloud, “Retrieval-augmented generation (RAG),” 2024, accessed: August 7, 2024. [Online]. Available: https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en

[25] J. Dimyadi and R. Amor, “Automated building code compliance checking–where is it at,” Proceedings of CIB WBC, vol. 6, no. 1, 2013.

[26] X. Liu, H. Li, and X. Zhu, “A GPT-based method of automated compliance checking through prompt engineering,” 2023.

[27] J. G. Wrightson, P. Blazey, K. M. Khan, and C. L. Ardern, “GPT for RCTs?: Using AI to measure adherence to reporting guidelines,” medRxiv, pp. 2023–12, 2023.

[28] C. Arora, J. Grundy, L. Puli, and N. Layton, “Towards standards-compliant assistive technology product specifications via LLMs,” arXiv preprint arXiv:2404.03122, 2024.

[29] T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen, “Long-context LLMs struggle with long in-context learning,” arXiv preprint arXiv:2404.02060, 2024.

[30] A. Salemi and H. Zamani, “Evaluating retrieval quality in retrieval-augmented generation,” in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2395–2400.

[31] Cohere, “Cohere-embed-english-v3.0,” https://huggingface.co/Cohere/Cohere-embed-english-v3.0, Cohere, 2024, accessed: 2025-02-26.

[32] K. Wiggers, “OpenAI debuts GPT-4o ‘omni’ model now powering ChatGPT,” TechCrunch, retrieved May 16, 2024.

[33] M. Rahman, S. Khatoonabadi, A. Abdellatif, and E. Shihab, “Automatic detection of LLM-generated code: A case study of Claude 3 Haiku,” arXiv preprint arXiv:2409.01382, 2024.

[34] OpenRouter, “OpenRouter: API for accessing open-source and proprietary LLMs,” https://openrouter.ai/, 2023, accessed: 2024-09-20.

[35] J. Liu, “LlamaIndex,” Nov. 2022. [Online]. Available: https://github.com/jerryjliu/llama_index

[36] G. Van Rossum and F. L. Drake Jr, Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995.

[37] J. Liu, “Introducing LlamaCloud and LlamaParse - LlamaIndex - Build knowledge assistants over your enterprise data,” Feb 2024. [Online]. Available: https://www.llamaindex.ai/blog/introducing-llamacloud-and-llamaparse-af8cedf9006b

[38] I. Cheong, K. Xia, K. K. Feng, Q. Z. Chen, and A. X. Zhang, “(A)I am not a lawyer, but...: Engaging legal experts towards responsible LLM policies for legal advice,” in The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024, pp. 2454–2469.

[39] O. Oniagbi, A. Hakkala, and I. Hasanov, “Evaluation of LLM agents for the SOC tier 1 analyst triage process,” 2024.

[40] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023.

[41] Arize AI, “Phoenix: Open-source ML observability and performance debugging,” https://github.com/Arize-ai/phoenix, 2023, accessed: 2024-09-20.

[42] A. Panickssery, S. R. Bowman, and S. Feng, “LLM evaluators recognize and favor their own generations,” arXiv preprint arXiv:2404.13076, 2024.

[43] A. S. Thakur, K. Choudhary, V. S. Ramayapally, S. Vaidyanathan, and D. Hupkes, “Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges,” arXiv preprint arXiv:2406.12624, 2024.

[44] A. Patwardhan, A. Thaduri, and R. Karim, “Distributed ledger for cybersecurity: issues and challenges in railways,” Sustainability, vol. 13, no. 18, p. 10176, 2021.