SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response
Pith reviewed 2026-05-08 16:09 UTC · model grok-4.3
The pith
A deterministic verifier removes 466 non-compliant approval-gated actions from LLM-generated incident plans without lowering recall on core tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SOCpilot holds the incident package, action catalog, policy rules, verifier, and evidence surface constant so that a deterministic checker can measure and enforce compliance on LLM-proposed action traces. Applied to plans from two providers across 200 incidents, the verifier removes 466 actions that breach approval gates, concentrated in recovery and containment, while baseline-task recall stays level with analyst-authored SOAR references. Because the verifier is deterministic and the corpus is fixed, aggregate compliance rates are identical across three reruns. The system also exposes zero-cost checks for mandatory-step and ordering repairs.
What carries the argument
The deterministic verifier that inspects the full action trace against fixed policy rules encoding mandatory steps, required ordering, and explicit approval gates.
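The verifier described above can be sketched as a pure function over the action trace. This is a minimal, hypothetical illustration: the rule tables (`MANDATORY`, `ORDER`, `APPROVAL_GATED`), the action names, and the trace format are assumptions, since the paper does not publish its actual rule encoding.

```python
# Hypothetical rule encoding; the paper's real rules are not specified.
MANDATORY = ["triage", "scope"]          # steps every plan must contain
ORDER = [("scope", "contain")]           # "scope" must precede "contain"
APPROVAL_GATED = {"contain", "recover"}  # require an explicit approval flag

def verify(trace):
    """Check one plan deterministically.

    `trace` is a list of dicts like {"action": "contain", "approved": False}.
    Returns (kept_actions, removed_actions, violations).
    """
    kept, removed, violations = [], [], []
    # Approval gates: drop gated actions proposed without approval.
    for step in trace:
        if step["action"] in APPROVAL_GATED and not step.get("approved"):
            removed.append(step)
            violations.append(("approval_gate", step["action"]))
        else:
            kept.append(step)

    names = [s["action"] for s in kept]
    # Mandatory steps must appear somewhere in the kept trace.
    for m in MANDATORY:
        if m not in names:
            violations.append(("missing_mandatory", m))
    # Each ordering pair (a, b) requires a's first occurrence before b's.
    for a, b in ORDER:
        if a in names and b in names and names.index(a) > names.index(b):
            violations.append(("bad_order", (a, b)))
    return kept, removed, violations
```

Because the function is deterministic, rerunning it on a fixed corpus necessarily reproduces the same aggregate counts, which is the shape of the paper's stability claim.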
If this is right
- Compliance becomes checkable at the plan boundary before any analyst sees the LLM output.
- The same fixed rules and verifier can be applied to outputs from different LLM providers to compare their compliance behavior under identical policy text.
- Focus on approval-gated actions for recovery and containment isolates the highest-risk decisions in the response workflow.
- Releasing the runnable artifact makes the compliance counts independently reproducible without access to private incident data.
- Zero-cost readiness checks for ordering and mandatory-step repairs can be run on any new plan without additional LLM calls.
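Under the same assumed rule encoding, the zero-cost readiness checks in the last bullet can be computed as a pure function of the plan, with no additional LLM call. `suggest_repairs` and its rule arguments are illustrative assumptions, not the artifact's actual interface.

```python
def suggest_repairs(names, mandatory, order):
    """Propose deterministic fixes for a plan's action-name sequence.

    `names` is the ordered list of action names in the plan; `mandatory`
    and `order` are hypothetical rule tables (see the verifier sketch).
    """
    repairs = []
    # Insert any mandatory step the plan is missing.
    for m in mandatory:
        if m not in names:
            repairs.append(("insert", m))
    # Flag step pairs that violate the required ordering.
    for a, b in order:
        if a in names and b in names and names.index(a) > names.index(b):
            repairs.append(("reorder", (a, b)))
    return repairs
```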
Where Pith is reading between the lines
- The fixed-corpus design could be reused to test whether prompt changes or fine-tuning reduce the number of flagged actions before the verifier runs.
- Similar boundary verifiers could be built for other regulated domains where LLMs generate action sequences that must respect approval or ordering constraints.
- Integrating the repair suggestions directly into a SOAR platform would let analysts accept or override the deterministic fixes in one step.
Load-bearing premise
The policy rules and action catalog given to the verifier correctly and completely capture every mandatory compliance requirement that applies to the 200 incidents.
What would settle it
An independent audit by the original SOC analysts that finds any of the 466 removed actions were actually performed in the paired human reference plans or were required under the real operational policy.
original abstract
Security operations centers (SOCs) are beginning to use large language models (LLMs) as copilots to draft incident-response plans. These plans may include actions that are valid per the catalog but still violate mandatory steps, required ordering, or approval gates before analyst review. SOCpilot makes this compliance question measurable at the plan boundary. It fixes the incident package, action catalog, policy rules, verifier, and public evidence surface. Next, it verifies the copilot's proposed action trace. We evaluate two LLM providers on 200 real incidents from an anonymized production SOC in a financial-sector case study. We compare their plans to paired analyst-authored references from the same security orchestration, automation, and response (SOAR) cases. An identical inline policy text moves the two providers in opposite directions. A deterministic verifier removes 466 non-compliant, approval-gated actions, without reducing baseline-task recall. Aggregate rates remain stable across 3 reruns of the fixed corpus. The official evidence focuses on approval-gated decisions regarding recovery and containment. Separately, the artifact exposes zero-cost readiness checks for mandatory and ordering repairs. We release the runnable artifact so independent reviewers can rederive the public results without access to private incident data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SOCpilot as a framework to make policy compliance measurable for LLM-generated incident-response plans in SOCs. It fixes the incident package, action catalog, inline policy text, and deterministic verifier, then evaluates two LLM providers on 200 real incidents from an anonymized financial-sector production SOC. Plans are compared against paired analyst-authored SOAR references; the verifier removes 466 non-compliant approval-gated actions without reducing baseline recall, with aggregate rates stable across three reruns of the fixed corpus. The artifact is released to allow independent rederivation of public results.
Significance. If the central results hold, the work supplies a practical, reproducible method for quantifying compliance gaps in LLM copilots for security operations, using real production incidents and analyst references. The release of the runnable artifact is a clear strength, enabling external verification of the reported counts and stability without access to private data.
major comments (2)
- [Abstract] Abstract and evaluation description: the reported removal of 466 non-compliant approval-gated actions (and the claim of no recall reduction) is produced by a deterministic verifier whose encoding of mandatory steps, ordering constraints, and approval gates is not specified or validated. The manuscript states that an identical inline policy text is used but provides no derivation, pseudocode, or cross-check against actual production SOC mandatory requirements, so any mismatch in the fixed rules directly alters the 466 count and stability claim.
- [Evaluation] Evaluation section: the stability of aggregate rates across three reruns is asserted for the fixed corpus, yet no description is given of how the verifier handles ordering or approval gates in the action trace, nor are error bars or per-incident breakdowns supplied. This leaves the load-bearing quantitative claim dependent on an unreviewed implementation.
minor comments (1)
- The abstract refers to 'zero-cost readiness checks for mandatory and ordering repairs' without defining what these checks consist of or how they are exposed in the artifact.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and the value of the released artifact. We agree that the manuscript would benefit from greater detail on the verifier and will revise accordingly to make the quantitative claims more transparent and self-contained.
point-by-point responses
-
Referee: [Abstract] Abstract and evaluation description: the reported removal of 466 non-compliant approval-gated actions (and the claim of no recall reduction) is produced by a deterministic verifier whose encoding of mandatory steps, ordering constraints, and approval gates is not specified or validated. The manuscript states that an identical inline policy text is used but provides no derivation, pseudocode, or cross-check against actual production SOC mandatory requirements, so any mismatch in the fixed rules directly alters the 466 count and stability claim.
Authors: We acknowledge the need for more explicit specification. In the revised manuscript we will add a dedicated subsection to the Evaluation section containing: (1) pseudocode for the deterministic verifier, (2) a precise description of how mandatory steps, ordering constraints, and approval gates are encoded and enforced on action traces, and (3) the process by which the inline policy text was derived from the anonymized SOC's documented procedures. The complete verifier source, exact policy text, and all public evaluation artifacts are already released, permitting independent confirmation of the 466 count and recall preservation. We cannot release the original confidential internal SOC documents, but the inline text is a direct encoding of the mandatory requirements used in the study. revision: yes
-
Referee: [Evaluation] Evaluation section: the stability of aggregate rates across three reruns is asserted for the fixed corpus, yet no description is given of how the verifier handles ordering or approval gates in the action trace, nor are error bars or per-incident breakdowns supplied. This leaves the load-bearing quantitative claim dependent on an unreviewed implementation.
Authors: We will expand the Evaluation section to describe the verifier's handling of ordering and approval gates (cross-referenced to the new pseudocode subsection). Because the verifier is deterministic and the corpus of plans is fixed, aggregate rates are identical across the three reruns; we will state this explicitly. We will also add per-incident compliance breakdowns (as a table or supplementary figure) and note that error bars are inapplicable for a deterministic verifier on fixed input. These changes will make the stability claim fully reviewable from the manuscript itself while the artifact continues to support full re-execution. revision: yes
Circularity Check
No significant circularity; empirical evaluation uses external incidents and deterministic verification.
full rationale
The paper reports an empirical evaluation on 200 real incidents from an anonymized production SOC, comparing LLM-generated plans against paired analyst-authored references using a fixed, deterministic verifier and inline policy text. No equations, derivations, or fitted parameters are described that would reduce the reported compliance counts (e.g., removal of 466 actions) or recall metrics to the inputs by construction. The central claims rest on direct application of the verifier to external data and references, with the artifact released for independent rederivation. This matches the default expectation of no circularity for papers grounded in external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The fixed policy rules and action catalog accurately represent all mandatory compliance constraints for the evaluated incidents.
invented entities (1)
-
SOCpilot verifier
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Matched and mismatched SOCs: A qualitative study on security operations center issues,
F. B. Kokulu, A. Soneji, T. Bao, Y. Shoshitaishvili, Z. Zhao, A. Doupé, and G.-J. Ahn, “Matched and mismatched SOCs: A qualitative study on security operations center issues,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery, 2019, pp. 1955–1970.
-
[2]
99% false positives: A qualitative study of SOC analysts’ perspectives on security alarms,
B. A. AlAhmadi, L. Axon, and I. Martinovic, “99% false positives: A qualitative study of SOC analysts’ perspectives on security alarms,” in USENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2022, pp. 2783–2800. [Online]. Available: https://www.usenix.org/conference/usenixsecurity22/presentation/alahmadi
2022
-
[3]
Alert alchemy: SOC workflows and decisions in the management of NIDS rules,
M. Vermeer, N. Kadenko, M. van Eeten, C. Gañán, and S. Parkin, “Alert alchemy: SOC workflows and decisions in the management of NIDS rules,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery, 2023, pp. 2770–2784. [Online]. Available: https://doi.org/10.1145/3576915.3616581
-
[4]
Lessons lost: Incident response in the age of cyber insurance and breach attorneys,
D. W. Woods, R. Böhme, J. Wolff, and D. Schwarcz, “Lessons lost: Incident response in the age of cyber insurance and breach attorneys,” in USENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2023, pp. 2259–2273. [Online]. Available: https://www.usenix.org/conference/usenixsecurity23/presentation/woods
2023
-
[5]
Do you play it by the books? a study on incident response playbooks and influencing factors,
D. Schlette, P. Empl, M. Caselli, T. Schreck, and G. Pernul, “Do you play it by the books? A study on incident response playbooks and influencing factors,” in IEEE Symposium on Security and Privacy (S&P). Los Alamitos, CA, USA: IEEE, 2024, pp. 3625–3643. [Online]. Available: https://doi.org/10.1109/SP54263.2024.00060
-
[6]
Pentestgpt: Evaluating and harnessing large language models for automated penetration testing,
G. Deng, Y. Liu, V. M. Vilches, P. Liu, Y. Li, Y. Xu, M. Pinzger, S. Rass, T. Zhang, and Y. Liu, “Pentestgpt: Evaluating and harnessing large language models for automated penetration testing,” in USENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2024. [Online]. Available: https://www.usenix.org/conference/usenixsecurity24/pre...
2024
-
[7]
Yurascanner: Leveraging llms for task-driven web app scanning,
A. Stafeev, T. Recktenwald, G. D. Stefano, S. Khodayari, and G. Pellegrino, “Yurascanner: Leveraging llms for task-driven web app scanning,” in Proceedings of the Network and Distributed System Security Symposium (NDSS). Reston, VA, USA: The Internet Society, 2025, pp. 1–16. [Online]. Available: https://www.ndss-symposium.org/ndss-paper/yurascanner-leve...
2025
-
[8]
The digital cybersecurity expert: How far have we come?
D. Wang, G. Zhou, X. Li, Y. Bai, L. Chen, T. Qin, J. Sun, and D. Li, “The digital cybersecurity expert: How far have we come?” in IEEE Symposium on Security and Privacy (S&P). Los Alamitos, CA, USA: IEEE, 2025, pp. 3273–3290. [Online]. Available: https://doi.org/10.1109/SP61157.2025.00198
-
[9]
Progent: Securing AI Agents with Privilege Control
T. Shi, J. He, Z. Wang, H. Li, L. Wu, W. Guo, and D. Song, “Progent: Programmable privilege control for llm agents,” arXiv preprint arXiv:2504.11703, 2025
-
[10]
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
H. Wang, C. M. Poskitt, and J. Sun, “Agentspec: Customizable runtime enforcement for safe and reliable llm agents,” 2025. [Online]. Available: https://arxiv.org/abs/2503.18666
-
[11]
Statistical Power Analysis for the Behavioral Sciences, 2nd ed.,
J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988
1988
-
[12]
Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,
K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection,” in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec). New York, NY, USA: Association for Computing Machinery, 2023, pp. 79–90.
-
[13]
I know what you asked: Prompt leakage via kv-cache sharing in multi-tenant llm serving,
G. Wu, Z. Zhang, Y. Zhang, W. Wang, J. Niu, Y. Wu, and Y. Zhang, “I know what you asked: Prompt leakage via kv-cache sharing in multi-tenant llm serving,” in Proceedings of the Network and Distributed System Security Symposium (NDSS). Reston, VA, USA: The Internet Society, 2025. [Online]. Available: https://www.ndss-symposium.org/wp-content/uploads/2...
2025
-
[14]
Prompt inversion attack against collaborative inference of large language models,
W. Qu, Y. Zhou, Y. Wu, T. Xiao, B. Yuan, Y. Li, and J. Zhang, “Prompt inversion attack against collaborative inference of large language models,” in IEEE Symposium on Security and Privacy (S&P). Los Alamitos, CA, USA: IEEE, 2025, pp. 1695–1712. [Online]. Available: https://doi.org/10.1109/SP61157.2025.00160
-
[15]
The philosopher’s stone: Trojaning plugins of large language models,
T. Dong, M. Xue, G. Chen, R. Holland, Y. Meng, S. Li, Z. Liu, and H. Zhu, “The philosopher’s stone: Trojaning plugins of large language models,” in Proceedings of the Network and Distributed System Security Symposium (NDSS). Reston, VA, USA: The Internet Society, 2025. [Online]. Available: https://www.ndss-symposium.org/ndss-paper/the-philosophers-stone...
2025
-
[16]
Isolategpt: An execution isolation architecture for llm-based agentic systems,
Y. Wu, F. Roesner, T. Kohno, N. Zhang, and U. Iqbal, “Isolategpt: An execution isolation architecture for llm-based agentic systems,” in Proceedings of the Network and Distributed System Security Symposium (NDSS). Reston, VA, USA: The Internet Society, 2025, pp. 1–20. [Online]. Available: https://www.ndss-symposium.org/wp-content/uploads/2025-1131-paper.pdf
2025
-
[17]
Enforceable security policies,
F. B. Schneider, “Enforceable security policies,” ACM Transactions on Information and System Security (TISSEC), vol. 3, no. 1, pp. 30–50, 2000. [Online]. Available: https://doi.org/10.1145/353323.353382
-
[19]
Edit automata: Enforcement mechanisms for run-time security policies,
J. Ligatti, L. Bauer, and D. Walker, “Edit automata: Enforcement mechanisms for run-time security policies,” International Journal of Information Security, vol. 4, no. 1, pp. 2–16, 2005. [Online]. Available: https://doi.org/10.1007/s10207-004-0046-8
-
[20]
Kinetic: Verifiable dynamic network control,
H. Kim, J. Reich, A. Gupta, M. Shahbaz, N. Feamster, and R. Clark, “Kinetic: Verifiable dynamic network control,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI). Berkeley, CA, USA: USENIX Association, 2015, pp. 59–72. [Online]. Available: https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/kim
2015
-
[21]
Psi: Precise security instrumentation for enterprise networks,
T. Yu, S. K. Fayaz, M. Collins, V. Sekar, and S. Seshan, “Psi: Precise security instrumentation for enterprise networks,” in Proceedings of the Network and Distributed System Security Symposium (NDSS). Reston, VA, USA: The Internet Society, 2017, pp. 1–15. [Online]. Available: https://www.ndss-symposium.org/ndss2017/ndss-2017-programme/psi-precise-secur...
2017
-
[22]
Sok: Towards a unified approach to applied replicability for computer security,
D. Olszewski, T. Tucker, K. R. B. Butler, and P. Traynor, “Sok: Towards a unified approach to applied replicability for computer security,” in USENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2025, pp. 469–488. [Online]. Available: https://www.usenix.org/conference/usenixsecurity25/presentation/olszewski
2025
-
[23]
Probable inference, the law of succession, and statistical inference,
E. B. Wilson, “Probable inference, the law of succession, and statistical inference,” Journal of the American Statistical Association, vol. 22, no. 158, pp. 209–212, 1927. [Online]. Available: https://doi.org/10.1080/01621459.1927.10502953
-
[24]
Note on the sampling error of the difference between correlated proportions or percentages,
Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,” Psychometrika, vol. 12, no. 2, pp. 153–157, 1947. [Online]. Available: https://doi.org/10.1007/BF02295996
-
[25]
A simple sequentially rejective multiple test procedure,
S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979. [Online]. Available: https://www.jstor.org/stable/4615733
-
[26]
Work-from-home and covid-19: Trajectories of endpoint security management in a security operations center,
K. R. Jones, D. A. Brucker-Hahn, B. Fidler, and A. G. Bardas, “Work-from-home and covid-19: Trajectories of endpoint security management in a security operations center,” in USENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2023, pp. 2293–2310. [Online]. Available: https://www.usenix.org/conference/usenixsecurity23/presentation/jones
2023
-
[27]
Ruling the rules: Quantifying the evolution of rulesets, alerts and incidents in network intrusion detection,
M. Vermeer, M. van Eeten, and C. Gañán, “Ruling the rules: Quantifying the evolution of rulesets, alerts and incidents in network intrusion detection,” in Proceedings of the ACM Asia Conference on Computer and Communications Security (AsiaCCS). New York, NY, USA: Association for Computing Machinery, 2022, pp. 799–814. [Online]. Available: https://doi.o...
-
[28]
True attacks, attack attempts, or benign triggers? an empirical measurement of network alerts in a security operations center,
L. Yang, Z. Chen, C. Wang, Z. Zhang, S. Booma, P. Cao, C. Adam, A. Withers, Z. Kalbarczyk, R. K. Iyer, and G. Wang, “True attacks, attack attempts, or benign triggers? An empirical measurement of network alerts in a security operations center,” in USENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2024, pp. 1525–1542. [Online]. Available: ht...
2024
-
[29]
Provg-searcher: A graph representation learning approach for efficient provenance graph search,
E. Altinisik, F. Deniz, and H. T. Sencar, “Provg-searcher: A graph representation learning approach for efficient provenance graph search,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery, 2023, pp. 2247–2261. [Online]. Available: https://doi.org/10.1145/3576915.3623187
-
[30]
Tapas: An efficient online apt detection with task-guided process provenance graph segmentation and analysis,
B. Zhang, Y. Gao, C. Yu, B. Kuang, Z. Zhang, H. Kim, and A. Fu, “Tapas: An efficient online apt detection with task-guided process provenance graph segmentation and analysis,” in USENIX Security Symposium. Berkeley, CA, USA: USENIX Association, 2025, pp. 607–624. [Online]. Available: https://www.usenix.org/conference/usenixsecurity25/presentation/zhang-bo-tapas
2025
-
[31]
Are we there yet? An industrial viewpoint on provenance-based endpoint detection and response tools,
F. Dong, S. Li, P. Jiang, D. Li, H. Wang, L. Huang, X. Xiao, J. Chen, X. Luo, Y. Guo, and X. Chen, “Are we there yet? An industrial viewpoint on provenance-based endpoint detection and response tools,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery, 2023, pp. 239...
-
[32]
Ocr-apt: Reconstructing apt stories from audit logs using subgraph anomaly detection and llms,
A. Aly, E. Mansour, and A. M. Youssef, “Ocr-apt: Reconstructing apt stories from audit logs using subgraph anomaly detection and llms,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery, 2025, pp. 261–275. [Online]. Available: https://doi.org/10.1145/3719027.3765219
-
[33]
Raconteur: A knowledgeable, insightful, and portable llm-powered shell command explainer,
J. Deng, X. Li, Y. Chen, Y. Bai, H. Weng, Y. Liu, T. Wei, and W. Xu, “Raconteur: A knowledgeable, insightful, and portable llm-powered shell command explainer,” in Proceedings of the Network and Distributed System Security Symposium (NDSS). Reston, VA, USA: The Internet Society, 2025, pp. 1–18. [Online]. Available: https://www.ndss-symposium.org/ndss...
2025
-
[34]
Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models,
M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y. Chen, D. Kapil, D. Molnar, S. Whitman, and J. Saxe, “Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2404.13161
-
[35]
Towards enforcing company policy adherence in agentic workflows,
N. Zwerdling, D. Boaz, D. Amid, E. Rabinovich, A. Anaby-Tavor, and G. Uziel, “Towards enforcing company policy adherence in agentic workflows,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2025. [Online]. Available: https://aclanthology.org/2025.emnlp-industry.41/
2025
-
[36]
Formal Policy Enforcement for Real-World Agentic Systems
N. Palumbo, S. Choudhary, J. Choi, P. Chalasani, and S. Jha, “Policy compiler for secure agentic systems,” 2026. [Online]. Available: https://arxiv.org/abs/2602.16708
-
[37]
Veriguard: Enhancing llm agent safety via verified code generation,
L. Miculicich, M. Parmar, H. Palangi, K. D. Dvijotham, M. Montanari, T. Pfister, and L. T. Le, “Veriguard: Enhancing llm agent safety via verified code generation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.05156
-
[38]
Shieldagent: Shielding agents via verifiable safety policy reasoning
Z. Chen, M. Kang, and B. Li, “Shieldagent: Shielding agents via verifiable safety policy reasoning,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22738
-
[39]
Evaluating implicit regulatory compliance in llm tool invocation via logic-guided synthesis,
D. Song, Y. Huang, B. Chen, T. Cong, R. Goebel, L. Ma, and F. Khomh, “Evaluating implicit regulatory compliance in llm tool invocation via logic-guided synthesis,” 2026. [Online]. Available: https://arxiv.org/abs/2601.08196
-
[40]
Cacao security playbooks version 2.0,
B. Jordan and A. Thomson, “Cacao security playbooks version 2.0,” OASIS Committee Specification 01, Nov. 2023, OASIS Open. [Online]. Available: https://docs.oasis-open.org/cacao/security-playbooks/v2.0/cs01/security-playbooks-v2.0-cs01.html
2023
discussion (0)