GenAI-Driven Threat Detection with Microsoft Security Copilot
Pith reviewed 2026-05-25 06:27 UTC · model grok-4.3
The pith
The Dynamic Threat Detection Agent (DTDA) identifies missed malicious activity with 80.1% precision using LLM investigations at production scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DTDA combines a unified activity timeline, versioned LLM prompt contracts with schema validation and bounded retries, a planner-executor investigation loop that generates attack-specific hypotheses and gathers evidence, and dynamic alert generation with titles, severities, MITRE mappings, and remediation guidance. When integrated into Microsoft Security Copilot and run continuously, it achieves 80.1% precision from customer feedback, produces novel alerts for approximately 15% of investigated incidents, recovers hidden malicious activity at 0.78 F1 with GPT-5.4, and processes single-incident investigations in a median of 28 minutes at a median token cost of USD 2.04 with a 0.38% job-level failure rate.
What carries the argument
The planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence, backed by versioned LLM prompt contracts for schema validation and fail-closed behavior.
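The contract mechanism named above (schema validation, bounded retries, fail-closed suppression) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the allowed severities, the retry budget, and the `model` callable are hypothetical and are not taken from the paper.

```python
import json
from typing import Callable, Optional

# Hypothetical contract: these fields and severities are illustrative only.
REQUIRED_FIELDS = {"title": str, "severity": str, "mitre_techniques": list}
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def validate(raw: str) -> Optional[dict]:
    """Return the parsed alert if it satisfies the contract, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    if data["severity"] not in ALLOWED_SEVERITIES:
        return None
    return data


def call_with_contract(model: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> Optional[dict]:
    """Query the model at most max_attempts times; suppress output on failure."""
    for _ in range(max_attempts):
        alert = validate(model(prompt))
        if alert is not None:
            return alert
    return None  # fail closed: no alert is emitted
```

In this sketch a response that never satisfies the schema yields `None` rather than a partially valid alert, which is one plausible reading of "fail-closed" in a setting where a wrong alert costs analyst time.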
If this is right
- Security teams receive context-relevant alerts with MITRE mappings and remediation steps without manually translating every new attacker technique.
- The system runs continuously across tens of thousands of customers while keeping median investigation time at 28 minutes and failure rate below 0.4%.
- Offline F1 improves by 0.12 when moving from GPT-4.1 to GPT-5.4, indicating gains from stronger models inside the same loop.
- Novel alerts appear for approximately 15% of investigated incidents, showing the loop can surface activity missed by existing detectors.
- Token cost stays at a median of USD 2.04 per incident, supporting repeated use at industry scale.
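The abstract reports only the GPT-5.4 score and two deltas, so the comparison scores have to be inferred. The short Python sketch below does that arithmetic under one stated assumption: that both the 0.12 and the 0.26 gains are measured relative to the GPT-5.4 result. The abstract's wording leaves room for the 0.26 gain to be measured differently, so treat the baseline value as tentative.

```python
# Figures quoted in the abstract.
F1_GPT_5_4 = 0.78
GAIN_OVER_GPT_4_1 = 0.12
GAIN_OVER_BASELINE = 0.26

# Assumption: both gains are measured against the GPT-5.4 score.
implied_gpt_4_1 = round(F1_GPT_5_4 - GAIN_OVER_GPT_4_1, 2)
implied_baseline = round(F1_GPT_5_4 - GAIN_OVER_BASELINE, 2)

print(f"implied GPT-4.1 F1:  {implied_gpt_4_1}")
print(f"implied baseline F1: {implied_baseline}")
```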
Where Pith is reading between the lines
- The same planner-executor structure with prompt contracts could be tested on other security platforms to check whether the reliability mechanisms transfer.
- If the 15% novel-alert rate holds across more environments, it would imply that current rule-based and analytics systems systematically leave detectable gaps.
- The bounded-retry and fail-closed design might serve as a practical template for deploying LLM agents in other regulated domains where errors carry high cost.
- Longer deployments could reveal whether the agent begins to surface patterns in tradecraft that human analysts have not yet documented.
Load-bearing premise
Customer feedback on generated alerts constitutes an unbiased measure of true precision.
What would settle it
An independent audit that labels a held-out set of incidents as malicious or benign through forensic review and then measures how many DTDA-generated alerts match those labels.
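As a rough illustration of how such an audit could be scored, the Python sketch below computes precision as the fraction of alerted incidents that the forensic review labelled malicious. The function name, the incident identifiers, and the labels are hypothetical; the paper does not describe an audit of this kind.

```python
from typing import Dict, Iterable


def precision_against_audit(alerted_ids: Iterable[str],
                            audit_labels: Dict[str, bool]) -> float:
    """Fraction of audited alerts whose incident was labelled malicious.

    Alerts on incidents outside the audited set are skipped, because the
    audit provides no ground truth for them.
    """
    judged = [audit_labels[i] for i in alerted_ids if i in audit_labels]
    if not judged:
        raise ValueError("no alert overlaps with the audited incidents")
    return sum(judged) / len(judged)


# Hypothetical audit: True means forensic review found malicious activity.
labels = {"inc-1": True, "inc-2": False, "inc-3": True, "inc-4": True}
alerts = ["inc-1", "inc-2", "inc-3", "inc-4", "inc-9"]  # inc-9 was not audited
audited_precision = precision_against_audit(alerts, labels)
```

Comparing a figure computed this way with the feedback-derived 80.1% would show directly whether customer feedback is a biased estimator.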
original abstract
Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Dynamic Threat Detection Agent (DTDA), an always-on LLM-based agent integrated into Microsoft Security Copilot that uses a unified activity timeline, versioned prompt contracts, a planner-executor investigation loop, and dynamic alert generation to investigate incidents in Microsoft Defender. It reports a 120-day online evaluation yielding 80.1% precision from customer feedback and novel alerts in ~15% of cases, plus offline results of 0.78 F1 (GPT-5.4) that improve 0.12 F1 over GPT-4.1 and 0.26 F1 over baseline, with median 28-minute investigations at $2.04 token cost and 0.38% failure rate.
Significance. If the evaluation protocols are rigorously documented and the performance numbers hold under independent scrutiny, the work provides concrete evidence that autonomous agents can surface missed threats at production scale across tens of thousands of customers. The architecture (prompt contracts + planner-executor loop) supplies a reproducible template that could influence future security tooling, and the operational metrics (latency, cost, failure rate) offer practical deployment insights. The absence of disclosed free parameters or circular derivations is a positive feature of the reported figures.
major comments (4)
- [Abstract] Abstract (online evaluation paragraph): The 80.1% precision figure is presented as a direct measurement from customer feedback, yet no protocol is supplied for feedback solicitation, response rate, stratification by incident severity/type, or bias controls. This renders the central online claim unverifiable and matches the weakest assumption identified in the review.
- [Abstract] Abstract (offline evaluation paragraph): The baseline comparator is undefined, and the text does not state that only the planner-executor loop and prompt contracts were varied while holding all other factors fixed. Consequently the 0.26 F1 attribution cannot be isolated and the 0.12 F1 GPT-5.4 vs. GPT-4.1 gain lacks a controlled experimental frame.
- [Abstract] Abstract (evaluation paragraphs): No error bars, confidence intervals, or statistical significance tests accompany the reported F1 scores, precision, or improvement deltas, preventing assessment of whether the observed gains exceed sampling variability.
- [Evaluation] Evaluation section: The manuscript must supply an explicit methods subsection detailing data collection, labeling, and filtering procedures for both the online customer-feedback pipeline and the offline hidden-activity recovery task so that the reported numbers can be reproduced or audited.
minor comments (2)
- [Abstract] Abstract: The model designations 'GPT-5.4' and 'GPT-4.1' should be clarified (internal aliases, fine-tunes, or release versions) to allow readers to interpret the reported deltas.
- [Abstract] Operational metrics: The median latency, cost, and failure rate are given without the underlying sample size or inter-quartile range, which would improve interpretability of the scale claims.
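The reporting asked for in the last minor comment is cheap to produce. The Python sketch below summarises a latency sample with its median, quartiles, and sample size using the standard library. The durations are invented for illustration and are not data from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize(samples: Sequence[float]) -> Tuple[float, float, float, int]:
    """Return (median, first quartile, third quartile, sample size)."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return statistics.median(samples), q1, q3, len(samples)


# Hypothetical investigation durations in minutes.
durations = [12, 20, 25, 28, 31, 40, 95]
median, q1, q3, n = summarize(durations)
print(f"median={median} min, IQR=[{q1}, {q3}], n={n}")
```

A single long-running investigation (95 minutes here) barely moves the median but widens the upper tail, which is why the inter-quartile range and sample size matter for interpreting a median reported alone.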
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important areas for improving the transparency of our evaluation protocols. We will revise the manuscript to address each point, adding the requested details on methods, controls, and statistical reporting while respecting data privacy constraints inherent to production customer data.
point-by-point responses
- Referee: [Abstract] Abstract (online evaluation paragraph): The 80.1% precision figure is presented as a direct measurement from customer feedback, yet no protocol is supplied for feedback solicitation, response rate, stratification by incident severity/type, or bias controls. This renders the central online claim unverifiable and matches the weakest assumption identified in the review.
Authors: We agree that the customer feedback protocol requires explicit documentation. In the revised manuscript we will add a dedicated subsection under Evaluation that describes the feedback solicitation process (including how customers are prompted within the Copilot interface), available response-rate statistics, stratification by incident severity and type where applicable, and steps taken to reduce selection bias. Due to customer privacy policies, certain granular details will necessarily remain at an aggregated level, but the high-level protocol will be fully specified. revision: yes
- Referee: [Abstract] Abstract (offline evaluation paragraph): The baseline comparator is undefined, and the text does not state that only the planner-executor loop and prompt contracts were varied while holding all other factors fixed. Consequently the 0.26 F1 attribution cannot be isolated and the 0.12 F1 GPT-5.4 vs. GPT-4.1 gain lacks a controlled experimental frame.
Authors: We acknowledge the need for clearer definition and experimental controls. The baseline is a static rule-based detector that does not employ the planner-executor loop or versioned prompt contracts. In the revision we will (1) explicitly name and describe the baseline in both the abstract and Evaluation section, and (2) state that all other components (timeline construction, alert schema, model temperature, etc.) were held constant across the GPT-5.4, GPT-4.1, and baseline conditions so that the reported F1 deltas isolate the contribution of the planner-executor architecture and prompt contracts. revision: yes
- Referee: [Abstract] Abstract (evaluation paragraphs): No error bars, confidence intervals, or statistical significance tests accompany the reported F1 scores, precision, or improvement deltas, preventing assessment of whether the observed gains exceed sampling variability.
Authors: We will add the requested statistical reporting. The revised Evaluation section will include 95% confidence intervals (via bootstrap resampling) for all F1 and precision figures, error bars on the reported deltas, and the results of paired statistical tests (McNemar or bootstrap significance tests) comparing the GPT-5.4 configuration against both GPT-4.1 and the baseline. These additions will allow readers to evaluate whether the observed improvements exceed sampling variability. revision: yes
- Referee: [Evaluation] Evaluation section: The manuscript must supply an explicit methods subsection detailing data collection, labeling, and filtering procedures for both the online customer-feedback pipeline and the offline hidden-activity recovery task so that the reported numbers can be reproduced or audited.
Authors: We concur that an explicit methods subsection is required. We will insert a new 'Evaluation Methods' subsection that details: (a) the online pipeline (incident sampling criteria, customer feedback collection mechanism, labeling of true/false positives, and any filtering for duplicate or low-quality incidents); and (b) the offline hidden-activity recovery task (construction of the ground-truth set, labeling protocol, temporal hold-out rules, and filtering steps). This will enable external reproduction or audit to the extent permitted by Microsoft’s data-access and privacy policies. revision: yes
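The bootstrap confidence intervals promised in the third response can be sketched in a few lines of Python. This is a generic percentile bootstrap over per-alert verdicts, not the authors' procedure. The verdict list, the resample count, and the seed are assumptions chosen for illustration.

```python
import random
from typing import Sequence, Tuple


def bootstrap_precision_ci(verdicts: Sequence[int], n_resamples: int = 2000,
                           alpha: float = 0.05,
                           seed: int = 0) -> Tuple[float, float, float]:
    """Percentile-bootstrap interval for precision.

    verdicts holds 1 for an alert judged a true positive and 0 otherwise.
    Returns (point estimate, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(verdicts)
    # Resample the verdicts with replacement and recompute precision each time.
    stats = sorted(
        sum(rng.choices(verdicts, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(verdicts) / n, lower, upper


# Hypothetical feedback: 80 alerts confirmed, 20 rejected.
verdicts = [1] * 80 + [0] * 20
point, lower, upper = bootstrap_precision_ci(verdicts)
print(f"precision={point:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

The interval only captures sampling variability among alerts that received feedback. It says nothing about which alerts customers chose to rate, so it complements the feedback protocol requested in the first comment rather than replacing it.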
Circularity Check
No circularity: results are direct empirical measurements
full rationale
The paper reports performance figures (80.1% precision from customer feedback, 0.78 F1 offline, 0.12/0.26 F1 gains) as outcomes of a 120-day deployment and separate offline runs. These quantities are obtained by running the described DTDA system on real incidents and comparing against baselines; they are not obtained by fitting parameters inside the same equations or by renaming fitted values as predictions. No equations, ansatzes, or uniqueness theorems appear in the abstract or described content. The central claims rest on external data collection (customer feedback, offline labeling) rather than internal self-definition or self-citation chains, satisfying the condition for a self-contained empirical result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based planner-executor loops can generate and validate attack hypotheses from a unified activity timeline without systematic bias or hallucination when prompt contracts are used.
invented entities (1)
- Dynamic Threat Detection Agent (DTDA): no independent evidence