
arxiv: 2605.20896 · v2 · pith:4FU7PMUWnew · submitted 2026-05-20 · 💻 cs.CR · cs.AI · cs.LG

GenAI-Driven Threat Detection with Microsoft Security Copilot

Pith reviewed 2026-05-25 06:27 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords threat detection · LLM agents · security automation · incident investigation · alert generation · Microsoft Defender · adversary tradecraft · production deployment

The pith

The Dynamic Threat Detection Agent (DTDA) identifies missed malicious activity with 80.1% precision using LLM investigations at production scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DTDA as an always-on agent that builds unified timelines from security data, applies versioned prompt contracts, and runs a planner-executor loop to form attack hypotheses and generate new alerts when story gaps appear. It reports that this system operates across tens of thousands of Microsoft Defender customers and surfaces novel alerts in roughly 15% of investigated incidents while maintaining 80.1% precision according to customer feedback over 120 days. Offline tests show it recovers hidden activity at 0.78 F1 with GPT-5.4, a gain of 0.12 over GPT-4.1 and 0.26 over the baseline. A sympathetic reader would care because the work claims that autonomous agents can shift defenders from constant manual updates of detection logic toward continuous, adaptive investigation at enterprise scale.

Core claim

DTDA combines a unified activity timeline, versioned LLM prompt contracts with schema validation and bounded retries, a planner-executor investigation loop that generates attack-specific hypotheses and gathers evidence, and dynamic alert generation with titles, severities, MITRE mappings, and remediation guidance. When integrated into Microsoft Security Copilot and run continuously, it achieves 80.1% precision from customer feedback, produces novel alerts for approximately 15% of investigated incidents, recovers hidden malicious activity at 0.78 F1 with GPT-5.4, and processes single-incident investigations in a median of 28 minutes at a median token cost of USD 2.04 with a 0.38% job-level failure rate.
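
To make the "prompt contract" idea concrete, here is a minimal sketch of schema validation with bounded retries and fail-closed suppression. It is an illustration under stated assumptions, not the paper's implementation: the `PromptContract` class, the field names, the allowed severities, and the `model` callable are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical schema: field names and severities are illustrative only.
REQUIRED_FIELDS = {"title": str, "severity": str, "mitre_techniques": list}
ALLOWED_SEVERITIES = {"low", "medium", "high"}


@dataclass(frozen=True)
class PromptContract:
    version: str            # versioned so each output can be traced to a schema
    max_attempts: int = 3   # bounded retries


def validate(raw: str) -> Optional[dict]:
    """Return the parsed payload if it satisfies the schema, else None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected_type):
            return None
    if payload["severity"] not in ALLOWED_SEVERITIES:
        return None
    return payload


def call_with_contract(contract: PromptContract,
                       model: Callable[[int], str]) -> Optional[dict]:
    """Call the model at most max_attempts times; return None if nothing validates."""
    for attempt in range(contract.max_attempts):
        payload = validate(model(attempt))
        if payload is not None:
            return {**payload, "contract_version": contract.version}
    # Fail closed: suppress the alert rather than emit unvalidated output.
    return None
```

In this sketch `model` stands in for an LLM call and receives only the attempt index; a caller that gets `None` back emits no alert.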

What carries the argument

The planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence, backed by versioned LLM prompt contracts for schema validation and fail-closed behavior.
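
The loop described above can be sketched roughly as follows. This is a toy illustration, not the authors' code: `Hypothesis`, `investigate`, the round limit, and the rule "keep hypotheses with supporting and no refuting evidence" are all assumptions, and the planner and executor are keyword-matching stand-ins for LLM-driven components.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Hypothesis:
    description: str
    supporting: List[str] = field(default_factory=list)
    refuting: List[str] = field(default_factory=list)


Planner = Callable[[List[str], List[Hypothesis]], List[Hypothesis]]
Executor = Callable[[Hypothesis, List[str]], Tuple[List[str], List[str]]]


def investigate(timeline: List[str], planner: Planner, executor: Executor,
                max_rounds: int = 3) -> List[Hypothesis]:
    """Bounded loop: plan hypotheses, gather evidence, stop when the planner is done."""
    hypotheses: List[Hypothesis] = []
    for _ in range(max_rounds):
        proposed = planner(timeline, hypotheses)
        if not proposed:
            break
        for hyp in proposed:
            supporting, refuting = executor(hyp, timeline)
            hyp.supporting.extend(supporting)
            hyp.refuting.extend(refuting)
        hypotheses.extend(proposed)
    # Assumed decision rule: keep hypotheses with support and no refutation.
    return [h for h in hypotheses if h.supporting and not h.refuting]


def toy_planner(timeline: List[str], prior: List[Hypothesis]) -> List[Hypothesis]:
    """Propose two fixed hypotheses once, then stop."""
    if prior:
        return []
    return [Hypothesis("lateral movement"), Hypothesis("data exfiltration")]


def toy_executor(hyp: Hypothesis, timeline: List[str]) -> Tuple[List[str], List[str]]:
    """Treat timeline events containing the first word of the hypothesis as support."""
    keyword = hyp.description.split()[0]
    supporting = [event for event in timeline if keyword in event]
    refuting = [] if supporting else ["no matching events in timeline"]
    return supporting, refuting
```

Calling `investigate(["lateral logon from host A to host B", "scheduled task created"], toy_planner, toy_executor)` keeps only the "lateral movement" hypothesis, because the other one collects refuting evidence and no support.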

If this is right

  • Security teams receive context-relevant alerts with MITRE mappings and remediation steps without manually translating every new attacker technique.
  • The system runs continuously across tens of thousands of customers while keeping median investigation time at 28 minutes and failure rate below 0.4%.
  • Offline F1 improves by 0.12 when moving from GPT-4.1 to GPT-5.4, indicating gains from stronger models inside the same loop.
  • Novel alerts appear in roughly 15% of investigated incidents, which suggests the loop can surface activity missed by existing detectors.
  • Token cost stays at a median of USD 2.04 per incident, supporting repeated use at industry scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same planner-executor structure with prompt contracts could be tested on other security platforms to check whether the reliability mechanisms transfer.
  • If the 15% novel-alert rate holds across more environments, it would imply that current rule-based and analytics systems systematically leave detectable gaps.
  • The bounded-retry and fail-closed design might serve as a practical template for deploying LLM agents in other regulated domains where errors carry high cost.
  • Longer deployments could reveal whether the agent begins to surface patterns in tradecraft that human analysts have not yet documented.

Load-bearing premise

Customer feedback on generated alerts constitutes an unbiased measure of true precision.

What would settle it

An independent audit that labels a held-out set of incidents as malicious or benign through forensic review and then measures how many DTDA-generated alerts match those labels.
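
A sketch of how such an audit could be scored, assuming forensic review yields one malicious/benign label per alerted incident. The function name and data layout are hypothetical, and the Wilson score interval is added here as one common way to express uncertainty; the paper does not specify an interval method.

```python
import math
from typing import Dict, List, Tuple


def audited_precision(alerted_ids: List[str],
                      forensic_labels: Dict[str, bool],
                      z: float = 1.96) -> Tuple[float, float, float]:
    """Return (precision, lower, upper) for alerts scored against audit labels.

    forensic_labels maps incident id -> True if the audit judged it malicious.
    The interval is a Wilson score interval; z=1.96 gives roughly 95% coverage.
    """
    n = len(alerted_ids)
    if n == 0:
        raise ValueError("no alerts to score")
    true_positives = sum(1 for incident in alerted_ids if forensic_labels[incident])
    p = true_positives / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half_width), min(1.0, center + half_width)
```

With five alerted incidents of which four are audited as malicious, the point estimate is 0.8 and the interval is wide, which shows why the size of the audited set matters as much as the headline number.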

Figures

Figures reproduced from arXiv: 2605.20896 by Amir Gharib, Scott Freitas.

Figure 1: Overview of the DTDA architecture: an industry-scale framework for autonomous threat detection. DTDA builds incident-centered activity timelines from alerts, events, UEBA, and threat intel; runs a bounded planner-executor investigation to gather supporting and refuting evidence; and emits a dynamic alert when the investigation identifies novel malicious activity.
Figure 2: Overview of the bounded planner-executor inves…
Figure 3: Runtime scaling by number of incidents per job, line…

Original abstract

Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces the Dynamic Threat Detection Agent (DTDA), an always-on LLM-based agent integrated into Microsoft Security Copilot that uses a unified activity timeline, versioned prompt contracts, a planner-executor investigation loop, and dynamic alert generation to investigate incidents in Microsoft Defender. It reports a 120-day online evaluation yielding 80.1% precision from customer feedback and novel alerts in ~15% of cases, plus offline results of 0.78 F1 (GPT-5.4) that improve 0.12 F1 over GPT-4.1 and 0.26 F1 over baseline, with median 28-minute investigations at $2.04 token cost and 0.38% failure rate.

Significance. If the evaluation protocols are rigorously documented and the performance numbers hold under independent scrutiny, the work provides concrete evidence that autonomous agents can surface missed threats at production scale across tens of thousands of customers. The architecture (prompt contracts + planner-executor loop) supplies a reproducible template that could influence future security tooling, and the operational metrics (latency, cost, failure rate) offer practical deployment insights. The absence of disclosed free parameters or circular derivations is a positive feature of the reported figures.

major comments (4)
  1. [Abstract] Abstract (online evaluation paragraph): The 80.1% precision figure is presented as a direct measurement from customer feedback, yet no protocol is supplied for feedback solicitation, response rate, stratification by incident severity/type, or bias controls. This renders the central online claim unverifiable and matches the weakest assumption identified in the review.
  2. [Abstract] Abstract (offline evaluation paragraph): The baseline comparator is undefined, and the text does not state that only the planner-executor loop and prompt contracts were varied while holding all other factors fixed. Consequently the 0.26 F1 attribution cannot be isolated and the 0.12 F1 GPT-5.4 vs. GPT-4.1 gain lacks a controlled experimental frame.
  3. [Abstract] Abstract (evaluation paragraphs): No error bars, confidence intervals, or statistical significance tests accompany the reported F1 scores, precision, or improvement deltas, preventing assessment of whether the observed gains exceed sampling variability.
  4. [Evaluation] Evaluation section: The manuscript must supply an explicit methods subsection detailing data collection, labeling, and filtering procedures for both the online customer-feedback pipeline and the offline hidden-activity recovery task so that the reported numbers can be reproduced or audited.
minor comments (2)
  1. [Abstract] Abstract: The model designations 'GPT-5.4' and 'GPT-4.1' should be clarified (internal aliases, fine-tunes, or release versions) to allow readers to interpret the reported deltas.
  2. [Abstract] Operational metrics: The median latency, cost, and failure rate are given without the underlying sample size or inter-quartile range, which would improve interpretability of the scale claims.
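
Minor comment 2 asks for sample size and inter-quartile range alongside each median. A minimal sketch of such a summary follows; the `summarize` helper is hypothetical and any values passed to it here are illustrative, not data from the paper.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Report sample size, median, and inter-quartile range for one metric."""
    if len(values) < 2:
        raise ValueError("need at least two observations to estimate quartiles")
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }
```

For example, `summarize([12, 20, 28, 35, 90])` on made-up investigation durations in minutes reports a median of 28 together with the spread that a bare median hides.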

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving the transparency of our evaluation protocols. We will revise the manuscript to address each point, adding the requested details on methods, controls, and statistical reporting while respecting data privacy constraints inherent to production customer data.

  1. Referee: [Abstract] Abstract (online evaluation paragraph): The 80.1% precision figure is presented as a direct measurement from customer feedback, yet no protocol is supplied for feedback solicitation, response rate, stratification by incident severity/type, or bias controls. This renders the central online claim unverifiable and matches the weakest assumption identified in the review.

    Authors: We agree that the customer feedback protocol requires explicit documentation. In the revised manuscript we will add a dedicated subsection under Evaluation that describes the feedback solicitation process (including how customers are prompted within the Copilot interface), available response-rate statistics, stratification by incident severity and type where applicable, and steps taken to reduce selection bias. Due to customer privacy policies, certain granular details will necessarily remain at an aggregated level, but the high-level protocol will be fully specified. revision: yes

  2. Referee: [Abstract] Abstract (offline evaluation paragraph): The baseline comparator is undefined, and the text does not state that only the planner-executor loop and prompt contracts were varied while holding all other factors fixed. Consequently the 0.26 F1 attribution cannot be isolated and the 0.12 F1 GPT-5.4 vs. GPT-4.1 gain lacks a controlled experimental frame.

    Authors: We acknowledge the need for clearer definition and experimental controls. The baseline is a static rule-based detector that does not employ the planner-executor loop or versioned prompt contracts. In the revision we will (1) explicitly name and describe the baseline in both the abstract and Evaluation section, and (2) state that all other components (timeline construction, alert schema, model temperature, etc.) were held constant across the GPT-5.4, GPT-4.1, and baseline conditions so that the reported F1 deltas isolate the contribution of the planner-executor architecture and prompt contracts. revision: yes

  3. Referee: [Abstract] Abstract (evaluation paragraphs): No error bars, confidence intervals, or statistical significance tests accompany the reported F1 scores, precision, or improvement deltas, preventing assessment of whether the observed gains exceed sampling variability.

    Authors: We will add the requested statistical reporting. The revised Evaluation section will include 95% confidence intervals (via bootstrap resampling) for all F1 and precision figures, error bars on the reported deltas, and the results of paired statistical tests (McNemar or bootstrap significance tests) comparing the GPT-5.4 configuration against both GPT-4.1 and the baseline. These additions will allow readers to evaluate whether the observed improvements exceed sampling variability. revision: yes

  4. Referee: [Evaluation] Evaluation section: The manuscript must supply an explicit methods subsection detailing data collection, labeling, and filtering procedures for both the online customer-feedback pipeline and the offline hidden-activity recovery task so that the reported numbers can be reproduced or audited.

    Authors: We concur that an explicit methods subsection is required. We will insert a new 'Evaluation Methods' subsection that details: (a) the online pipeline (incident sampling criteria, customer feedback collection mechanism, labeling of true/false positives, and any filtering for duplicate or low-quality incidents); and (b) the offline hidden-activity recovery task (construction of the ground-truth set, labeling protocol, temporal hold-out rules, and filtering steps). This will enable external reproduction or audit to the extent permitted by Microsoft’s data-access and privacy policies. revision: yes
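
The bootstrap confidence intervals promised in response 3 could be computed along these lines. This is a sketch of a standard percentile bootstrap, assuming per-incident binary labels and predictions; the function names, resample count, and seed are illustrative choices, and the paired significance tests mentioned in the response are not shown.

```python
import random
from typing import Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for binary labels (1 = malicious); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true: Sequence[int], y_pred: Sequence[int],
                    n_resamples: int = 2000, alpha: float = 0.05,
                    seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for F1, resampling incidents with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Resampling whole incidents keeps each label paired with its prediction, so the interval reflects sampling variability in the evaluation set rather than in either vector alone.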

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements

The paper reports performance figures (80.1% precision from customer feedback, 0.78 F1 offline, 0.12/0.26 F1 gains) as outcomes of a 120-day deployment and separate offline runs. These quantities are obtained by running the described DTDA system on real incidents and comparing against baselines; they are not obtained by fitting parameters inside the same equations or by renaming fitted values as predictions. No equations, ansatzes, or uniqueness theorems appear in the abstract or described content. The central claims rest on external data collection (customer feedback, offline labeling) rather than internal self-definition or self-citation chains, satisfying the condition for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the assumption that LLM-based planning and evidence gathering can be made reliable through prompt contracts and grounding, plus the empirical claim that customer feedback measures true precision. No explicit free parameters or invented physical entities are stated.

axioms (1)
  • domain assumption: LLM-based planner-executor loops can generate and validate attack hypotheses from a unified activity timeline without systematic bias or hallucination when prompt contracts are used.
    Invoked by the description of the investigation loop and versioned prompt contracts.
invented entities (1)
  • Dynamic Threat Detection Agent (DTDA) · no independent evidence
    purpose: Continuous adaptive investigation and dynamic alert generation inside Microsoft Defender.
    New named system introduced by the paper; no independent falsifiable evidence outside the reported evaluations is supplied.

pith-pipeline@v0.9.0 · 5851 in / 1604 out tokens · 33021 ms · 2026-05-25T06:27:57.927249+00:00 · methodology

