
arxiv: 2605.20896 · v1 · pith:4FU7PMUWnew · submitted 2026-05-20 · 💻 cs.CR · cs.AI · cs.LG

GenAI-Driven Threat Detection with Microsoft Security Copilot

Pith reviewed 2026-05-21 04:26 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords threat detection · autonomous agents · large language models · security incidents · malicious activity · cybersecurity · incident investigation · adaptive detection

The pith

An autonomous agent uses language-model planning to investigate security incidents and generate novel detections for hidden threats at production scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Dynamic Threat Detection Agent (DTDA), an always-on system that examines security incidents in Microsoft Defender by building a unified activity timeline and running a planner-executor loop. The agent forms attack-specific hypotheses, gathers supporting and refuting evidence, and fills gaps in existing alerts by producing new explainable detections with titles, severity ratings, and remediation steps. In a 120-day real-world deployment it reaches 80.1 percent precision according to customer feedback and surfaces novel alerts in roughly 15 percent of investigated incidents. Offline tests show it recovers hidden malicious activity with an F1 score of 0.78 using GPT-5.4, which is 0.12 above GPT-4.1 in the same framework and 0.26 above the baseline. These results indicate that such agents can shift defenders from constant manual rule updates toward continuous, adaptive investigation.
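The planner-executor loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Hypothesis` class, the `investigate` function, the `max_steps` budget, the toy planner and executor, and the "more supporting than refuting evidence" rule are all assumptions made for illustration. In a real system the planner and executor would be LLM calls and data queries.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    description: str
    supporting: list = field(default_factory=list)
    refuting: list = field(default_factory=list)


def investigate(timeline, plan, execute, max_steps=5):
    """Bounded loop: plan a hypothesis, gather evidence for and against it, repeat."""
    examined = []
    for _ in range(max_steps):  # hard budget keeps the investigation bounded
        hypothesis = plan(timeline, examined)
        if hypothesis is None:  # planner has nothing left to test
            break
        for evidence, supports in execute(hypothesis, timeline):
            bucket = hypothesis.supporting if supports else hypothesis.refuting
            bucket.append(evidence)
        examined.append(hypothesis)
    # Illustrative decision rule (an assumption): keep hypotheses with net support.
    return [h for h in examined if len(h.supporting) > len(h.refuting)]


def toy_plan(timeline, examined):
    """Hypothetical planner: walks a fixed list instead of calling a model."""
    candidates = ["credential theft", "lateral movement"]
    if len(examined) < len(candidates):
        return Hypothesis(candidates[len(examined)])
    return None


def toy_execute(hypothesis, timeline):
    """Hypothetical executor: keyword match; events marked benign count as refuting."""
    keyword = hypothesis.description.split()[0]
    for event in timeline:
        if keyword in event:
            yield event, "benign" not in event


timeline = ["credential dump on host A", "lateral rdp from host A, benign admin task"]
confirmed = investigate(timeline, toy_plan, toy_execute)
print([h.description for h in confirmed])  # prints ['credential theft']
```

Collecting refuting evidence alongside supporting evidence is what lets the second hypothesis be dropped here rather than turned into an alert.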

Core claim

DTDA combines a unified activity timeline spanning alerts, events, user behavior, and threat intelligence with versioned LLM prompt contracts, a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence, and dynamic alert generation that supplies context-relevant titles, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack descriptions. When integrated into Microsoft Security Copilot and run across tens of thousands of customers, the agent achieves 80.1 percent precision from customer feedback, produces novel alerts for approximately 15 percent of investigated incidents, and recovers hidden malicious activity with an F1 score of 0.78 in offline evaluation.
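One way to picture the unified activity timeline is as a time-ordered merge of per-source streams. The sketch below assumes each source is already sorted by timestamp; the `Activity` record, its fields, and `build_timeline` are hypothetical names, and the paper does not describe its data model at this level.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Activity:
    timestamp: datetime
    source: str  # e.g. "alert", "event", "ueba", "threat_intel"
    detail: str


def build_timeline(*streams):
    """Merge per-source streams, each already sorted by time, into one ordered list."""
    return list(heapq.merge(*streams, key=lambda activity: activity.timestamp))


alerts = [
    Activity(datetime(2026, 1, 1, 9, 0), "alert", "suspicious sign-in"),
    Activity(datetime(2026, 1, 1, 9, 30), "alert", "mailbox rule created"),
]
events = [Activity(datetime(2026, 1, 1, 9, 10), "event", "powershell launched")]

unified = build_timeline(alerts, events)
print([a.source for a in unified])  # prints ['alert', 'event', 'alert']
```

Interleaving sources in time order is what exposes activity that falls between two existing alerts, which is where an attack-story gap would show up.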

What carries the argument

The planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence, guided by versioned LLM prompt contracts with schema validation and fail-closed suppression.
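The contract idea (schema validation, grounding, bounded retries, fail-closed suppression) can be sketched as follows. The schema fields, the severity values, the `max_attempts` default, and the `call_model` callable are assumptions for illustration; the paper's actual contracts are not reproduced on this page.

```python
import json

# Hypothetical contract: field names and allowed values are illustrative only.
REQUIRED_FIELDS = {"title": str, "severity": str, "evidence_ids": list}
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def validate(raw, known_evidence_ids):
    """Return the parsed alert if it satisfies the contract, otherwise None."""
    try:
        alert = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(alert, dict):
        return None
    for name, kind in REQUIRED_FIELDS.items():
        if not isinstance(alert.get(name), kind):
            return None
    if alert["severity"] not in ALLOWED_SEVERITIES:
        return None
    # Grounding: every cited evidence id must exist in the investigated timeline.
    ids = alert["evidence_ids"]
    if not ids or not all(isinstance(i, str) and i in known_evidence_ids for i in ids):
        return None
    return alert


def generate_alert(call_model, known_evidence_ids, max_attempts=3):
    """Retry a bounded number of times; emit nothing if no attempt validates."""
    for attempt in range(max_attempts):
        alert = validate(call_model(attempt), known_evidence_ids)
        if alert is not None:
            return alert
    return None  # fail closed: suppress rather than emit an unvalidated alert


responses = [
    "not json",
    '{"title": "Token theft", "severity": "high", "evidence_ids": ["e1"]}',
]
alert = generate_alert(lambda attempt: responses[attempt], {"e1", "e2"})
suppressed = generate_alert(lambda attempt: "{}", {"e1"})
```

Here the first malformed response triggers one retry, and a model that never produces a valid, grounded alert results in `None` instead of an alert reaching a customer.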

If this is right

  • DTDA can run continuously at industry scale, processing single-incident investigations in a median of 28 minutes at a median token cost of USD 2.04 with a 0.38 percent job-level failure rate.
  • The agent reduces the need for defenders to maintain constantly updated detection logic by translating evolving attacker tradecraft into new alerts autonomously.
  • Offline results show measurable gains when moving from GPT-4.1 to GPT-5.4 within the same planner-executor framework, indicating that model improvements directly translate to better recovery of hidden activity.
  • Novel alerts appear in 15 percent of investigated incidents while maintaining 80.1 percent precision, suggesting the system surfaces activity that existing detectors miss.
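The offline figures are reported as one absolute score plus two deltas, so the comparison scores can be recovered by subtraction. The short calculation below uses only numbers quoted on this page; because the published values are rounded, the implied scores are approximate.

```python
f1_gpt54 = 0.78            # reported offline F1 with GPT-5.4
gain_over_gpt41 = 0.12     # reported improvement over GPT-4.1
gain_over_baseline = 0.26  # reported improvement over the baseline

f1_gpt41 = round(f1_gpt54 - gain_over_gpt41, 2)        # implied 0.66
f1_baseline = round(f1_gpt54 - gain_over_baseline, 2)  # implied 0.52

# Relative gains, as a percentage of the comparison score.
rel_gain_gpt41 = round(100 * gain_over_gpt41 / f1_gpt41, 1)           # about 18.2
rel_gain_baseline = round(100 * gain_over_baseline / f1_baseline, 1)  # 50.0

print(f1_gpt41, f1_baseline, rel_gain_gpt41, rel_gain_baseline)
```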

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The combination of prompt contracts and bounded retries offers a practical pattern for making large-language-model agents reliable enough for high-stakes operational use.
  • If the same architecture were applied to other security products beyond Microsoft Defender, it could create a consistent layer of adaptive investigation across endpoint, network, and cloud signals.
  • Over time the agent could accumulate a growing library of attack hypotheses that become reusable building blocks for future investigations.

Load-bearing premise

Customer feedback on alert precision serves as an unbiased proxy for true positive rate, and the agent loop does not systematically miss real threats or create inappropriate alerts across the full range of attack types.

What would settle it

A controlled test that injects a set of known attack scenarios into live customer environments and measures the fraction of those scenarios for which DTDA either detects the activity or correctly generates a new alert against ground-truth labels.
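Scoring such a test amounts to computing coverage against the injected ground truth, ideally broken down by attack type so that a blind spot in one category is not hidden by the average. The function name, the scenario ids, and the attack-type labels below are hypothetical.

```python
from collections import defaultdict


def coverage_by_type(scenarios, surfaced_ids):
    """scenarios: iterable of (scenario_id, attack_type).

    Returns {attack_type: fraction of injected scenarios that were surfaced},
    where "surfaced" means detected or covered by a newly generated alert.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for scenario_id, attack_type in scenarios:
        totals[attack_type] += 1
        hits[attack_type] += scenario_id in surfaced_ids
    return {attack_type: hits[attack_type] / totals[attack_type] for attack_type in totals}


injected = [
    ("s1", "phishing"),
    ("s2", "phishing"),
    ("s3", "ransomware"),
    ("s4", "ransomware"),
]
surfaced = {"s1", "s2", "s3"}
print(coverage_by_type(injected, surfaced))  # prints {'phishing': 1.0, 'ransomware': 0.5}
```

Unlike customer feedback, this measure counts misses directly, which is the quantity the load-bearing premise leaves unobserved.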

Figures

Figures reproduced from arXiv: 2605.20896 by Amir Gharib, Scott Freitas.

Figure 1. Overview of the DTDA architecture: an industry-scale framework for autonomous threat detection. DTDA builds incident-centered activity timelines from alerts, events, UEBA, and threat intel; runs a bounded planner-executor investigation to gather supporting and refuting evidence; and emits a dynamic alert when the investigation identifies novel malicious activity.
Figure 2. Overview of the bounded planner-executor inves…
Figure 3. Runtime scaling by number of incidents per job, line…
Original abstract

Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent integrated into Microsoft Security Copilot that investigates security incidents in Microsoft Defender. DTDA combines a unified activity timeline, versioned LLM prompt contracts with schema validation and fail-closed suppression, a planner-executor loop for hypothesis generation and evidence gathering, and dynamic alert generation including titles, severity, MITRE mappings, and remediation guidance. The paper reports a 120-day online evaluation across tens of thousands of customers yielding 80.1% precision from customer feedback with novel alerts on approximately 15% of investigated incidents, plus offline results of 0.78 F1 using GPT-5.4 (improving 0.12 F1 over GPT-4.1 and 0.26 F1 over baseline), and operational metrics of median 28-minute end-to-end investigations at median USD 2.04 token cost with 0.38% job failure rate.

Significance. If the reported results hold under rigorous independent validation, the work would demonstrate the viability of autonomous GenAI agents for continuous threat detection at production scale, moving security operations from reactive rule maintenance toward proactive discovery of hidden malicious activity. The large-scale deployment, specific numerical outcomes from both online customer feedback and offline F1 comparisons, and concrete operational metrics (time, cost, failure rate) provide practical evidence that could inform similar LLM-agent systems in security. The use of prompt contracts with grounding and bounded retries is a concrete engineering contribution.

major comments (2)
  1. [Evaluation / Online results] Online evaluation (120-day deployment results): The headline 80.1% precision and ~15% novel-alert rate are computed exclusively from customer feedback, yet the manuscript provides no details on feedback collection volume, response rates, analyst selection criteria, or any independent ground-truth validation. This is load-bearing for the central claim of recovering hidden malicious activity at scale, because customer feedback is an incomplete proxy that may undercount false negatives and introduce confirmation bias once an LLM-generated alert is presented.
  2. [Evaluation / Offline results] Offline evaluation: The abstract and results sections report 0.78 F1 with GPT-5.4 and specific improvements (0.12 over GPT-4.1, 0.26 over baseline) but supply no description of how the hidden-malicious-activity labels were obtained, how the evaluation set was sampled or constructed independently of the DTDA prompt contracts, or the exact implementation and features of the baseline. Without these, the numerical gains cannot be interpreted as evidence that the planner-executor loop systematically reduces undetected false negatives.
minor comments (2)
  1. [Results] Clarify the exact model versions referenced as GPT-5.4 and GPT-4.1 (internal releases, dates, or public equivalents) to allow reproducibility and comparison with external work.
  2. [Related Work] Add citations to prior academic work on LLM agents for security incident investigation and prompt-engineering techniques with schema validation to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the corresponding revisions to the manuscript.

  1. Referee: [Evaluation / Online results] Online evaluation (120-day deployment results): The headline 80.1% precision and ~15% novel-alert rate are computed exclusively from customer feedback, yet the manuscript provides no details on feedback collection volume, response rates, analyst selection criteria, or any independent ground-truth validation. This is load-bearing for the central claim of recovering hidden malicious activity at scale, because customer feedback is an incomplete proxy that may undercount false negatives and introduce confirmation bias once an LLM-generated alert is presented.

    Authors: We agree that greater transparency on the online evaluation would strengthen the paper. Customer feedback is gathered through the standard voluntary rating interface in Microsoft Defender and Security Copilot, where analysts explicitly mark generated alerts as true or false positives. Due to privacy regulations and the deployment scale across tens of thousands of customers, we cannot release granular statistics such as exact response volumes or selection criteria. We have added a high-level description of the feedback mechanism and a limitations paragraph acknowledging potential biases and the proxy nature of the metric. These changes appear in the revised Evaluation section. revision: partial

  2. Referee: [Evaluation / Offline results] Offline evaluation: The abstract and results sections report 0.78 F1 with GPT-5.4 and specific improvements (0.12 over GPT-4.1, 0.26 over baseline) but supply no description of how the hidden-malicious-activity labels were obtained, how the evaluation set was sampled or constructed independently of the DTDA prompt contracts, or the exact implementation and features of the baseline. Without these, the numerical gains cannot be interpreted as evidence that the planner-executor loop systematically reduces undetected false negatives.

    Authors: We appreciate this observation. The offline labels were produced by expert security analysts who reviewed complete incident timelines and external threat intelligence to identify activity missed by existing detections. The evaluation incidents were sampled from a disjoint time window and customer set that was not used during prompt-contract development. The baseline is a non-agentic LLM prompt performing direct classification without hypothesis generation or evidence collection. We have expanded Section 4.2 with these methodological details, including a summary of dataset construction and baseline features, to support interpretation of the reported gains. revision: yes
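The disjointness the authors describe (separate customers and a separate time window for evaluation) is straightforward to check mechanically. The sketch below assumes each incident is reduced to a `(customer_id, date)` pair; the function name and the toy data are hypothetical and are not taken from the paper.

```python
from datetime import date


def check_split(dev, evaluation):
    """dev / evaluation: non-empty lists of (customer_id, date) pairs.

    Reports customers present in both sets and whether the date ranges overlap.
    """
    shared = {c for c, _ in dev} & {c for c, _ in evaluation}
    dev_days = [d for _, d in dev]
    eval_days = [d for _, d in evaluation]
    windows_overlap = min(eval_days) <= max(dev_days) and min(dev_days) <= max(eval_days)
    return {"shared_customers": shared, "windows_overlap": windows_overlap}


dev_set = [("cust-a", date(2025, 9, 1)), ("cust-b", date(2025, 9, 20))]
eval_set = [("cust-c", date(2025, 11, 3)), ("cust-d", date(2025, 11, 15))]
leaky_set = [("cust-a", date(2025, 9, 10))]

clean_report = check_split(dev_set, eval_set)
leaky_report = check_split(dev_set, leaky_set)
```

An empty `shared_customers` set together with `windows_overlap` equal to `False` corresponds to the separation claimed in the rebuttal; the `leaky_set` example shows what a violation would look like.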

Circularity Check

0 steps flagged

No circularity: empirical results presented as direct measurements


The paper reports empirical outcomes from a 120-day online deployment (80.1% precision via customer feedback, ~15% novel alerts) and offline evaluation (0.78 F1 with GPT-5.4). These quantities are measured directly from system operation and labeled incident data rather than derived via equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce to prior fitted values or self-citations by construction; the central claims rest on external deployment metrics and comparisons that remain independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the premise that carefully engineered LLM prompts plus a unified data timeline are sufficient to produce reliable attack hypotheses and evidence gathering at scale; no free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: Large language models guided by versioned prompt contracts with schema validation and grounding requirements can generate accurate attack hypotheses and supporting evidence from security timelines.
    This assumption directly enables the planner-executor investigation loop and dynamic alert generation described in the abstract.
invented entities (1)
  • Dynamic Threat Detection Agent (DTDA) · no independent evidence
    purpose: Always-on adaptive agent that investigates incidents and generates explainable detections when attack-story gaps are found.
    The agent is the primary new system introduced by the paper.

pith-pipeline@v0.9.0 · 5851 in / 1616 out tokens · 69161 ms · 2026-05-21T04:26:11.906263+00:00 · methodology


