pith. machine review for the scientific record. sign in

arxiv: 2604.23338 · v2 · submitted 2026-04-25 · 💻 cs.CR · cs.LG

Recognition: unknown

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

Kexin Chu

Pith reviewed 2026-05-08 07:49 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords LLM agentssecurity threatsdefense mechanismsattack surfacesystematic surveylayered taxonomymulti-agent systemsbenchmark gaps
0
0 comments X

The pith

A seven-layer taxonomy shows that LLM agents have large regions of the attack surface with no defenses or benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based AI agents maintain memory, invoke tools, coordinate with other agents, and operate across sessions, so attacks can arise from architectural state and long-horizon interactions rather than single prompts alone. The paper introduces the Layered Attack Surface Model that splits the agent into seven layers from Foundation to Governance and adds a four-class temporality axis covering instantaneous through sub-session-stack threats. When the authors map 116 papers onto this grid, they find that research and defenses cluster in the lower layers while upper layers and threats that persist or accumulate remain almost empty. The resulting map also identifies attack regions that have documented threats but no defenses, and shows that existing benchmarks test none of the cross-session or propagating failure modes. This structure lets the field separate near-term engineering gaps from deeper research challenges.

Core claim

The authors introduce the Layered Attack Surface Model (LASM), a structural taxonomy that decomposes the agentic stack into seven layers—Foundation, Cognitive, Memory, Tool Execution, Multi-Agent Coordination, Ecosystem, and Governance—augmented by a four-class temporality axis of instantaneous, session-persistent, cross-session cumulative, and sub-session-stack threats. Applying the 7x4 framework to 116 papers from 2021-2026 produces a map that locates under-explored upper layers, especially for long-horizon and stack-propagating threats, multiple attack regions that have no corresponding defenses, and benchmarks that provide no coverage for cross-session or sub-session-stack failure modes.

What carries the argument

The Layered Attack Surface Model (LASM), a 7x4 grid that places each threat at the intersection of an architectural layer and a temporal scope to expose uncovered regions and missing defenses.

If this is right

  • Security work must expand into the upper layers of the agentic stack for long-horizon and stack-propagating threats.
  • New benchmarks are required that test cross-session and sub-session-stack failure modes.
  • Defense development is needed for the multiple documented attack regions that currently have none.
  • A dependency DAG separates near-term engineering gaps from fundamental research challenges.
  • Cross-layer defense recipes become feasible once threats are located within the grid.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could apply the released Agent Bill of Materials schema to audit deployed agents for missing upper-layer protections.
  • The framework implies that standards and regulations for autonomous agents will need explicit requirements for persistent state and coordination risks.
  • Future work could test whether the same 7x4 structure applies to hybrid human-AI teams or non-LLM agent architectures.
  • Repeated application of the model over time would show whether the coverage gaps narrow or widen as the field advances.

Load-bearing premise

The 116 selected papers form a representative sample of the literature and the seven-layer decomposition plus four-class temporality axis supplies a complete, non-overlapping partition of the agentic attack surface.

What would settle it

A follow-up survey that locates substantial existing work on upper-layer long-horizon defenses or an experiment that demonstrates threats routinely crossing the proposed layer boundaries in ways the taxonomy cannot classify.

Figures

Figures reproduced from arXiv: 2604.23338 by Kexin Chu.

Figure 1
Figure 1. Figure 1: Publication trend of surveyed papers by year. The view at source ↗
Figure 2
Figure 2. Figure 2: AI agent architecture mapped to LASM layers. The dashed red arrow marks principal trust inversion: tool outputs (environment inputs) flow back into the planning layer as if authoritative, despite being the least trusted principal. This structural flaw underlies most L4 attacks. ordered chain of principals whose instructions the agent is expected to follow. Definition: Principal Hierarchy A principal hierar… view at source ↗
Figure 3
Figure 3. Figure 3: Research coverage heatmap: LASM layers (y-axis) vs. attack temporality (x-axis). Cell color intensity is propor￾tional to the coded paper count from Table IV (opacity = count / 12 × 0.70); the count is printed in each cell for direct verification. A paper is counted in each (layer, temporality) cell it primarily addresses, so counts sum to more than 94. The dashed red box marks the under-studied region (L5… view at source ↗
read the original abstract

Agentic AI systems introduce a security surface that is qualitatively different from that of stateless LLMs. They persist memory, invoke external tools, coordinate with peer agents, and operate across sessions, allowing attacks to emerge not only at the prompt interface but also through architectural state, delegated authority, and long-horizon interactions. Existing security taxonomies, however, primarily organize threats by attack type, such as prompt injection or jailbreaking, and therefore obscure where in the agentic stack a threat arises and over what timescale it manifests. We propose the Layered Attack Surface Model (\lasm), a structural taxonomy for agentic AI security. \lasm decomposes the agentic stack into seven layers -- Foundation, Cognitive, Memory, Tool Execution, Multi-Agent Coordination, Ecosystem, and Governance -- and augments them with a four-class temporality axis covering instantaneous, session-persistent, cross-session cumulative, and sub-session-stack threats. We use this 7$\times$4 framework to analyze 116 papers from 2021--2026. The resulting map reveals that the upper layers of the agentic stack remain sharply under-explored, especially for long-horizon and stack-propagating threats; multiple documented attack regions have no corresponding defenses; and current benchmarks provide no coverage for cross-session or sub-session-stack failure modes. We further derive a cross-layer defense taxonomy, defense recipes for canonical attack classes, and a dependency DAG that separates near-term engineering gaps from fundamental research challenges. We release the per-paper coding, robustness scripts, and a reference Agent Bill of Materials schema to support reproducible analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the Layered Attack Surface Model (LASM), a 7-layer structural taxonomy (Foundation, Cognitive, Memory, Tool Execution, Multi-Agent Coordination, Ecosystem, Governance) augmented by a 4-class temporality axis (instantaneous, session-persistent, cross-session cumulative, sub-session-stack). It applies this 7×4 framework to map 116 papers (2021–2026) on LLM-based agent security threats and defenses, identifying sharp under-exploration of upper layers for long-horizon and propagating threats, defense gaps, and benchmark limitations. It also derives a cross-layer defense taxonomy, defense recipes, a dependency DAG, and releases per-paper coding, scripts, and an Agent Bill of Materials schema.

Significance. If the survey methodology proves robust, LASM offers a useful structural lens for agentic AI security that moves beyond attack-type taxonomies, with the released artifacts enabling reproducible gap analysis and future work. The identification of defense voids in multi-agent and governance layers for persistent threats could guide prioritized research if the underlying distribution is representative.

major comments (2)
  1. [Survey Design / Methods] §3 (or equivalent Survey Design section): No search strings, databases, inclusion/exclusion criteria, or inter-rater reliability statistics are provided for the selection and 7×4 classification of the 116 papers. This is load-bearing for the central gap claims, because the reported under-exploration of upper layers (especially long-horizon threats) could arise from keyword bias toward lower-layer attacks such as prompt injection rather than reflecting the true literature distribution.
  2. [Results / Gap Analysis] Results section (gap analysis): The headline finding that upper layers remain sharply under-explored for long-horizon threats is stated without a quantitative table or counts per layer-temporality cell, preventing assessment of whether the observed distribution is statistically meaningful or merely descriptive.
minor comments (2)
  1. [Introduction / Framework] The LASM acronym and layer names are introduced clearly in the abstract but would benefit from an early figure or table that explicitly shows the 7×4 grid with one canonical example per cell.
  2. [Contributions / Artifacts] The released coding artifacts are a strength, but the manuscript should include a brief description of the robustness scripts (e.g., what sensitivity checks they perform) in the main text rather than only in the repository.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive and detailed referee report. We appreciate the emphasis on methodological transparency and quantitative rigor, as these directly support the validity of our gap analysis. We address each major comment below and commit to revisions that strengthen the paper without altering its core contributions.

read point-by-point responses
  1. Referee: [Survey Design / Methods] §3 (or equivalent Survey Design section): No search strings, databases, inclusion/exclusion criteria, or inter-rater reliability statistics are provided for the selection and 7×4 classification of the 116 papers. This is load-bearing for the central gap claims, because the reported under-exploration of upper layers (especially long-horizon threats) could arise from keyword bias toward lower-layer attacks such as prompt injection rather than reflecting the true literature distribution.

    Authors: We acknowledge that the current manuscript does not provide explicit details on the literature search protocol, databases, inclusion/exclusion criteria, or inter-rater reliability in the Survey Design section. This is a genuine omission that could reasonably lead to concerns about selection bias, particularly whether lower-layer attacks (e.g., prompt injection) were over-sampled due to keyword choices. In the revised version, we will add a dedicated subsection to §3 that fully documents the systematic review methodology. This will include: the databases searched (arXiv, Google Scholar, IEEE Xplore, ACM Digital Library, and selected security venues); the precise search strings and Boolean combinations used (covering terms for LLM agents, threats, defenses, and each architectural layer); the date range and inclusion criteria (peer-reviewed or preprint papers from 2021–2026 focused on security threats or defenses in LLM-based agents); exclusion criteria (non-agentic systems, non-security papers, duplicates); and the classification procedure with inter-rater reliability statistics (e.g., independent coding of a 20% sample by two authors, reporting Cohen’s kappa). We will also explicitly discuss steps taken to ensure broad coverage across layers and temporality classes, thereby addressing potential keyword bias. These additions will make the gap claims reproducible and defensible. revision: yes

  2. Referee: [Results / Gap Analysis] Results section (gap analysis): The headline finding that upper layers remain sharply under-explored for long-horizon threats is stated without a quantitative table or counts per layer-temporality cell, preventing assessment of whether the observed distribution is statistically meaningful or merely descriptive.

    Authors: We agree that the absence of a quantitative breakdown limits readers’ ability to evaluate the strength of the under-exploration claim. While the manuscript maps all 116 papers to the 7×4 framework and describes the resulting distribution, it does not present cell-by-cell counts or supporting statistics. In the revised manuscript, we will insert a new table (and accompanying heatmap figure) in the Results section that reports the exact number of papers assigned to each layer-temporality cell, with separate columns for threat papers and defense papers. We will also add summary statistics (e.g., percentages per layer and per temporality class) and a short discussion of whether the observed skew toward lower layers and instantaneous threats is statistically notable. This will transform the gap analysis from purely descriptive to quantitatively grounded while preserving the original interpretation. revision: yes

Circularity Check

0 steps flagged

No circularity: taxonomy application is a direct mapping, not a reduction to inputs

full rationale

The paper proposes the LASM 7-layer plus 4-temporality taxonomy and applies it to classify 116 papers, producing a gap map as output. No equations, fitted parameters, predictions, or self-citations are invoked to derive the distribution or under-exploration claims; the map is the direct result of the manual assignment process described. The representativeness of the sample and reproducibility of coding are methodological assumptions external to any derivation chain, not internal reductions by construction. This matches the default expectation for a non-circular survey paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central contribution is the proposed taxonomy itself; the survey analysis rests on the domain assumption that the seven layers and four temporal classes form a meaningful and exhaustive decomposition of agentic attack surfaces.

axioms (2)
  • domain assumption Agentic AI systems can be decomposed into the seven layers Foundation, Cognitive, Memory, Tool Execution, Multi-Agent Coordination, Ecosystem, and Governance.
    This decomposition is introduced as the structural basis for the LASM framework.
  • domain assumption Security threats in agentic systems can be usefully classified along the four temporal scales instantaneous, session-persistent, cross-session cumulative, and sub-session-stack.
    The temporality axis is added to capture persistence and propagation behaviors not addressed by prior taxonomies.
invented entities (1)
  • Layered Attack Surface Model (LASM) no independent evidence
    purpose: To provide a structural taxonomy that locates threats by architectural layer and temporal scale.
    Newly proposed framework for organizing the survey.

pith-pipeline@v0.9.0 · 5591 in / 1570 out tokens · 58578 ms · 2026-05-08T07:49:36.911043+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

143 extracted references · 110 canonical work pages · 59 internal anchors

  1. [1]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettle- moyer, N. Cancedda, and T. Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools,”arXiv preprint arXiv:2302.04761, 2023

  2. [2]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qianet al., “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” inInternational Conference on Learning Representations (ICLR), 2024, arXiv:2307.16789

  3. [3]

    Gorilla: Large Language Model Connected with Massive APIs

    S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024, arXiv:2305.15334

  4. [4]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Y . Shen, K. Song, X. Tan, D. Li, W. Lu, and Y . Zhuang, “HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face,” in Advances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2303.17580

  5. [5]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing Reasoning and Acting in Language Models,”arXiv preprint arXiv:2210.03629, 2023

  6. [6]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An Open-Ended Embodied Agent with Large Language Models,”arXiv preprint arXiv:2305.16291, 2023

  7. [7]

    AutoGPT: An Autonomous GPT-4 Experiment,

    S. Gravitas, “AutoGPT: An Autonomous GPT-4 Experiment,”GitHub Repository, 2023, https://github.com/Significant-Gravitas/AutoGPT

  8. [8]

    A Survey on Large Language Model Based Autonomous Agents,

    L. Wang, C. Ma, X. Fenget al., “A Survey on Large Language Model Based Autonomous Agents,”Frontiers of Computer Science, vol. 18, 2024

  9. [9]

    The Rise and Potential of Large Language Model Based Agents: A Survey

    Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhouet al., “The rise and potential of large language model based agents: A survey,”arXiv preprint arXiv:2309.07864, 2023

  10. [10]

    Emergent Abilities of Large Language Models

    J. Wei, Y . Tay, R. Bommasaniet al., “Emergent abilities of large language models,”Transactions on Machine Learning Research, 2022, arXiv:2206.07682

  11. [11]

    WebGPT: Browser-assisted question-answering with human feedback

    R. Nakano, J. Hiltonet al., “WebGPT: Browser-assisted question- answering with human feedback,”arXiv preprint arXiv:2112.09332, 2022

  12. [12]

    10 Jonathan Hayase, Weihao Kong, Raghav Somani, and Sewoong Oh

    Z. Niu, H. Liang, F. Zhang, and X. Li, “Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast,”arXiv preprint arXiv:2402.08567, 2024

  13. [13]

    Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

    X. Houet al., “Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions,”ACM Transactions on Software Engineering and Methodology, 2025, arXiv:2503.23278

  14. [14]

    AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways,

    Z. Wanget al., “AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways,”ACM Computing Surveys, vol. 57, no. 7, pp. 182:1–182:38, 2025

  15. [15]

    Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

    A. Chhabra, S. Datta, S. K. Nahin, and P. Mohapatra, “Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges,”arXiv preprint arXiv:2510.23883, 2025

  16. [16]

    Security of LLM-based Agents Regarding Attacks, Defenses, and Applications: A Comprehensive Survey,

    Y . Tang, Y . Liu, J. Lan, Z. Yan, and E. Gelenbe, “Security of LLM-based Agents Regarding Attacks, Defenses, and Applications: A Comprehensive Survey,”Information Fusion, vol. 127, 2026

  17. [17]

    PRISMA 2020 Explanation and Elaboration: Updated Guidance and Exemplars for Reporting Systematic Reviews,

    M. J. Page, J. E. McKenzie, P. M. Bossuyt, I. Boutron, T. C. Hoffmann, C. D. Mulrowet al., “PRISMA 2020 Explanation and Elaboration: Updated Guidance and Exemplars for Reporting Systematic Reviews,” BMJ, vol. 372, p. n160, 2021

  18. [19]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, arXiv:1706.03762

  19. [20]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- training of deep bidirectional transformers for language understanding,” inProceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 4171–4186, arXiv:1810.04805

  20. [21]

    Language Models are Few-Shot Learners

    T. B. Brown, B. Mannet al., “Language models are few-shot learners,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020, arXiv:2005.14165

  21. [22]

    GPT-4 Technical Report

    OpenAI, “GPT-4 Technical Report,”arXiv preprint arXiv:2303.08774, 2023

  22. [23]

    Claude’s Model Card,

    Anthropic, “Claude’s Model Card,” 2023, https://www-files.anthropic. com/production/images/model-card-claude-2.pdf

  23. [24]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavrilet al., “LLaMA: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  24. [25]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martinet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

  25. [26]

    Christiano, J

    P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, arXiv:1706.03741

  26. [27]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin et al., “Training language models to follow instructions with human feedback,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022, arXiv:2203.02155

  27. [28]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,” inNeurIPS, 2022

  28. [29]

    Constitutional AI: Harmlessness from AI Feedback

    Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnonet al., “Constitutional AI: Harmlessness from AI Feedback,”arXiv preprint arXiv:2212.08073, 2022

  29. [30]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudsonet al., “On the opportunities and risks of foundation models,”arXiv preprint arXiv:2108.07258, 2021

  30. [31]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V . Le, and D. Zhou, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022, neurIPS 2022

  31. [32]

    Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework,

    C. Lam, J. Li, L. Zhang, and K. Zhao, “Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework,”arXiv preprint arXiv:2603.11768, 2025

  32. [33]

    Claude’s Model Specification,

    Anthropic, “Claude’s Model Specification,” https://anthropic.com/model- spec, 2024

  33. [34]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    E. Hubingeret al., “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training,”arXiv preprint arXiv:2401.05566, 2024

  34. [35]

    Malicious MCP Server on npm: postmark- mcp Harvests Emails via BCC Injection,

    Snyk Security Research, “Malicious MCP Server on npm: postmark- mcp Harvests Emails via BCC Injection,” Snyk Security Blog, 2025, published September 2025; package postmark-mcp v1.0.16; first docu- mented malicious MCP server in the wild, exfiltrating email content via BCC injection. URL: snyk.io/blog

  35. [36]

    Prompt Injection Attacks on Agentic Coding Assistants: A Systematic Analysis of Vulnerabilities in Skills, Tools, 21 and Protocol Ecosystems,

    N. Maloyanet al., “Prompt Injection Attacks on Agentic Coding Assistants: A Systematic Analysis of Vulnerabilities in Skills, Tools, 21 and Protocol Ecosystems,”International Journal of Open Information Technologies, 2025, arXiv:2601.17548

  36. [37]

    OWASP Top 10 for Large Language Model Applications 2025,

    OWASP Foundation, “OWASP Top 10 for Large Language Model Applications 2025,” https://owasp.org/www-project-top-10-for-large- language-model-applications/, 2024

  37. [38]

    MITRE ATLAS: Adversarial Threat Landscape for Artificial Intelligence Systems, v4.5,

    MITRE Corporation, “MITRE ATLAS: Adversarial Threat Landscape for Artificial Intelligence Systems, v4.5,” https://atlas.mitre.org, 2024

  38. [39]

    Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models

    W. Zou, R. Geng, B. Wang, and J. Jia, “PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models,”arXiv preprint arXiv:2402.07867, 2024

  39. [40]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,”arXiv preprint arXiv:2307.15043, 2023

  40. [41]

    Risks from learned optimization in advanced machine learning systems.arXiv preprint arXiv:1906.01820, 2019

    E. Hubinger, C. van Merwijk, V . Mikulik, J. Steen, and S. Garrabrant, “Risks from Learned Optimization in Advanced Machine Learning Systems,”arXiv preprint arXiv:1906.01820, 2019

  41. [42]

    Shadow alignment: The ease of subverting safely-aligned language models

    X. Yang, X. Wang, Q. Zhang, L. Petzold, W. Y . Wang, X. Zhao, and D. Lin, “Shadow alignment: The ease of subverting safely-aligned language models,”arXiv preprint arXiv:2310.02949, 2023

  42. [43]

    Poisoning language models during instruction tuning,

    A. Wan, E. Wallace, S. Shen, and D. Klein, “Poisoning language models during instruction tuning,” inProceedings of the 40th International Conference on Machine Learning (ICML), 2023, arXiv:2305.00944

  43. [44]

    A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty

    Z. Lin, C. Li, and K. Chen, “A Survey on the Security of Long- Term Memory in LLM Agents: Toward Mnemonic Sovereignty,”arXiv preprint arXiv:2604.16548, 2025

  44. [45]

    The effects of reward misspecification: Adversarial examples in reinforcement learning.arXiv preprint arXiv:2201.03544, 2022

    A. Pan, K. Bhatia, and J. Steinhardt, “The Effects of Reward Misspeci- fication: Mapping and Mitigating Misaligned Models,” inInternational Conference on Learning Representations (ICLR), 2022, iCLR 2022; arXiv:2201.03544

  45. [46]

    Interpreting Agentic Systems: Beyond Model Explana- tions to System-Level Accountability,

    J. Zhu, D. Gandhi, H. Joshi, A. R. Mianroodi, S. A. Kocak, and D. Ra- machandran, “Interpreting Agentic Systems: Beyond Model Explana- tions to System-Level Accountability,”arXiv preprint arXiv:2601.17168, 2025

  46. [47]

    Demonstrating specification gaming in reasoning models,

    A. Bondarenkoet al., “Demonstrating specification gaming in reasoning models,”arXiv preprint arXiv:2502.13295, 2025

  47. [48]

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases

    Z. Chen, Z. Xiang, C. Xiao, D. Song, and B. Li, “AgentPoison: Red- teaming LLM agents via poisoning memory or knowledge bases,” in Advances in Neural Information Processing Systems (NeurIPS), 2024, arXiv:2407.12784

  48. [49]

    HijackRAG: Hijacking attacks against retrieval-augmented large language models,

    Y . Zhang, Q. Deng, T. Zhao, Z. Li, X. Su, Z. Zhu, H. Zhang, and C. Zhuge, “HijackRAG: Hijacking attacks against retrieval-augmented large language models,”arXiv preprint arXiv:2410.22832, 2024

  49. [50]

    Defining and characterizing reward hacking.arXiv preprint arXiv:2209.13085, 2022

    J. Skalse, N. H. R. Howe, D. Krasheninnikov, and D. Krueger, “Defining and characterizing reward hacking,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022, arXiv:2209.13085

  50. [51]

    Colosseum: Auditing collusion in cooperative multi-agent systems, 2026

    M. Nakamura, A. Kumar, S. Das, S. Abdelnabi, S. Mahmud, F. Fioretto, S. Zilberstein, and E. Bagdasarian, “Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems,”arXiv preprint arXiv:2602.15198, 2025

  51. [52]

    arXiv preprint arXiv:2507.14928 , year=

    Y . Jo, M. Jeong, J. Im, and J. Kang, “Byzantine-robust decentralized coordination of LLM agents,”arXiv preprint arXiv:2507.14928, 2025

  52. [53]

    Large language model supply chain: A research agenda,

    J. Weng, Y . Yao, H. Chen, Y . Xie, Z. Li, P. Hu, and D. Gu, “Large language model supply chain: A research agenda,”arXiv preprint arXiv:2404.12736, 2024

  53. [54]

    Jailbreak and guard aligned language models with only few in-context demonstrations.arXiv preprint arXiv:2310.06387,

    J. Yi, J. Ye, Z. Chen, R. Xiong, M. Xu, J. Li, L. Ma, T. Gui, Q. Zhang, and X. Huang, “Jailbreak and guard aligned language models with only few in-context demonstrations,”arXiv preprint arXiv:2310.06387, 2023

  54. [55]

    Jailbroken: How Does LLM Safety Training Fail?

    A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How Does LLM Safety Training Fail?” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, neurIPS 2023

  55. [56]

    Intriguing properties of neural networks

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” inInternational Conference on Learning Representations (ICLR), 2014, arXiv:1312.6199

  56. [57]

    Explaining and Harnessing Adversarial Examples

    I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harnessing Adversarial Examples,”arXiv preprint arXiv:1412.6572, 2014

  57. [58]

    Available: https://arxiv.org/pdf/1712.03141.pdf 19 N

    B. Biggio and F. Roli, “Wild patterns: Ten years after the rise of adversarial machine learning,”Pattern Recognition, vol. 84, pp. 317– 331, 2018, arXiv:1712.03141

  58. [59]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    X. Liu, N. Xu, M. Chen, and C. Xiao, “AutoDAN: Generating stealthy jailbreak prompts on aligned large language models,” in International Conference on Learning Representations (ICLR), 2024, arXiv:2310.04451

  59. [60]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,” in International Conference on Learning Representations (ICLR), 2024, arXiv:2310.08419

  60. [61]

    Many-Shot Jailbreaking,

    C. Anil, E. Durmus, M. Sharma, J. Benton, E. Perez, R. Grosse, D. Duvenaudet al., “Many-Shot Jailbreaking,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024, anthropic Technical Report; also presented at NeurIPS 2024

  61. [62]

    Dissecting Adversarial Robustness of Multimodal LM Agents,

    C. Wuet al., “Dissecting Adversarial Robustness of Multimodal LM Agents,” inICLR 2025, 2025, arXiv:2406.12814

  62. [63]

    Stealing machine learning models via prediction APIs,

    F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing machine learning models via prediction APIs,” in25th USENIX Security Symposium (USENIX Security), 2016, arXiv:1609.02943

  63. [64]

    Membership inference attacks against machine learning models,

    R. Shokri, M. Stronati, C. Song, and V . Shmatikov, “Membership inference attacks against machine learning models,” inIEEE Symposium on Security and Privacy (S&P), 2017

  64. [65]

    Feder and Lee, Katherine and Jagielski, Matthew and Nasr, Milad and Conmy, Arthur and Yona, Itay and Wallace, Eric and Rolnick, David and Tramèr, Florian , month = jul, year =

    N. Carlini, D. Paleka, K. Dvijotham, T. Steinke, J. Hayase, A. F. Cooper, K. Lee, M. Jagielski, M. Nasr, A. Conmyet al., “Stealing part of a production language model,” inInternational Conference on Machine Learning (ICML), 2024, arXiv:2403.06634

  65. [66]

    Extracting training data from large language models

    N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel, “Extracting training data from large language models,” in30th USENIX Security Symposium, 2021, pp. 2633–2650, arXiv:2012.07805

  66. [67]

    Quantifying Memorization Across Neural Language Models

    N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, and C. Zhang, “Quantifying memorization across neural language models,” inIn- ternational Conference on Learning Representations (ICLR), 2023, arXiv:2202.07646

  67. [68]

    Feder Cooper, Daphne Ippolito, Christopher A

    M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee, “Scalable extraction of training data from (production) language models,”arXiv preprint arXiv:2311.17035, 2023

  68. [69]

    Detecting pretraining data from large language models.arXiv preprint arXiv:2310.16789,

    W. Shi, A. Ajith, M. Xia, Y . Huang, D. Liu, T. Blevins, D. Chen, and L. Zettlemoyer, “Detecting pretraining data from large language models,” inInternational Conference on Learning Representations (ICLR), 2024, arXiv:2310.16789

  69. [70]

    Model inversion attacks that exploit confidence information and basic countermeasures,

    M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), 2015

  70. [71]

    Ignore Previous Prompt: Attack Techniques For Language Models,

    F. Perez and I. Ribeiro, “Ignore Previous Prompt: Attack Techniques For Language Models,” inEMNLP 2022 Workshop on Trustworthy NLP, 2022

  71. [72]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct Preference Optimization: Your language model is secretly a reward model,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023, arXiv:2305.18290

  72. [73]

    Towards Deep Learning Models Resistant to Adversarial Attacks

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations (ICLR), 2018, arXiv:1706.06083

  73. [74]

    and Rosenfeld, Elan and Kolter, J

    J. M. Cohen, E. Rosenfeld, and J. Z. Kolter, “Certified adversarial robustness via randomized smoothing,” inInternational Conference on Machine Learning (ICML), 2019, arXiv:1902.02918

  74. [75]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y . Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa, “Llama Guard: LLM-based input-output safeguard for human-AI conversations,”arXiv preprint arXiv:2312.06674, 2023

  75. [76]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks, “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal,”arXiv preprint arXiv:2402.04249, 2024

  76. [77]

    PromptBench : Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

    K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y . Wang, L. Yang, W. Ye, N. Z. Gong, Y . Zhang, and X. Xie, “PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,”arXiv preprint arXiv:2306.04528, 2023

  77. [78]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y . Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,”arXiv preprint arXiv:2209.07858, 2022

  78. [79]

    Red Teaming Language Models with Language Models

    E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red teaming language models with language models,” inProceedings of the 2022 Conference on Em- pirical Methods in Natural Language Processing (EMNLP), 2022, arXiv:2202.03286

  79. [80]

    A watermark for large language models.arXiv preprint arXiv:2301.10226, 2023a

    J. Kirchenbauer, J. Geiping, Y . Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” inInternational Conference on Machine Learning (ICML), 2023, arXiv:2301.10226

  80. [81]

    Deep learning with differential privacy,

    M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” 22 inProceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016

Showing first 80 references.