Agent-Native Immune System: Architecture, Taxonomy, and Engineering

Bo Shen; Dehui Li; Feng Shi; Lifeng Chang; Peijie Gao; Shiyi Kuang; Tianyuan Wei; Xin Chang; Yichen Han; Yunpeng Li

arXiv: 2606.28270 · v1 · pith:CPYCA4G4new · submitted 2026-06-26 · cs.AI · cs.MA

Pith reviewed 2026-06-29 03:48 UTC · model grok-4.3

keywords: Agent-Native Immune System, AI agents, runtime security, agent viruses, continual immune learning, model alignment, multi-agent systems, biologically inspired defense

The pith

The Agent-Native Immune System places biologically inspired defenses inside an AI agent's cognitive loop to handle runtime attacks that external measures miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

As autonomous agents gain persistent memory, tool-use protocols, and multi-agent collaboration, the threat landscape expands beyond what perimeter security or training-time alignment can address. The paper claims that these external approaches leave agents vulnerable to runtime hijacking through memory poisoning, tool-chain manipulation, or protocol attacks. It introduces the Agent-Native Immune System as an endogenous architecture embedded directly in the reasoning process. The system includes a six-layer Immune Tower, a taxonomy separating superficial defenses from parametric vaccines, and a Harness Triad that supports continual immune learning. It further distinguishes this runtime mechanism from static alignment by framing the former as dynamic enforcement.

Core claim

We introduce the Agent-Native Immune System (ANIS), the first biologically inspired, endogenous defense architecture embedded directly within the agent's cognitive loop, with a six-layer Immune Tower, a taxonomy of Agent Viruses and Agent Vaccines, the Harness Triad for Continual Immune Learning, and a demarcation that treats alignment as a static constitutional foundation while ANIS acts as runtime law enforcement.
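The paper gives no interfaces for the Immune Tower, so the following is only a minimal sketch of what an integer-indexed, six-layer structure could look like in code. Only the L0-L5 indexing and the name of L1 (Barrier Immunity) come from the abstract. The `TowerLayer` enum, the `Check` signature, the event dictionary, and the rule that checks run in ascending layer order are all assumptions made for illustration.

```python
from enum import IntEnum
from typing import Callable, Dict, Optional


class TowerLayer(IntEnum):
    """Integer-indexed layers L0-L5. Only L1's name comes from the paper."""
    L0 = 0
    L1_BARRIER_IMMUNITY = 1  # non-cognitive isolation layer named in the abstract
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


# Hypothetical check signature: takes an event, returns True to allow it.
Check = Callable[[dict], bool]


def first_rejecting_layer(
    event: dict, checks: Dict[TowerLayer, Check]
) -> Optional[TowerLayer]:
    """Run registered checks in ascending layer order (an assumption).

    Returns the first layer whose check rejects the event, or None if
    every registered check allows it.
    """
    for layer in sorted(checks):
        if not checks[layer](event):
            return layer
    return None


# Illustrative checks: an allowlist at L1 and a text filter at a higher layer.
example_checks: Dict[TowerLayer, Check] = {
    TowerLayer.L1_BARRIER_IMMUNITY: lambda e: e.get("tool") in {"search", "read"},
    TowerLayer.L3: lambda e: "ignore previous" not in e.get("text", ""),
}
```

Placing the allowlist at L1 reflects the abstract's description of that layer as non-cognitive: the check inspects only the tool name and never consults the model.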

What carries the argument

The Harness Triad of Meta, Self, and Auto components, which supplies the self-monitoring and meta-cognitive automation backbone for Continual Immune Learning.
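The paper supplies no algorithm for any part of the triad. As a rough illustration of the one step that is described (the Self-harness detects anomalies and triggers vaccine requests), the sketch below swaps in a technique from classic host-based intrusion detection: build a profile of short tool-call windows seen in benign runs, then flag any window not in that profile. The names `VaccineRequest`, `build_self_profile`, and `self_harness` are hypothetical, and what the Meta and Auto components do with a request is left out because the source does not say.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Set, Tuple

Window = Tuple[str, ...]


@dataclass(frozen=True)
class VaccineRequest:
    """Hypothetical message emitted when an unfamiliar window is seen."""
    step: int       # index of the first call in the anomalous window
    window: Window  # the tool-call sequence that was not in the profile


def build_self_profile(
    benign_traces: Iterable[Sequence[str]], n: int = 2
) -> Set[Window]:
    """Collect every length-n window of tool calls seen in benign runs."""
    profile: Set[Window] = set()
    for trace in benign_traces:
        for i in range(len(trace) - n + 1):
            profile.add(tuple(trace[i:i + n]))
    return profile


def self_harness(
    trace: Sequence[str], profile: Set[Window], n: int = 2
) -> List[VaccineRequest]:
    """Flag windows absent from the profile, one request per anomaly."""
    requests: List[VaccineRequest] = []
    for i in range(len(trace) - n + 1):
        window = tuple(trace[i:i + n])
        if window not in profile:
            requests.append(VaccineRequest(step=i, window=window))
    return requests
```

A profile built from `["search", "read", "summarize"]` accepts that sequence, while a trace that follows `read` with an unseen call such as `send_email` would produce a request for the window `("read", "send_email")`.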

If this is right

Vaccines adapt dynamically to novel threats through continual immune learning driven by the Harness Triad.
Agents receive protection against runtime attacks including memory poisoning, tool-chain manipulation, and multi-agent protocol exploits.
A clear separation holds between static training-time alignment and dynamic runtime immunity enforcement.
New evaluation metrics such as the Autoimmunity Rate become relevant for measuring false-positive interventions.
Immune protocol standardization and co-evolutionary dynamics between pathogens and vaccines emerge as open challenges in collective agent systems.
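The abstract defines the Autoimmunity Rate only as a "false-positive intervention rate" and gives no formula. One plausible reading, assumed here, is the share of benign actions on which the immune system intervened, that is FP / (FP + TN). The `Episode` record and its field names are hypothetical, and a different denominator (for example, all interventions) would give a different metric.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Episode:
    """One agent action with a ground-truth label (hypothetical record)."""
    was_attack: bool  # the action was part of an attack
    intervened: bool  # the immune layer blocked or altered the action


def autoimmunity_rate(episodes: Iterable[Episode]) -> float:
    """Fraction of benign actions on which the immune system intervened.

    Assumption: 'false-positive intervention rate' means FP / (FP + TN).
    """
    benign = [e for e in episodes if not e.was_attack]
    if not benign:
        raise ValueError("no benign episodes to measure against")
    false_positives = sum(1 for e in benign if e.intervened)
    return false_positives / len(benign)
```

Under this reading, interventions on real attacks do not affect the rate. Only interventions on benign actions count against the system.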

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embedding ANIS would require changes to how agent memory and tool interfaces are structured so that immune monitoring can operate inside the loop.
In multi-agent environments the system could create feedback loops where one agent's immune response influences another's threat exposure.
Practical deployment would need to test whether the non-cognitive Barrier Immunity layer at L1 actually isolates logical components from higher cognitive layers under attack.

Load-bearing premise

Current defense mechanisms such as perimeter security and training-time alignment remain external to the agent's active reasoning loop.

What would settle it

A controlled test in which an agent equipped with the full ANIS architecture, including the Immune Tower and Harness Triad, is successfully hijacked through memory poisoning or tool-chain manipulation.
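Such a test amounts to a search for counterexamples: a single successful hijack of a defended agent is enough. The sketch below shows the shape of that search under stated assumptions. The `AgentRun` signature, the payload strings, and the idea that a run reports its own hijack status as a boolean are all hypothetical. In practice, deciding whether an agent was hijacked is the hard part.

```python
from typing import Callable, Iterable, List

# Hypothetical signature: run the agent on one attack payload and
# return True if the agent was hijacked.
AgentRun = Callable[[str], bool]


def find_hijacks(run_agent: AgentRun, payloads: Iterable[str]) -> List[str]:
    """Return every payload that hijacks the agent.

    A non-empty result is a counterexample for the tested configuration.
    An empty result only shows that these payloads failed.
    """
    return [p for p in payloads if run_agent(p)]


# Toy stand-ins for real agents, for illustration only.
def undefended_agent(payload: str) -> bool:
    return "poison" in payload


def defended_agent(payload: str) -> bool:
    # Pretend the defense stops memory poisoning but misses one tool attack.
    return payload == "tool-chain:swap-endpoint"
```

The asymmetry matters for interpretation: a successful hijack settles the question for that configuration, while a clean run says nothing about payloads that were not tried.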

Figures

Figures reproduced from arXiv: 2606.28270 by Bo Shen, Dehui Li, Feng Shi, Lifeng Chang, Peijie Gao, Shiyi Kuang, Tianyuan Wei, Xin Chang, Yichen Han, Yunpeng Li.

**Figure 2.** Hierarchical taxonomy of Agent Viruses by attack surface and mechanism. Representative works

**Figure 3.** The six-layer integer-indexed Agent-Native Immune Tower.

**Figure 4.** The Harness Triad as a closed loop: Self-harness detects anomalies and triggers vaccine requests;

Original abstract

The transition from static chat bots to autonomous agents--equipped with persistent memory, tool-use protocols, and multi-agent collaboration--has fundamentally expanded the AI threat landscape. Current defense mechanisms, such as perimeter security and training-time alignment, remain external to the agent's active reasoning loop. Consequently, they fall short: a fully aligned agent remains highly vulnerable to runtime hijacking via memory poisoning, tool-chain manipulation, or multi-agent protocol attacks. To address this critical gap, we introduce the Agent-Native Immune System (ANIS), the first biologically inspired, endogenous defense architecture embedded directly within the agent's cognitive loop. Our framework presents four primary contributions. First, we design a six-layer Immune Tower (L0-L5), distinctly incorporating Barrier Immunity (L1) as a non-cognitive, physical-and-logical isolation layer. Second, we establish a unified taxonomy of Agent Viruses and Agent Vaccines, formalizing the critical distinction between superficial non-parametric defenses and robust parametric vaccines. Third, we conceptualize the Harness Triad--Meta, Self, and Auto--a self-monitoring, meta-cognitive automation backbone that drives Continual Immune Learning (CIL), enabling vaccines to dynamically adapt to novel threats. Finally, we establish a rigorous theoretical demarcation between model alignment and agent immunity: while alignment provides a static "constitutional" value foundation during training, ANIS serves as the dynamic "law enforcement" mechanism during runtime. We conclude by framing open challenges for the field, including immune protocol standardization, novel evaluation metrics such as the Autoimmunity Rate (false-positive intervention rate), and the co-evolutionary dynamics between pathogens and vaccines within collective intelligence ecosystems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a high-level conceptual sketch of an 'agent immune system' with new labels for layers and taxonomies but no mechanics, interfaces, or evidence.

The main thing here is a proposed architecture called the Agent-Native Immune System meant to sit inside autonomous agents for runtime protection against memory poisoning and tool attacks. It stays at the level of categories without showing how anything works.

What stands out as new are the specific pieces they name: the six-layer Immune Tower that includes a non-cognitive Barrier Immunity layer at L1, the Harness Triad of Meta/Self/Auto for continual learning, and the split between agent viruses and vaccines. They also separate alignment as a training-time value base from immunity as a runtime enforcement mechanism. The paper does a reasonable job naming the gap between static external defenses and the dynamic threats that hit agents with persistent memory and multi-agent protocols.

The soft spots are large and central. No interfaces, data flows, or update rules are given for how any layer would actually read or change the agent's memory, tool calls, or collaboration. The 'embedded directly within the cognitive loop' claim is asserted through the definitions rather than constructed or demonstrated. This leaves the framework self-referential and hard to distinguish from other robustness ideas already discussed in the literature.

This kind of work is for people who sketch big-picture safety taxonomies and bio-inspired analogies. Readers who want concrete mechanisms, pseudocode, or falsifiable claims will find little to use. It does not rise to the level where a serious referee should spend time on it.

I would recommend against sending it to peer review until the authors add at least detailed integration specs or a small worked example.

Referee Report

3 major / 1 minor

Summary. The paper claims to introduce the Agent-Native Immune System (ANIS) as the first biologically inspired endogenous defense architecture embedded in autonomous agents' cognitive loops. It presents a six-layer Immune Tower (L0-L5) with non-cognitive Barrier Immunity at L1, a taxonomy distinguishing Agent Viruses from parametric Agent Vaccines, the Harness Triad (Meta/Self/Auto) to drive Continual Immune Learning, and a demarcation of ANIS as runtime 'law enforcement' versus static training-time alignment. The work concludes with open challenges including immune protocol standardization and the Autoimmunity Rate metric.

Significance. If the architecture were concretely specified with integration mechanisms, data flows, and empirical validation showing superiority over external defenses against runtime attacks such as memory poisoning, it could meaningfully advance runtime security for tool-using and multi-agent systems. As a purely taxonomic proposal without such grounding, its significance is limited to suggesting a new conceptual framing rather than delivering a verifiable advance.

major comments (3)

[Abstract, first contribution] The claim that the six-layer Immune Tower is 'embedded directly within the agent's cognitive loop' as an endogenous runtime mechanism is not supported by any description of interfaces, observation points, or modification rules linking layers (including L1 Barrier Immunity) to the agent's persistent memory, tool invocations, or multi-agent protocols.
[Abstract, third contribution] The Harness Triad is asserted to drive Continual Immune Learning that enables dynamic vaccine adaptation, yet no algorithms, update rules, state representations, or data-flow diagrams are supplied for how Meta/Self/Auto components monitor or alter agent behavior at runtime.
[Abstract, final contribution] The 'rigorous theoretical demarcation' between model alignment (static constitutional foundation) and agent immunity (dynamic law enforcement) is stated without derivation, comparison to prior alignment or security literature, or formal criteria that would allow evaluation of the claimed distinction.

minor comments (1)

The manuscript would benefit from explicit section numbering and subsection headings to allow precise citation of the taxonomy and architecture details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our conceptual proposal for the Agent-Native Immune System. The comments accurately note that the manuscript focuses on high-level architecture and taxonomy rather than concrete implementations or empirical results. We will revise the paper to clarify these boundaries and expand explanatory elements where feasible.

Referee: [Abstract, first contribution] The claim that the six-layer Immune Tower is 'embedded directly within the agent's cognitive loop' as an endogenous runtime mechanism is not supported by any description of interfaces, observation points, or modification rules linking layers (including L1 Barrier Immunity) to the agent's persistent memory, tool invocations, or multi-agent protocols.

Authors: We agree that the current text presents the embedding at an architectural level without specifying interfaces or data flows. The manuscript's intent is to introduce the overall structure and the role of Barrier Immunity as a non-cognitive layer. In revision we will add a subsection describing candidate observation points and high-level integration patterns with memory, tools, and protocols, while explicitly noting that full protocol definitions remain future work.

Revision: yes

Referee: [Abstract, third contribution] The Harness Triad is asserted to drive Continual Immune Learning that enables dynamic vaccine adaptation, yet no algorithms, update rules, state representations, or data-flow diagrams are supplied for how Meta/Self/Auto components monitor or alter agent behavior at runtime.

Authors: The Harness Triad is offered as a conceptual meta-cognitive backbone rather than an implemented controller. No concrete algorithms appear because the contribution centers on identifying the three components and their collective role in Continual Immune Learning. We will incorporate a high-level data-flow diagram and pseudocode sketches in the revised manuscript to illustrate the intended monitoring and adaptation loops.

Revision: yes

Referee: [Abstract, final contribution] The 'rigorous theoretical demarcation' between model alignment (static constitutional foundation) and agent immunity (dynamic law enforcement) is stated without derivation, comparison to prior alignment or security literature, or formal criteria that would allow evaluation of the claimed distinction.

Authors: The demarcation is drawn from the timing distinction (training-time static values versus runtime dynamic enforcement) and is supported by brief references in the text. We accept that a more formal derivation and explicit comparison table would strengthen the claim. The revision will expand the related-work discussion, add citations to alignment and runtime-security literature, and include a side-by-side criteria table.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; conceptual taxonomy paper with no load-bearing derivations.

The manuscript is a high-level architecture proposal that defines new terms (ANIS, six-layer Immune Tower, Agent Viruses/Vaccines taxonomy, Harness Triad, Autoimmunity Rate) and states distinctions such as alignment versus immunity. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described contributions. No self-citations are invoked as external justification for uniqueness theorems or ansatzes. The central claims rest on definitional introduction rather than any reduction of outputs to inputs by construction, satisfying the default expectation of no circularity for such papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 4 invented entities

The proposal rests on several untested transfers from biology and security concepts without independent evidence or derivation; multiple new entities are introduced by definition.

axioms (2)

domain assumption: Biological immune systems supply a transferable model for designing runtime defenses in AI agents.
Invoked throughout the abstract as the basis for the entire architecture without justification of transferability.

domain assumption: Runtime hijacking threats cannot be adequately addressed by external or training-time methods.
Stated as the motivation for ANIS in the abstract.

invented entities (4)

Agent-Native Immune System (ANIS) · no independent evidence
purpose: Endogenous defense architecture inside agent cognitive loop
New system introduced without prior existence or validation.

Immune Tower (L0-L5) with Barrier Immunity (L1) · no independent evidence
purpose: Six-layer defense structure including non-cognitive isolation
Invented layered model presented as core contribution.

Harness Triad (Meta, Self, Auto) · no independent evidence
purpose: Self-monitoring meta-cognitive backbone for continual immune learning
New conceptual component for adaptation.

Agent Viruses and Agent Vaccines · no independent evidence
purpose: Taxonomy distinguishing superficial and parametric threats/defenses
New classification categories introduced by the paper.
