
arxiv: 2509.06572 · v5 · submitted 2025-09-08 · 💻 cs.CR

Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem

Pith reviewed 2026-05-18 18:39 UTC · model grok-4.3

classification 💻 cs.CR
keywords: Model Context Protocol · LLM security · toolchain attacks · privacy leakage · parasitic attacks · MCP-UPD · tool poisoning · prompt injection

The pith

MCP toolchains can be hijacked to leak private data by embedding malicious instructions in external sources accessed by LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper identifies a systematic privacy-leakage attack pattern in the Model Context Protocol (MCP), called Parasitic Toolchain Attacks and instantiated as MCP Unintended Privacy Disclosure (MCP-UPD). Adversaries embed malicious instructions into external data sources that LLMs access during legitimate tasks; because MCP lacks isolation and privilege controls, those instructions propagate unchecked. They then assemble multiple legitimate tools into a workflow that collects and discloses private data in three phases: Parasitic Ingestion, Privacy Collection, and Privacy Disclosure. A reader would care because this shifts the attack surface from manipulating outputs to hijacking entire execution flows in LLM-powered applications. The large-scale analysis of 12,230 tools across 1,360 servers finds that the ecosystem is rife with exploitable gadgets.
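The three phases map onto tool capabilities, which is what gives an ecosystem-wide census its meaning. The sketch below is a minimal illustration, assuming a naive keyword heuristic: it labels each tool with the EIT/PAT/NAT roles named in the Figure 4 caption and asks whether a configured tool set covers all three. The keyword lists, the `Tool` class and both functions are hypothetical; this is not MCP-SEC's actual analysis.

```python
from dataclasses import dataclass

# Hypothetical keyword lists per role (assumption); role labels follow the Figure 4 caption.
ROLE_KEYWORDS: dict[str, tuple[str, ...]] = {
    "EIT": ("fetch", "post", "web page", "inbox", "issue", "comment", "feed"),
    "PAT": ("file", "credential", "contact", "history", "clipboard", "calendar"),
    "NAT": ("send", "upload", "http request", "webhook", "publish"),
}


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def classify(tool: Tool) -> set[str]:
    """Return every role whose keywords occur in the tool's name or description."""
    text = f"{tool.name} {tool.description}".lower()
    return {role for role, words in ROLE_KEYWORDS.items()
            if any(word in text for word in words)}


def covers_all_phases(tools: list[Tool]) -> bool:
    """True if the tools jointly offer ingestion, private-data access and network access."""
    roles: set[str] = set()
    for tool in tools:
        roles |= classify(tool)
    return {"EIT", "PAT", "NAT"} <= roles


if __name__ == "__main__":
    configured = [
        Tool("get_posts", "Fetch the latest posts from a forum"),
        Tool("read_file", "Read a local file from disk"),
        Tool("send_email", "Send an email to a recipient"),
    ]
    print(covers_all_phases(configured))
```

Substring matching both over-flags ("file" also matches "profile") and under-flags tools whose descriptions avoid the keywords, so a heuristic like this yields only a screening signal; static flags of this kind still need manual or dynamic confirmation.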

Core claim

The Model Context Protocol lacks both context-tool isolation and least-privilege enforcement. This enables adversarial instructions to propagate unchecked into sensitive tool invocations, allowing the assembly of legitimate tools into coordinated malicious workflows that culminate in stealthy privacy exfiltration through parasitic ingestion, collection, and disclosure phases.
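One way to make the two missing properties concrete: a host that tracks whether untrusted external content has entered the context (context-tool isolation) and grants each tool only declared capabilities (least privilege) could refuse sensitive calls made after ingestion. The sketch below assumes a hypothetical `SessionGuard` wrapped around tool dispatch; it is not part of the MCP specification and not a defense proposed in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    # Hypothetical per-tool capability grant (least privilege).
    ingests_untrusted: bool = False  # returns external, attacker-controllable content
    sensitive: bool = False          # reads private data or sends data off-host


@dataclass
class SessionGuard:
    policies: dict[str, ToolPolicy]
    tainted: bool = False            # True once external content has entered the context
    log: list[str] = field(default_factory=list)

    def authorize(self, tool: str, user_confirmed: bool = False) -> bool:
        """Decide whether a tool call may run in this session."""
        policy = self.policies.get(tool)
        if policy is None:
            # Default-deny tools that were never granted.
            self.log.append(f"deny {tool}: not granted")
            return False
        if policy.sensitive and self.tainted and not user_confirmed:
            # Isolation: untrusted text may not silently drive sensitive calls.
            self.log.append(f"deny {tool}: untrusted content in context")
            return False
        if policy.ingests_untrusted:
            self.tainted = True
        self.log.append(f"allow {tool}")
        return True


if __name__ == "__main__":
    guard = SessionGuard({
        "get_posts": ToolPolicy(ingests_untrusted=True),
        "read_file": ToolPolicy(sensitive=True),
    })
    guard.authorize("get_posts")
    print(guard.authorize("read_file"))  # False: untrusted content is now in context
```

Session-level taint is deliberately coarse: it also blocks legitimate sensitive calls after any ingestion, so a practical design would need finer-grained tracking of which data flows into which tool arguments.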

What carries the argument

The Parasitic Toolchain Attack pattern, instantiated as MCP-UPD, which infiltrates the toolchain via external data and coordinates legitimate tools to disclose private data.

If this is right

  • Attackers can achieve malicious goals by targeting the toolchain rather than individual prompts or tools.
  • LLM integrations with external systems become vulnerable to indirect attacks that require no direct victim interaction.
  • Many existing MCP servers contain gadgets that facilitate such attacks.
  • Defense mechanisms are urgently needed to secure LLM-integrated environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar vulnerabilities likely exist in other emerging LLM tool orchestration protocols.
  • Implementing context isolation could prevent propagation of malicious instructions across tools.
  • Regular security censuses of tool ecosystems should be conducted to monitor for exploitable patterns.
  • Testing the attack in real production LLM deployments would validate the practical risk.

Load-bearing premise

The tools and servers analyzed are representative of real-world MCP deployments, and the attack can be carried out in production without extra attacker capabilities.

What would settle it

An experiment showing that an LLM connected to MCP tools processes a maliciously crafted external data source and then invokes tools to exfiltrate private user data.
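Such an experiment needs a success criterion. A minimal sketch, assuming the experimenter logs each tool call with a role label (the EIT/PAT/NAT labels from the Figure 4 caption, plus an assumed catch-all "other"), is an ordered-subsequence check over the recorded trace. The function name and trace format are hypothetical, not the paper's evaluation code.

```python
from typing import Iterable

# Ingestion -> collection -> disclosure, using the Figure 4 role labels.
PHASES = ("EIT", "PAT", "NAT")


def completes_upd_chain(trace: Iterable[tuple[str, str]]) -> bool:
    """Return True if the roles in `trace` contain EIT, then PAT, then NAT, in order.

    Each entry is (tool_name, role). Unrelated calls may appear in between;
    a greedy scan is sufficient for ordered-subsequence matching.
    """
    stage = 0
    for _tool_name, role in trace:
        if stage < len(PHASES) and role == PHASES[stage]:
            stage += 1
    return stage == len(PHASES)


if __name__ == "__main__":
    recorded = [
        ("get_posts", "EIT"),
        ("summarize", "other"),
        ("read_file", "PAT"),
        ("send_email", "NAT"),
    ]
    print(completes_upd_chain(recorded))
```

Ordering alone does not establish causation; a stricter check would also confirm that the disclosed payload actually contains the data collected in the Privacy Collection phase.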

Figures

Figures reproduced from arXiv: 2509.06572 by Libo Chen, Qinsheng Hou, Shenghong Li, Shuli Zhao, Yanhao Wang, Yuchong Xie, Yu Guo, Zhi Xue, Zihan Zhan.

Figure 1. Overview of the MCP workflow architecture. The diagram illustrates the seven-step process: ➀ Initializing connections between MCP Host and Servers, ➁ Prompt Formatting of user requests, ➂ Decision making by the Large Language Model, ➃ Tool Invoking through MCP clients, ➄ Tool Executing on MCP servers, ➅ Result organizing by the LLM, and ➆ Result Presenting to the user. The architecture shows the MCP Host m…

Figure 2. Attack process of MCP-UPD (MCP Unauthorized Privacy Disclosure). The diagram illustrates a three-phase parasitic toolchain attack: ➀ Parasitic Ingestion: the user invokes an external ingestion tool (get_posts) to retrieve content containing a malicious prompt that instructs the agent to perform unauthorized actions; ➁ Privacy Collection: the compromised agent follows the injected instructions to access sen…

Figure 3. The workflow of MCP-SEC.

Figure 4. Statistics of exploitable MCP tools/servers. EIT/S represents the External Ingestion Tool/Server, PAT/S represents the Privacy Access Tool/Server, and NAT/S represents the Network Access Tool/Server.

Figure 5. Distribution of GitHub stars of MCP servers with risky tools.
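Step ➃ in Figure 1 (Tool Invoking) is, as we understand the MCP specification, a JSON-RPC 2.0 `tools/call` request whose result carries a list of content blocks. The sketch below builds such a request and then shows a hypothetical naive host that copies returned text straight into the model context, which is the gap that lets instructions embedded in fetched content sit beside the user's own request. `naive_context_append` is illustrative only; check the current specification revision before relying on field names.

```python
import json


def tools_call_request(req_id: int, name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 'tools/call' request (step 4 in Figure 1)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def naive_context_append(context: list[str], result: dict) -> list[str]:
    """Hypothetical naive host: append a tool result's text blocks to the context verbatim.

    Nothing marks the appended text as untrusted, so any instructions inside it
    are indistinguishable from the rest of the prompt.
    """
    for block in result.get("content", []):
        if block.get("type") == "text":
            context.append(block["text"])
    return context


if __name__ == "__main__":
    print(tools_call_request(1, "get_posts", {"forum": "example"}))
```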
Original abstract

Large language models (LLMs) are increasingly integrated with external systems through the Model Context Protocol (MCP), which standardizes tool invocation and has rapidly become a backbone for LLM-powered applications. While this paradigm enhances functionality, it also introduces a fundamental security shift: LLMs transition from passive information processors to autonomous orchestrators of task-oriented toolchains, expanding the attack surface and elevating adversarial goals from manipulating single outputs to hijacking entire execution flows. In this paper, we identify and characterize a systematic privacy-leakage attack pattern, termed Parasitic Toolchain Attacks, instantiated as MCP Unintended Privacy Disclosure (MCP-UPD). These attacks require no direct victim interaction; instead, adversaries embed malicious instructions into external data sources that LLMs access during legitimate tasks. Unlike traditional prompt injection and tool poisoning attacks, our attack targets the interconnected toolchain itself, assembling multiple legitimate tools into a coordinated workflow whose combined behavior accomplishes malicious objectives. In MCP-UPD, the malicious logic infiltrates the toolchain and unfolds in three phases: Parasitic Ingestion, Privacy Collection, and Privacy Disclosure, culminating in stealthy exfiltration of private data. Our root cause analysis reveals that MCP lacks both context-tool isolation and least-privilege enforcement, enabling adversarial instructions to propagate unchecked into sensitive tool invocations. To assess the severity, we design MCP-SEC and conduct the first large-scale security census of the MCP ecosystem, analyzing 12,230 tools across 1,360 servers. Our findings show that the MCP ecosystem is rife with real-world exploitable gadgets and diverse attack methods, underscoring systemic risks in MCP platforms and the urgent need for defense mechanisms in LLM-integrated environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Parasitic Toolchain Attacks on the Model Context Protocol (MCP) ecosystem for LLM tool integrations. It defines MCP Unintended Privacy Disclosure (MCP-UPD) as a three-phase attack (Parasitic Ingestion, Privacy Collection, Privacy Disclosure) in which adversaries embed malicious instructions in external data sources that LLMs access during legitimate tasks; these instructions then hijack legitimate tools and coordinate them to exfiltrate private data without direct victim interaction. The authors perform a root-cause analysis attributing the vulnerability to MCP's lack of context-tool isolation and least-privilege enforcement, design the MCP-SEC scanner, and report a large-scale census of 12,230 tools across 1,360 servers that identifies widespread exploitable gadgets and diverse attack methods.

Significance. If the empirical claims are strengthened by execution-based validation, the work would be significant as the first large-scale security census of the emerging MCP ecosystem. It surfaces a novel attack pattern that targets interconnected toolchains rather than single prompts or tools, and the scale of the scan (over 12k tools) provides concrete evidence of systemic risks that could guide protocol-level defenses in LLM-integrated environments. The empirical focus and identification of real-world gadgets are clear strengths.

major comments (2)
  1. [§5] §5 (Large-scale analysis / MCP-SEC description): The census catalogs tools with data-access or exfiltration capabilities via static inspection of schemas and descriptions, yet provides no evidence that these gadgets can be assembled into the full three-phase MCP-UPD workflow when an LLM processes external data containing parasitic instructions. Without dynamic execution of the Parasitic Ingestion → Privacy Collection → Privacy Disclosure sequence in a live MCP client connected to an LLM, the central claim that the ecosystem is 'rife with real-world exploitable gadgets' rests on an untested assumption rather than an observed outcome.
  2. [§4] §4 (Attack construction and root-cause analysis): The diagnosis that MCP lacks context-tool isolation and least-privilege enforcement is plausible from the protocol overview, but the manuscript does not include concrete traces or examples showing how an adversarial instruction embedded in external data actually propagates through the toolchain to trigger a sensitive tool invocation under realistic task conditions. This gap directly affects the severity conclusion.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 use the term 'MCP-UPD' before it is formally defined; a forward reference or early definition would improve readability.
  2. [Table 1] Table 1 (or equivalent summary of scan results) reports aggregate counts but does not break down the fraction of tools that were manually validated versus automatically flagged; adding this would clarify the reliability of the 12,230-tool census.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the novelty of the Parasitic Toolchain Attack pattern and the scale of our MCP ecosystem census. We address each major comment below and describe the revisions we will incorporate to strengthen the empirical grounding of our claims.

  1. Referee: [§5] §5 (Large-scale analysis / MCP-SEC description): The census catalogs tools with data-access or exfiltration capabilities via static inspection of schemas and descriptions, yet provides no evidence that these gadgets can be assembled into the full three-phase MCP-UPD workflow when an LLM processes external data containing parasitic instructions. Without dynamic execution of the Parasitic Ingestion → Privacy Collection → Privacy Disclosure sequence in a live MCP client connected to an LLM, the central claim that the ecosystem is 'rife with real-world exploitable gadgets' rests on an untested assumption rather than an observed outcome.

    Authors: We agree that dynamic validation of full workflows would provide stronger confirmation that the identified gadgets can be composed into complete MCP-UPD attacks. Our MCP-SEC scanner employs static analysis of tool schemas and descriptions precisely to enable a scalable census across more than 12,000 tools, which would be infeasible to replicate dynamically at full scale. Nevertheless, we conducted targeted dynamic experiments during our research to validate representative gadget assemblies. In the revised manuscript we will add a dedicated subsection to §5 that reports these execution-based case studies, including concrete traces of the three-phase sequence executed in a live MCP client connected to an LLM under controlled conditions. revision: yes

  2. Referee: [§4] §4 (Attack construction and root-cause analysis): The diagnosis that MCP lacks context-tool isolation and least-privilege enforcement is plausible from the protocol overview, but the manuscript does not include concrete traces or examples showing how an adversarial instruction embedded in external data actually propagates through the toolchain to trigger a sensitive tool invocation under realistic task conditions. This gap directly affects the severity conclusion.

    Authors: We concur that explicit propagation traces would make the root-cause analysis more compelling. Section 4 currently grounds the diagnosis in the MCP protocol specification and outlines the three attack phases with illustrative scenarios. To directly address this concern, we will expand §4 in the revision with additional concrete execution traces drawn from our prototype attack implementations, demonstrating step-by-step how an adversarial instruction in external data propagates through the MCP context to invoke sensitive tools under realistic task conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical census of MCP tools with no derivations or self-referential predictions

full rationale

This is a measurement study that catalogs tool capabilities across 12,230 tools and diagnoses protocol-level isolation failures from the MCP specification. No equations, fitted parameters, or first-principles derivations appear. Claims about parasitic ingestion, privacy collection, and disclosure rest directly on the observed tool schemas and the absence of context-tool isolation in the protocol description; they are not redefined or predicted from prior results within the paper itself. The analysis is anchored to external data (the public MCP ecosystem) and does not reduce any central finding to a tautology or self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on domain assumptions about MCP's design and the novelty of the attack pattern; no numerical free parameters are introduced.

axioms (1)
  • domain assumption: MCP standardizes tool invocation for LLMs and is rapidly becoming a backbone for LLM-powered applications
    Invoked in the opening of the abstract as background for the security shift described.
invented entities (2)
  • Parasitic Toolchain Attacks (no independent evidence)
    purpose: To name and characterize the new systematic privacy-leakage attack pattern
    Introduced as the core contribution distinct from prompt injection and tool poisoning.
  • MCP-UPD (no independent evidence)
    purpose: Specific instantiation of the attack for unintended privacy disclosure
    Defined as the concrete attack realized through the three phases.

pith-pipeline@v0.9.0 · 5861 in / 1401 out tokens · 60115 ms · 2026-05-18T18:39:36.232433+00:00 · methodology


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A First Measurement Study on Authentication Security in Real-World Remote MCP Servers

    cs.CR 2026-05 conditional novelty 8.0

    First measurement study of 7,973 remote MCP servers finds 40.55% lack authentication and all 119 tested OAuth servers have flaws that risk data leaks or account takeover.

  2. MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security

    cs.CR 2026-04 conditional novelty 7.0

    MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.

  3. From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers

    cs.CR 2026-04 unverdicted novelty 7.0

    Presents a component-centric PoC dataset of malicious MCP servers and a two-stage behavioral-deviation detector, Connor, that achieves a 94.6% F1-score.

  4. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

    cs.CR 2025-03 unverdicted novelty 7.0

    MCP lifecycle is defined with four phases and 16 activities; a threat taxonomy of 16 scenarios is constructed, validated via case studies, and paired with phase-specific safeguards.

  5. Behavioral Integrity Verification for AI Agent Skills

    cs.CR 2026-05 unverdicted novelty 6.0

    BIV audits AI agent skills at scale, finding that 80% of 49,943 skills deviate from their declared behavior, and achieves 0.946 F1 for malicious skill detection.

  6. Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

    cs.SE 2026-05 unverdicted novelty 6.0

    MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.

  7. MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems

    cs.CR 2026-04 unverdicted novelty 6.0

    MCPThreatHive automates the full lifecycle of threat intelligence for MCP agentic systems using a new 38-pattern taxonomy mapped to STRIDE and OWASP frameworks plus composite risk scoring.

  8. Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP

    cs.CR 2026-02 unverdicted novelty 5.0

    The paper identifies twelve protocol-level security risks across MCP, A2A, Agora, and ANP and quantifies wrong-provider tool execution risk in MCP via a measurement-driven case study on multi-server composition.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 8 Pith papers · 4 internal anchors

  1. Prompt Security Top 10: Key Security Risks for MCPs. Accessed: 2025-10-27.

  2. Protecting against indirect prompt injection attacks in MCP. Accessed: 2025-10-27.

  3. MCP-Sec. https://anonymous.4open.science/r/MCP-SEC-3FD0/. Accessed: 2025-11-09.

  4. The Demo Site of Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem. https://secresearcher100.github.io/. Accessed: 2025-11-09.

  5. Awesome MCP Servers. https://github.com/punkpeye/awesome-mcp-servers/blob/main/README.md. Accessed: 2025-11-13.

  6. Cline - AI Coding, Open Source and Uncompromised. https://cline.bot/. Accessed: 2025-11-13.

  7. Cursor - The AI Code Editor. https://cursor.com/home. Accessed: 2025-11-13.

  8. Discover Top MCP Servers | MCP Market. https://mcpmarket.com/. Accessed: 2025-11-13.

  9. Download Claude. https://claude.ai/download. Accessed: 2025-11-13.

  10. GitHub - baranwang/mcp-trends-hub. https://github.com/baranwang/mcp-trends-hub. Accessed: 2025-11-13.

  11. GitHub - brightdata/brightdata-mcp. https://github.com/brightdata/brightdata-mcp. Accessed: 2025-11-13.

  12. GitHub - cswkim/discogs-mcp-server. https://github.com/cswkim/discogs-mcp-server. Accessed: 2025-11-13.

  13. GitHub - deanward/hal. https://github.com/deanward/hal. Accessed: 2025-11-13.

  14. GitHub - devabdultech/hn-mcp. https://github.com/devabdultech/hn-mcp. Accessed: 2025-11-13.

  15. GitHub - evalsone/mcp-connect. https://github.com/evalsone/mcp-connect. Accessed: 2025-11-13.

  16. GitHub - ivo-toby/contentful-mcp. https://github.com/ivo-toby/contentful-mcp. Accessed: 2025-11-13.

  17. GitHub - modelcontextprotocol/python-sdk. https://github.com/modelcontextprotocol/python-sdk. Accessed: 2025-11-13.

  18. GitHub - oschina/mcp-gitee. https://github.com/oschina/mcp-gitee. Accessed: 2025-11-13.

  19. GitHub - pimzino/agentic-tools-mcp. https://github.com/pimzino/agentic-tools-mcp. Accessed: 2025-11-13.

  20. GitHub - wonderwhy-er/DesktopCommanderMCP. https://github.com/wonderwhy-er/DesktopCommanderMCP. Accessed: 2025-11-13.

  21. Introduction - Model Context Protocol. https://modelcontextprotocol.io/docs/getting-started/intro. Accessed: 2025-11-13.

  22. MCP Security Notification: Tool Poisoning Attacks. https://invariantlabs.ai/blog/mcp-security-notification-tool-poisoning-attacks. Accessed: 2025-11-13.

  23. MCP Server Directory | PulseMCP. https://www.pulsemcp.com/servers. Accessed: 2025-11-13.

  24. npx | npm Docs. https://docs.npmjs.com/cli/v8/commands/npx. Accessed: 2025-11-13.

  25. Using tools | uv. https://docs.astral.sh/uv/guides/tools/. Accessed: 2025-11-13.

  26. Canyu Chen and Kai Shu. Can llm-generated misinformation be detected? In NeurIPS 2023 Workshop on Regulatable ML, 2023.

  27. Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, and Jun Xu. Bias and unfairness in information retrieval systems: New challenges in the llm era. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6437–6447, 2024.

  28. Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, and Tanu Mitra. “They are uncultured”: Unveiling covert harms and social threats in llm generated conversations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20339–20369, 2024.

  29. Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, 2020.

  30. Yongjian Guo, Puzhuo Liu, Wanlun Ma, Zehang Deng, Xiaogang Zhu, Peng Di, Xi Xiao, and Sheng Wen. Systematic analysis of mcp security. arXiv preprint arXiv:2508.12538, 2025.

  31. Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. Stabletoolbench: Towards stable large-scale benchmarking on tool learning of large language models. In Findings of the Association for Computational Linguistics ACL 2024, pages 11143–11156, 2024.

  32. Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and Ahmed E Hassan. Model context protocol (mcp) at first glance: Studying the security and maintainability of mcp servers. arXiv preprint arXiv:2506.13538, 2025.

  33. Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. Model context protocol (mcp): Landscape, security threats, and future research directions. arXiv preprint arXiv:2503.23278, 2025.

  34. Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating llm hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, 2023.

  35. Sonu Kumar, Anubhav Girdhar, Ritesh Patil, and Divyansh Tripathi. Mcp guardian: A security-first layer for safeguarding mcp-based ai system. arXiv preprint arXiv:2504.12757, 2025.

  36. Zhihao Li, Kun Li, Boyang Ma, Minghui Xu, Yue Zhang, and Xiuzhen Cheng. We urgently need privilege management in mcp: A measurement of api usage in mcp ecosystems. arXiv preprint arXiv:2507.06250, 2025.

  37. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572, 2023.

  38. Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374, 2023.

  39. Vineeth Sai Narajala and Idan Habler. Enterprise-grade security for the model context protocol (mcp): Frameworks and mitigation strategies. arXiv preprint arXiv:2504.08623, 2025.

  40. Liang Niu, Shujaat Mirza, Zayd Maradni, and Christina Pöpper. CodexLeaks: Privacy leaks from code generation language models in GitHub copilot. In 32nd USENIX Security Symposium (USENIX Security 23), pages 2133–2150, 2023.

  41. Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Wang. On the risk of misinformation pollution with large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1389–1403, 2023.

  42. Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37:126544–126565, 2024.

  43. Brandon Radosevich and John Halloran. Mcp safety audit: Llms with the model context protocol allow major security exploits. arXiv preprint arXiv:2504.03767, 2025.

  44. Maribeth Rauh, John Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason Gabriel, et al. Characteristics of harmful text: Towards rigorous benchmarking of language models. Advances in Neural Information Processing Systems, 35:24720–24739, 2022.

  45. Xinyue Shen, Yixin Wu, Yiting Qu, Michael Backes, Savvas Zannettou, and Yang Zhang. HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns. In USENIX Security Symposium (USENIX Security). USENIX, 2025.

  46. Hao Song, Yiming Shen, Wenxuan Luo, Leixin Guo, Ting Chen, Jiashui Wang, Beibei Li, Xiaosong Zhang, and Jiachi Chen. Beyond the protocol: Unveiling attack vectors in the model context protocol ecosystem. arXiv preprint arXiv:2506.02040, 2025.

  47. Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023.

  48. Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. Mcptox: A benchmark for tool poisoning attack on real-world mcp servers. arXiv preprint arXiv:2508.14925, 2025.

  49. Zihan Wang, Hongwei Li, Rui Zhang, Yu Liu, Wenbo Jiang, Wenshu Fan, Qingchuan Zhao, and Guowen Xu. Mpma: Preference manipulation attack against model context protocol. arXiv preprint arXiv:2505.11154, 2025.

  50. Yuchong Xie, Mingyu Luo, Zesen Liu, Zhixiang Zhang, Kaikai Zhang, Yu Liu, Zongjie Li, Ping Chen, Shuai Wang, and Dongdong She. On the security of tool-invocation prompts for llm-based agentic systems: An empirical risk assessment. arXiv preprint arXiv:2509.05755, 2025.

  51. Yixuan Yang, Daoyuan Wu, and Yufan Chen. Mcpsecbench: A systematic security benchmark and playground for testing model context protocols. arXiv preprint arXiv:2508.13220, 2025.

  52. Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, Yu-Yang Liu, and Li Yuan. Llm lies: Hallucinations are not bugs, but features as adversarial examples. arXiv preprint arXiv:2310.01469, 2023.

  53. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

  54. Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350, 2024.

  55. Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, and Xiangyu Zhang. On large language models’ resilience to coercive interrogation. In 2024 IEEE Symposium on Security and Privacy (SP), pages 826–844. IEEE, 2024.

  56. Weibo Zhao, Jiahao Liu, Bonan Ruan, Shaofei Li, and Zhenkai Liang. When mcp servers attack: Taxonomy, feasibility, and mitigation. arXiv preprint arXiv:2509.24272, 2025.

  57. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.