MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic
Pith reviewed 2026-05-14 21:13 UTC · model grok-4.3
The pith
MCPShield shows that content embeddings of tool arguments and responses are essential for detecting attacks on LLM agent traffic, lifting AUROC from 0.64 to above 0.89.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MCPShield encodes each agent session as a graph with tool calls as nodes and sequential plus data-flow links as edges, then enriches the nodes with SBERT sentence-embedding features drawn from arguments and responses before classifying the session as benign or attacked. On task-stratified splits of RAS-Eval, metadata-only inputs plateau near 0.64 AUROC across GNNs, MLPs, and classical baselines, while adding content embeddings pushes the score above 0.89; tree ensembles on pooled embeddings reach 0.975 AUROC and outperform the GNNs (0.917) and MLP (0.896). Self-supervised pre-training of the embeddings yields no label-efficiency gain, and GraphSAGE is retained as the GNN baseline on ATBench.
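The encoding described above can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's code: the embedding function is a hash-based stand-in for SBERT, and the data-flow heuristic (a later call's arguments reusing an earlier call's response verbatim) is our assumption about how such edges could be derived.

```python
import hashlib

def toy_embed(text: str, dim: int = 8) -> list[float]:
    # Stand-in for an SBERT sentence embedding: a deterministic
    # hash-based vector, used only so the sketch is self-contained.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def build_session_graph(calls: list[dict]) -> dict:
    # calls: [{"tool": str, "args": str, "response": str}, ...]
    # One node per tool call, featurized from tool name, arguments,
    # and response text (the content features the paper argues for).
    nodes = [toy_embed(c["tool"] + " " + c["args"] + " " + c["response"])
             for c in calls]
    # Sequential links between consecutive calls.
    edges = [(i, i + 1) for i in range(len(calls) - 1)]
    # Data-flow links: call j's arguments reuse call i's response.
    for i, ci in enumerate(calls):
        for j in range(i + 1, len(calls)):
            if ci["response"] and ci["response"] in calls[j]["args"]:
                edges.append((i, j))
    return {"x": nodes, "edge_index": edges}

session = [
    {"tool": "read_file", "args": "notes.txt", "response": "TOKEN-123"},
    {"tool": "http_post", "args": "send TOKEN-123", "response": "ok"},
]
g = build_session_graph(session)
```

Here the exfiltration-shaped session produces both a sequential edge and a data-flow edge between the two calls, which is exactly the structure a session-graph classifier would consume.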
What carries the argument
The session graph whose nodes are tool calls enriched with SBERT embeddings of arguments and responses, with edges capturing sequential order and data-flow dependencies.
If this is right
- Content features from arguments and responses are required for high-accuracy detection; metadata alone is insufficient.
- Evaluation must use task-disjoint splits to prevent memorization that inflates AUROC by as much as 26 points.
- Tree ensembles applied to pooled SBERT embeddings can outperform graph neural networks on this task.
- Self-supervised pre-training of the embeddings provides no measurable benefit for label efficiency.
Where Pith is reading between the lines
- Simpler non-graph classifiers may be adequate and cheaper to deploy than full GNN pipelines for monitoring agent traffic.
- The semantic content of tool interactions appears to carry the main attack signal, which could allow similar detectors to transfer to other agent tool protocols.
- Real-time monitoring at the tool interface level might suffice without constructing full session graphs for every query.
Load-bearing premise
The attack examples crafted for RAS-Eval and ATBench are representative of the threats that would appear against real LLM agents.
What would settle it
Testing the same models on a new collection of attack tool calls whose semantic patterns differ from those in the training benchmarks; the claim would be falsified if AUROC falls below 0.75.
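The falsification criterion is operational: score the held-out attack set and check the AUROC against the 0.75 threshold. A minimal rank-based AUROC (pure Python, ties counted half) would suffice for that check:

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    # AUROC as the probability that a randomly chosen attacked session
    # (label 1) scores higher than a randomly chosen benign one (label 0),
    # with ties counted as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On a perfectly separated set this returns 1.0; the falsification test is simply `auroc(scores, labels) < 0.75` on the new attack collection.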
Original abstract
The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. In this article, MCPShield is presented as an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: an AUROC of 0.975 was reached by tree ensembles on pooled embeddings, performing, for the most part, better than the neural architectures in the primary RAS-Eval setting including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MCPShield, a detection framework for attacks on Model Context Protocol (MCP) tool-call traffic in LLM agents. Sessions are encoded as graphs (tool calls as nodes, sequential/data-flow links as edges) with nodes enriched by SBERT sentence embeddings over arguments and responses. It evaluates three GNNs (GAT, GCN, GraphSAGE), an MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) on RAS-Eval (task-stratified splits) and ATBench (label-stratified), plus a combined variant. Key claims: content features are essential (metadata-only AUROC ~0.64 vs. >0.89 with embeddings); random splits inflate AUROC by up to 26 points; the signal is primarily in SBERT embeddings, with tree ensembles on pooled embeddings reaching 0.975 AUROC and outperforming GNNs (0.917) and MLP (0.896) in the main setting; self-supervised pre-training shows no label-efficiency gain.
Significance. If the content-aware detection generalizes beyond the specific attack constructions in RAS-Eval and ATBench, the work would be significant for LLM agent security by establishing that metadata alone is insufficient and that proper task-stratified evaluation is required to avoid memorization artifacts. The explicit comparison across architectures and split types, plus the reproducible finding that simple tree models on pooled embeddings outperform graph neural networks, provides a useful benchmark and cautionary result for the field. The stratified-split methodology and architecture ablation are clear strengths.
major comments (2)
- [§5] §5 (primary RAS-Eval results): The central claim that content embeddings capture generalizable attack signals (AUROC >0.89) is load-bearing, yet the superior performance of tree ensembles on pooled SBERT embeddings (0.975) over GNNs (0.917) and MLP (0.896) indicates the detection signal is largely lexical rather than structural. This raises the possibility that the large gap versus metadata-only (0.64) exploits dataset-specific artifacts in the injected attacks rather than intrinsic MCP threat patterns; a concrete test (e.g., evaluation on subtle argument-manipulation attacks preserving embedding proximity) is needed to support generalizability.
- [Evaluation protocol] Evaluation protocol (RAS-Eval task-stratified vs. ATBench label-stratified): While the split-inflation finding is well-supported, the manuscript does not report whether the 0.975 AUROC for pooled trees holds under the stricter task-disjoint regime used for the GNN comparison, nor does it provide per-split variance or statistical tests on the AUROC differences; this weakens the cross-architecture and cross-dataset claims.
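The task-stratified regime at issue in these comments can be made concrete. A hedged sketch of a task-disjoint split, in which sessions from the same benchmark task never appear in both train and test (the field names are hypothetical, not the paper's schema):

```python
import random

def task_disjoint_split(sessions: list[dict], test_frac: float = 0.25,
                        seed: int = 0) -> tuple[list[dict], list[dict]]:
    # Split at the task level, not the session level, so that a model
    # cannot score well by memorizing task-specific surface features.
    tasks = sorted({s["task"] for s in sessions})
    rng = random.Random(seed)
    rng.shuffle(tasks)
    n_test = max(1, int(len(tasks) * test_frac))
    test_tasks = set(tasks[:n_test])
    train = [s for s in sessions if s["task"] not in test_tasks]
    test = [s for s in sessions if s["task"] in test_tasks]
    return train, test

# Twelve sessions spread over four tasks.
sessions = [{"task": f"task-{i % 4}", "trace": i} for i in range(12)]
train_set, test_set = task_disjoint_split(sessions)
```

A naive random split over sessions, by contrast, would place sessions from the same task on both sides, which is the memorization confound behind the up-to-26-point inflation the review cites.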
minor comments (2)
- [Abstract] Abstract and §5: Reported AUROCs lack error bars, confidence intervals, or statistical significance tests on the differences (e.g., 0.975 vs. 0.917), which would make the architecture comparisons more robust.
- [§4] Notation: The distinction between 'pooled embeddings' for the tree baselines and the node-level embeddings used in GNNs should be clarified with an explicit equation or diagram to avoid ambiguity in how the 0.975 result is obtained.
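One explicit rendering of the distinction the minor comment asks for (our notation, not the paper's): the GNNs consume the per-node SBERT vectors x_1, ..., x_n together with the edge structure, while the tree baselines consume a single session-level vector, the mean pool (1/n) Σ_i x_i.

```python
def mean_pool(node_embeddings: list[list[float]]) -> list[float]:
    # Session vector for the non-graph baselines: elementwise mean of
    # the per-node (per-tool-call) embedding vectors. The graph structure
    # is discarded entirely at this point.
    n, d = len(node_embeddings), len(node_embeddings[0])
    return [sum(x[k] for x in node_embeddings) / n for k in range(d)]

# Two tool calls with 3-dim toy embeddings -> one pooled session vector.
pooled = mean_pool([[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
```

That the 0.975 result is obtained from this structure-free representation is precisely why the referee reads the signal as lexical rather than structural.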
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, clarifying our evaluation protocol and acknowledging limitations where appropriate. Revisions will be made to improve transparency on the lexical nature of the signal and to add statistical details.
Point-by-point responses
- Referee: [§5] §5 (primary RAS-Eval results): The central claim that content embeddings capture generalizable attack signals (AUROC >0.89) is load-bearing, yet the superior performance of tree ensembles on pooled SBERT embeddings (0.975) over GNNs (0.917) and MLP (0.896) indicates the detection signal is largely lexical rather than structural. This raises the possibility that the large gap versus metadata-only (0.64) exploits dataset-specific artifacts in the injected attacks rather than intrinsic MCP threat patterns; a concrete test (e.g., evaluation on subtle argument-manipulation attacks preserving embedding proximity) is needed to support generalizability.
Authors: We agree that the 0.975 AUROC of tree ensembles on pooled embeddings demonstrates the signal is predominantly lexical within the SBERT content features, rather than structural. This is consistent with our central finding that content is essential (metadata-only AUROC ~0.64). The task-stratified splits on RAS-Eval were designed to reduce memorization of specific attacks or tasks, providing some support for generalizability within the evaluated attack constructions. However, we did not evaluate on subtle argument-manipulation attacks that preserve embedding proximity, which would be a valuable addition for broader claims. We will revise §5 and the discussion to explicitly note the lexical character of the signal and list this as a limitation for future work. revision: partial
- Referee: [Evaluation protocol] Evaluation protocol (RAS-Eval task-stratified vs. ATBench label-stratified): While the split-inflation finding is well-supported, the manuscript does not report whether the 0.975 AUROC for pooled trees holds under the stricter task-disjoint regime used for the GNN comparison, nor does it provide per-split variance or statistical tests on the AUROC differences; this weakens the cross-architecture and cross-dataset claims.
Authors: We clarify that the 0.975 AUROC for pooled tree ensembles was obtained under the identical task-stratified (task-disjoint) splits used for the GNN and MLP comparisons in the primary RAS-Eval experiments; the full architecture ablation was run uniformly in this regime. We will add per-split variance (standard deviation across folds) and statistical tests (e.g., DeLong tests for AUROC differences and confidence intervals) to the revised tables and text. We will also include tree-ensemble results on ATBench to enable direct cross-dataset comparison under its label-stratified protocol. revision: yes
- Proposed follow-up experiment: evaluation on subtle argument-manipulation attacks preserving SBERT embedding proximity, to test whether the content signal generalizes beyond the specific attack constructions in RAS-Eval and ATBench.
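The statistical testing the rebuttal promises (DeLong tests, confidence intervals) can be approximated with a paired bootstrap over sessions; the sketch below is our stand-in for DeLong's analytic test, not the authors' planned procedure. Both detectors are scored on the same resampled test set, so the interval reflects the paired difference, e.g. the 0.975 vs. 0.917 gap.

```python
import random

def auroc(scores: list[float], labels: list[int]) -> float:
    # Rank-based AUROC, ties counted half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_diff(s_a, s_b, labels, n_boot=500, seed=0):
    # Paired bootstrap: resample sessions with replacement and recompute
    # the AUROC difference on each resample; return an approximate 95% CI.
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined without both classes
        diffs.append(auroc([s_a[i] for i in idx], ys)
                     - auroc([s_b[i] for i in idx], ys))
    diffs.sort()
    return diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs)) - 1]

# Toy comparison: a perfect detector vs. an uninformative constant one.
labels = [1, 0] * 10
perfect = [float(y) for y in labels]   # AUROC 1.0 on any resample
constant = [0.5] * len(labels)         # AUROC 0.5 (all ties)
lo, hi = bootstrap_auroc_diff(perfect, constant, labels)
```

An interval excluding zero would support the cross-architecture claim; an interval straddling zero would mean the 0.058 gap is not resolved at this sample size.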
Circularity Check
No circularity: empirical results on held-out stratified splits
full rationale
The paper reports direct experimental comparisons of AUROC on task-stratified and label-stratified held-out splits of RAS-Eval and ATBench. Content vs. metadata feature ablations, GNN vs. tree-ensemble results, and the random-split inflation observation are all obtained by training and evaluating models on disjoint data partitions; none of the reported quantities are obtained by fitting a parameter to a subset and then relabeling the same quantity as a prediction. No equations, self-definitions, or load-bearing self-citations appear in the derivation chain that would reduce the central claims to the inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- SBERT embedding model
axioms (2)
- domain assumption: Tool-call sessions can be faithfully represented as graphs with sequential and data-flow edges
- domain assumption: Attack labels in RAS-Eval and ATBench reflect genuine malicious behavior
Reference graph
Works this paper leans on
- [1] GitHub topics: mcp-server, https://github.com/topics/mcp-server, accessed 2026-05-03 (2026)
- [3] X. Zong, Z. Shen, L. Wang, Y. Lan, C. Yang, MCP-SafetyBench: Safety evaluation for LLMs with real-world MCP servers, in: International Conference on Learning Representations (ICLR), 2026
- [5] G. Apruzzese, SoK: Reshaping research on network intrusion detection systems, arXiv preprint arXiv:2604.17556 (2026)
- [10] L. Advani, Trajectory Guard: a lightweight, sequence-aware model for real-time anomaly detection in agentic AI, arXiv preprint arXiv:2601.00516 (2026)
- [13] W. W. Lo, S. Layeghy, M. Sarhan, M. Gallagher, M. Portmann, E-GraphSAGE: A graph neural network based intrusion detection system for IoT, in: NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium, IEEE, 2022
- [15] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019, pp. 3982-3992
- [16] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, in: International Conference on Learning Representations, 2018
- [17] M. Fey, J. E. Lenssen, Fast graph representation learning with PyTorch Geometric, in: ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019
- [19] Y. Li, H. Luo, Y. Xie, Y. Fu, Z. Yang, S. Shao, Q. Ren, W. Qu, Y. Fu, Y. Yang, J. Shao, X. Hu, D. Liu, ATBench: A diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis, arXiv preprint arXiv:2604.02022 (2026)
discussion (0)