When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Pith reviewed 2026-05-09 18:31 UTC · model grok-4.3
The pith
Embedding-based defenses in LLM multi-agent systems fail when attackers craft messages whose embeddings lie close to benign ones, but token-level confidence scores provide a workable alternative for pruning suspicious messages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embedding-based defenses for detecting malicious agents in LLM-powered multi-agent systems lose effectiveness because they require a clear separation in text-embedding space between malicious and benign messages, and attackers can eliminate that separation by crafting messages whose embeddings sit close to benign ones. Token-level confidence signals such as logits remain informative even when embeddings no longer separate the classes, and can therefore be used to prune or down-weight suspect messages.
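To make the failure mode concrete, here is a minimal sketch of the kind of detector being critiqued, with synthetic vectors standing in for real sentence embeddings. The centroid rule, dimensions, and threshold are illustrative assumptions, not the paper's exact defense.

```python
# Minimal sketch of an embedding-based defense and how an adaptive attack
# evades it. Synthetic vectors stand in for real sentence embeddings.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Benign messages cluster around one direction; a naive attack points elsewhere.
benign = rng.normal(loc=1.0, scale=0.1, size=(100, 64))
centroid = benign.mean(axis=0)

naive_attack = rng.normal(loc=-1.0, scale=0.1, size=64)       # far from benign
adaptive_attack = centroid + rng.normal(scale=0.05, size=64)  # crafted to sit close

THRESHOLD = 0.9  # assumed similarity cutoff for "benign"

for name, vec in [("naive", naive_attack), ("adaptive", adaptive_attack)]:
    sim = cosine(vec, centroid)
    print(f"{name}: similarity={sim:.3f}, flagged={sim < THRESHOLD}")
# The adaptive message's similarity is indistinguishable from benign traffic,
# so the embedding detector passes it: the failure mode the paper analyzes.
```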
What carries the argument
Token-level confidence scores from model logits, applied to prune or down-weight messages during multi-agent communication.
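A hedged sketch of what confidence-guided pruning could look like: score each message by the mean log-probability its sender's model assigned to its own tokens, and drop messages below a cutoff. The aggregation rule, threshold, and message structure here are our assumptions; the paper may define and normalize the score differently.

```python
# Sketch of confidence-based pruning from raw logits (assumed scoring rule).
import numpy as np

def message_confidence(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """logits: (seq_len, vocab) pre-softmax scores; token_ids: generated tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)              # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    chosen = log_probs[np.arange(len(token_ids)), token_ids]          # per-token log p
    return float(chosen.mean())                                       # mean log-prob

def prune(messages, threshold=-1.5):
    """Keep only messages whose confidence clears the (assumed) threshold."""
    return [m for m in messages
            if message_confidence(m["logits"], m["tokens"]) >= threshold]
```

In a real deployment the logits would come from the sender model's generation step; a down-weighting variant would scale each message's influence by its confidence instead of filtering outright.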
If this is right
- Robustness increases across models, data sets, and communication topologies when confidence scores guide pruning.
- The protective effect of confidence scores declines over successive communication rounds, making early intervention necessary.
- Safety designs for multi-agent systems should move beyond sole reliance on embedding similarity to include internal model signals.
Where Pith is reading between the lines
- A combined defense that checks both embeddings and confidence scores might catch more attacks than either alone.
- The same confidence signal could be tested in single-agent settings where similar message manipulation occurs.
- The observed decay over rounds suggests measuring how many communication steps are needed before the signal becomes unusable.
Load-bearing premise
Token-level confidence signals such as logits remain informative and separable when text embeddings are no longer distinguishable under attack.
What would settle it
An experiment in which confidence-based pruning produces no robustness gain or in which confidence scores become as inseparable as embeddings under the three described attacks.
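One concrete way to run that settling experiment is to measure the separability of each signal with AUROC. The sketch below uses synthetic scores as stand-ins for real embedding similarities and token confidences; only the comparison logic is the point.

```python
# Separability check: AUROC of benign vs. malicious scores under each signal.
import numpy as np

def auroc(pos: np.ndarray, neg: np.ndarray) -> float:
    """Probability a random benign score exceeds a random malicious one
    (rank-based Mann-Whitney estimate, ties counted as half)."""
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

rng = np.random.default_rng(1)
# Under an adaptive attack, embedding similarity no longer separates (~0.5)...
emb_benign = rng.normal(0.95, 0.02, 500)
emb_malicious = rng.normal(0.95, 0.02, 500)
# ...while confidence may still separate, which is the premise to test.
conf_benign = rng.normal(-0.8, 0.3, 500)
conf_malicious = rng.normal(-1.6, 0.3, 500)

print(f"embedding AUROC:  {auroc(emb_benign, emb_malicious):.2f}")    # ~0.50
print(f"confidence AUROC: {auroc(conf_benign, conf_malicious):.2f}")  # well above 0.5
# If confidence AUROC also collapsed to ~0.5 under the three attacks,
# the load-bearing premise would fail.
```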
Original abstract
Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. These insights can inform and inspire future work on MAS attacks and defenses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that embedding-based defenses in LLM-based multi-agent systems (MAS) are vulnerable because attackers can craft malicious messages whose text embeddings lie close to those of benign messages, thereby evading detection and pruning. It theoretically analyzes this failure mode and empirically validates it with three attacks (Slow Drift, Benign Wrapper, and Chaos Seeding). The paper further shows that token-level confidence signals such as logits can remain separable even when embeddings are not, and proposes using these scores to prune or down-weight messages during communication. Experiments demonstrate improved robustness across models, datasets, and topologies, while noting that the utility of confidence signals decays over communication rounds.
Significance. If the results hold, the work is significant for MAS safety research: it identifies a concrete limitation of purely embedding-based defenses and supplies a practical, complementary signal (token-level logits) that can be integrated into existing pipelines. The theoretical framing plus the multi-model, multi-topology experiments provide a useful template for future defense design, and the decay observation supplies a concrete recommendation for early intervention.
major comments (2)
- §5 (Experiments and Evaluation): the abstract states that experiments show improved robustness across models, datasets, and topologies, yet no mention is made of the number of independent runs, error bars, or statistical significance tests; without these the cross-condition claims rest on point estimates whose reliability cannot be assessed.
- §4 (Attack Construction): the three attacks are introduced at a high level; the manuscript should supply the precise prompt templates, optimization objectives, or hyper-parameters used to align malicious embeddings with benign ones so that the evasion results are reproducible by other researchers.
minor comments (2)
- Figure captions for the decay-over-rounds plots should explicitly state the communication topology and model used in each panel.
- Notation for confidence scores (e.g., whether raw logits, softmax probabilities, or normalized values) should be defined once in §3 and used consistently thereafter.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive assessment of our work. We address each major comment below and will incorporate the requested details in the revised manuscript.
Point-by-point responses
-
Referee: §5 (Experiments and Evaluation): the abstract states that experiments show improved robustness across models, datasets, and topologies, yet no mention is made of the number of independent runs, error bars, or statistical significance tests; without these the cross-condition claims rest on point estimates whose reliability cannot be assessed.
Authors: We agree that reporting the number of independent runs, error bars, and statistical significance tests is essential for evaluating the reliability of our cross-condition claims. The current version presents point estimates without these details. In the revised manuscript, we will explicitly state that all experiments were repeated over 5 independent runs using different random seeds, add error bars (standard error) to the figures in §5, and include statistical significance tests (e.g., paired t-tests with p-values) comparing the proposed confidence-based pruning against embedding-only baselines; a minimal sketch of such a paired test appears after these responses. These additions will be made to both the text and figures. revision: yes
-
Referee: §4 (Attack Construction): the three attacks are introduced at a high level; the manuscript should supply the precise prompt templates, optimization objectives, or hyper-parameters used to align malicious embeddings with benign ones so that the evasion results are reproducible by other researchers.
Authors: We acknowledge that the attack descriptions in §4 are currently at a conceptual level. To ensure full reproducibility, we will expand §4 (and add an appendix if needed) with the exact prompt templates used for each attack (Slow Drift, Benign Wrapper, and Chaos Seeding), the optimization objectives (e.g., the specific loss functions minimizing cosine distance between malicious and benign embeddings), and all hyper-parameters including embedding model, learning rate, number of optimization iterations, batch size, and temperature settings. This will allow other researchers to replicate the evasion results exactly; an illustrative form of such an alignment objective is sketched below. revision: yes
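For the first response, a minimal sketch of the paired significance test the authors promise, assuming one accuracy value per seed for each method. The five values per arm are placeholders, not reported results.

```python
# Paired t-test over per-seed accuracies: confidence defense vs. baseline.
import numpy as np
from scipy.stats import ttest_rel

confidence_defense = np.array([0.78, 0.81, 0.79, 0.80, 0.77])  # hypothetical
embedding_baseline = np.array([0.62, 0.66, 0.60, 0.64, 0.63])  # hypothetical

stat, p = ttest_rel(confidence_defense, embedding_baseline)
print(f"t = {stat:.2f}, p = {p:.4f}")  # pairing by seed controls shared variance
```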
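For the second response, an illustrative form of an embedding-alignment objective: minimize the cosine distance between a malicious message's embedding and a benign anchor. A toy linear encoder and a continuous relaxation of the message stand in for the real sentence embedder and discrete tokens; none of these choices are the paper's hyper-parameters.

```python
# Toy embedding-alignment attack: drive an input's embedding toward a benign
# anchor by gradient descent on cosine distance.
import torch

torch.manual_seed(0)
embed = torch.nn.Linear(128, 64, bias=False)   # stand-in sentence encoder
for p in embed.parameters():
    p.requires_grad_(False)                     # attacker only queries the encoder

benign_anchor = embed(torch.randn(128)).detach()
x = torch.randn(128, requires_grad=True)        # continuous relaxation of the message
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    cos = torch.nn.functional.cosine_similarity(embed(x), benign_anchor, dim=0)
    loss = 1.0 - cos                            # minimize cosine distance to benign
    loss.backward()
    opt.step()

print(f"final cosine similarity to benign anchor: {cos.item():.3f}")  # approaches 1
```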
Circularity Check
No significant circularity
full rationale
The paper derives its central claim from a sequence of explicit attacks (Slow Drift, Benign Wrapper, Chaos Seeding) that are defined and validated independently of the proposed defense, followed by an empirical observation that token-level logits remain separable when embeddings are not, and then a straightforward experimental validation of confidence-based pruning. No equation or premise reduces to a self-definition, a fitted parameter relabeled as a prediction, or a load-bearing self-citation chain. The argument is self-contained against the stated attacks and results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Embedding-based defenses rely on a clear separation between malicious and benign message embeddings in vector space.
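One hedged way to state this assumption formally, in notation of our own choosing (f is the text embedder, B the benign messages, M the malicious ones):

```latex
% Assumed formalization (our notation, not the paper's): the defense needs a
% margin gamma separating benign and malicious embeddings.
\[
\exists\, \gamma > 0 \;\; \text{s.t.} \;\;
\min_{m \in \mathcal{M},\, b \in \mathcal{B}}
\big\| f(m) - f(b) \big\|_2 \;\geq\; \gamma .
\]
% The three attacks drive this minimum toward zero, voiding the assumption.
```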