When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Pith reviewed 2026-05-09 18:31 UTC · model grok-4.3
The pith
Embedding-based defenses in LLM multi-agent systems fail when attackers craft messages whose embeddings lie close to benign ones, but token-level confidence scores provide a workable alternative for pruning suspicious messages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embedding-based defenses for detecting malicious agents in LLM-powered multi-agent systems lose effectiveness because they require a clear separation in text-embedding space between malicious and benign messages, and attackers can eliminate that separation by crafting messages whose embeddings sit close to benign ones. Token-level confidence signals such as logits remain informative even when embeddings no longer separate the classes, and can therefore be used to prune or down-weight suspect messages.
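To make the failure mode concrete, here is a minimal sketch of the kind of detector being critiqued, with synthetic vectors standing in for real sentence embeddings. The centroid rule, dimensions, and threshold are illustrative assumptions, not the paper's exact defense.

```python
# Minimal sketch of an embedding-based defense and how an adaptive attack
# evades it. Synthetic vectors stand in for real sentence embeddings.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Benign messages cluster around one direction; a naive attack points elsewhere.
benign = rng.normal(loc=1.0, scale=0.1, size=(100, 64))
centroid = benign.mean(axis=0)

naive_attack = rng.normal(loc=-1.0, scale=0.1, size=64)       # far from benign
adaptive_attack = centroid + rng.normal(scale=0.05, size=64)  # crafted to sit close

THRESHOLD = 0.9  # assumed similarity cutoff for "benign"

for name, vec in [("naive", naive_attack), ("adaptive", adaptive_attack)]:
    sim = cosine(vec, centroid)
    print(f"{name}: similarity={sim:.3f}, flagged={sim < THRESHOLD}")
# The adaptive message's similarity is indistinguishable from benign traffic,
# so the embedding detector passes it: the failure mode the paper analyzes.
```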
What carries the argument
Token-level confidence scores from model logits, applied to prune or down-weight messages during multi-agent communication.
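A hedged sketch of what confidence-guided pruning could look like: score each message by the mean log-probability its sender's model assigned to its own tokens, and drop messages below a cutoff. The aggregation rule, threshold, and message structure here are our assumptions; the paper may define and normalize the score differently.

```python
# Sketch of confidence-based pruning from raw logits (assumed scoring rule).
import numpy as np

def message_confidence(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """logits: (seq_len, vocab) pre-softmax scores; token_ids: generated tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)              # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    chosen = log_probs[np.arange(len(token_ids)), token_ids]          # per-token log p
    return float(chosen.mean())                                       # mean log-prob

def prune(messages, threshold=-1.5):
    """Keep only messages whose confidence clears the (assumed) threshold."""
    return [m for m in messages
            if message_confidence(m["logits"], m["tokens"]) >= threshold]
```

In a real deployment the logits would come from the sender model's generation step; a down-weighting variant would scale each message's influence by its confidence instead of filtering outright.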
If this is right
- Robustness increases across models, data sets, and communication topologies when confidence scores guide pruning.
- The protective effect of confidence scores declines over successive communication rounds, making early intervention necessary.
- Safety designs for multi-agent systems should move beyond sole reliance on embedding similarity to include internal model signals.
Where Pith is reading between the lines
- A combined defense that checks both embeddings and confidence scores might catch more attacks than either alone.
- The same confidence signal could be tested in single-agent settings where similar message manipulation occurs.
- The observed decay over rounds suggests measuring how many communication steps are needed before the signal becomes unusable.
Load-bearing premise
Token-level confidence signals such as logits remain informative and separable when text embeddings are no longer distinguishable under attack.
What would settle it
An experiment in which confidence-based pruning produces no robustness gain or in which confidence scores become as inseparable as embeddings under the three described attacks.
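One concrete way to run that settling experiment is to measure the separability of each signal with AUROC. The sketch below uses synthetic scores as stand-ins for real embedding similarities and token confidences; only the comparison logic is the point.

```python
# Separability check: AUROC of benign vs. malicious scores under each signal.
import numpy as np

def auroc(pos: np.ndarray, neg: np.ndarray) -> float:
    """Probability a random benign score exceeds a random malicious one
    (rank-based Mann-Whitney estimate, ties counted as half)."""
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

rng = np.random.default_rng(1)
# Under an adaptive attack, embedding similarity no longer separates (~0.5)...
emb_benign = rng.normal(0.95, 0.02, 500)
emb_malicious = rng.normal(0.95, 0.02, 500)
# ...while confidence may still separate, which is the premise to test.
conf_benign = rng.normal(-0.8, 0.3, 500)
conf_malicious = rng.normal(-1.6, 0.3, 500)

print(f"embedding AUROC:  {auroc(emb_benign, emb_malicious):.2f}")    # ~0.50
print(f"confidence AUROC: {auroc(conf_benign, conf_malicious):.2f}")  # well above 0.5
# If confidence AUROC also collapsed to ~0.5 under the three attacks,
# the load-bearing premise would fail.
```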
Original abstract
Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. These insights can inform and inspire future work on MAS attacks and defenses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that embedding-based defenses in LLM-based multi-agent systems (MAS) are vulnerable because attackers can craft malicious messages whose text embeddings lie close to those of benign messages, thereby evading detection and pruning. It theoretically analyzes this failure mode and empirically validates it with three attacks (Slow Drift, Benign Wrapper, and Chaos Seeding). The paper further shows that token-level confidence signals such as logits can remain separable even when embeddings are not, and proposes using these scores to prune or down-weight messages during communication. Experiments demonstrate improved robustness across models, datasets, and topologies, while noting that the utility of confidence signals decays over communication rounds.
Significance. If the results hold, the work is significant for MAS safety research: it identifies a concrete limitation of purely embedding-based defenses and supplies a practical, complementary signal (token-level logits) that can be integrated into existing pipelines. The theoretical framing plus the multi-model, multi-topology experiments provide a useful template for future defense design, and the decay observation supplies a concrete recommendation for early intervention.
major comments (2)
- §5 (Experiments and Evaluation): the abstract states that experiments show improved robustness across models, datasets, and topologies, yet no mention is made of the number of independent runs, error bars, or statistical significance tests; without these the cross-condition claims rest on point estimates whose reliability cannot be assessed.
- §4 (Attack Construction): the three attacks are introduced at a high level; the manuscript should supply the precise prompt templates, optimization objectives, or hyper-parameters used to align malicious embeddings with benign ones so that the evasion results are reproducible by other researchers.
minor comments (2)
- Figure captions for the decay-over-rounds plots should explicitly state the communication topology and model used in each panel.
- Notation for confidence scores (e.g., whether raw logits, softmax probabilities, or normalized values) should be defined once in §3 and used consistently thereafter.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive assessment of our work. We address each major comment below and will incorporate the requested details in the revised manuscript.
Point-by-point responses
-
Referee: §5 (Experiments and Evaluation): the abstract states that experiments show improved robustness across models, datasets, and topologies, yet no mention is made of the number of independent runs, error bars, or statistical significance tests; without these the cross-condition claims rest on point estimates whose reliability cannot be assessed.
Authors: We agree that reporting the number of independent runs, error bars, and statistical significance tests is essential for evaluating the reliability of our cross-condition claims. The current version presents point estimates without these details. In the revised manuscript, we will explicitly state that all experiments were repeated over 5 independent runs using different random seeds, add error bars (standard error) to the figures in §5, and include statistical significance tests (e.g., paired t-tests with p-values) comparing the proposed confidence-based pruning against embedding-only baselines; a minimal sketch of such a paired test appears after these responses. These additions will be made to both the text and figures. revision: yes
-
Referee: §4 (Attack Construction): the three attacks are introduced at a high level; the manuscript should supply the precise prompt templates, optimization objectives, or hyper-parameters used to align malicious embeddings with benign ones so that the evasion results are reproducible by other researchers.
Authors: We acknowledge that the attack descriptions in §4 are currently at a conceptual level. To ensure full reproducibility, we will expand §4 (and add an appendix if needed) with the exact prompt templates used for each attack (Slow Drift, Benign Wrapper, and Chaos Seeding), the optimization objectives (e.g., the specific loss functions minimizing cosine distance between malicious and benign embeddings), and all hyper-parameters including embedding model, learning rate, number of optimization iterations, batch size, and temperature settings. This will allow other researchers to replicate the evasion results exactly; an illustrative form of such an alignment objective is sketched below. revision: yes
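For the first response, a minimal sketch of the paired significance test the authors promise, assuming one accuracy value per seed for each method. The five values per arm are placeholders, not reported results.

```python
# Paired t-test over per-seed accuracies: confidence defense vs. baseline.
import numpy as np
from scipy.stats import ttest_rel

confidence_defense = np.array([0.78, 0.81, 0.79, 0.80, 0.77])  # hypothetical
embedding_baseline = np.array([0.62, 0.66, 0.60, 0.64, 0.63])  # hypothetical

stat, p = ttest_rel(confidence_defense, embedding_baseline)
print(f"t = {stat:.2f}, p = {p:.4f}")  # pairing by seed controls shared variance
```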
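For the second response, an illustrative form of an embedding-alignment objective: minimize the cosine distance between a malicious message's embedding and a benign anchor. A toy linear encoder and a continuous relaxation of the message stand in for the real sentence embedder and discrete tokens; none of these choices are the paper's hyper-parameters.

```python
# Toy embedding-alignment attack: drive an input's embedding toward a benign
# anchor by gradient descent on cosine distance.
import torch

torch.manual_seed(0)
embed = torch.nn.Linear(128, 64, bias=False)   # stand-in sentence encoder
for p in embed.parameters():
    p.requires_grad_(False)                     # attacker only queries the encoder

benign_anchor = embed(torch.randn(128)).detach()
x = torch.randn(128, requires_grad=True)        # continuous relaxation of the message
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    cos = torch.nn.functional.cosine_similarity(embed(x), benign_anchor, dim=0)
    loss = 1.0 - cos                            # minimize cosine distance to benign
    loss.backward()
    opt.step()

print(f"final cosine similarity to benign anchor: {cos.item():.3f}")  # approaches 1
```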
Circularity Check
No significant circularity
full rationale
The paper derives its central claim from a sequence of explicit attacks (Slow Drift, Benign Wrapper, Chaos Seeding) that are defined and validated independently of the proposed defense, followed by an empirical observation that token-level logits remain separable when embeddings are not, and then a straightforward experimental validation of confidence-based pruning. No equation or premise reduces to a self-definition, a fitted parameter relabeled as a prediction, or a load-bearing self-citation chain. The argument is self-contained against the stated attacks and results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Embedding-based defenses rely on a clear separation between malicious and benign message embeddings in vector space.
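One hedged way to state this assumption formally, in notation of our own choosing (f is the text embedder, B the benign messages, M the malicious ones):

```latex
% Assumed formalization (our notation, not the paper's): the defense needs a
% margin gamma separating benign and malicious embeddings.
\[
\exists\, \gamma > 0 \;\; \text{s.t.} \;\;
\min_{m \in \mathcal{M},\, b \in \mathcal{B}}
\big\| f(m) - f(b) \big\|_2 \;\geq\; \gamma .
\]
% The three attacks drive this minimum toward zero, voiding the assumption.
```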