When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Pith reviewed 2026-05-09 18:31 UTC · model grok-4.3
The pith
Embedding-based defenses in LLM multi-agent systems fail when attackers craft messages whose embeddings lie close to benign ones, but token confidence scores provide a workable alternative for pruning suspicious messages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embedding-based defenses for detecting malicious agents in LLM-powered multi-agent systems lose effectiveness because they require clear separation in text embeddings between malicious and benign messages, a separation that attackers can eliminate by crafting messages whose embeddings sit close to benign ones; token-level confidence signals such as logits remain informative even when embeddings no longer separate the classes and can therefore be used to prune or down-weight suspect messages.
What carries the argument
Token-level confidence scores from model logits, applied to prune or down-weight messages during multi-agent communication.
If this is right
- Robustness increases across models, data sets, and communication topologies when confidence scores guide pruning.
- The protective effect of confidence scores declines over successive communication rounds, making early intervention necessary.
- Safety designs for multi-agent systems should move beyond sole reliance on embedding similarity to include internal model signals.
Where Pith is reading between the lines
- A combined defense that checks both embeddings and confidence scores might catch more attacks than either alone.
- The same confidence signal could be tested in single-agent settings where similar message manipulation occurs.
- The observed decay over rounds suggests measuring how many communication steps are needed before the signal becomes unusable.
Load-bearing premise
Token-level confidence signals such as logits remain informative and separable when text embeddings are no longer distinguishable under attack.
What would settle it
An experiment in which confidence-based pruning produces no robustness gain or in which confidence scores become as inseparable as embeddings under the three described attacks.
Figures
read the original abstract
Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks, Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. This insights can inform and inspire future work on MAS attacks and defenses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that embedding-based defenses in LLM-based multi-agent systems (MAS) are vulnerable because attackers can craft malicious messages whose text embeddings lie close to those of benign messages, thereby evading detection and pruning. It theoretically analyzes this failure mode and empirically validates it with three attacks (Slow Drift, Benign Wrapper, and Chaos Seeding). The paper further shows that token-level confidence signals such as logits can remain separable even when embeddings are not, and proposes using these scores to prune or down-weight messages during communication. Experiments demonstrate improved robustness across models, datasets, and topologies, while noting that the utility of confidence signals decays over communication rounds.
Significance. If the results hold, the work is significant for MAS safety research: it identifies a concrete limitation of purely embedding-based defenses and supplies a practical, complementary signal (token-level logits) that can be integrated into existing pipelines. The theoretical framing plus the multi-model, multi-topology experiments provide a useful template for future defense design, and the decay observation supplies a concrete recommendation for early intervention.
major comments (2)
- §5 (Experiments and Evaluation): the abstract states that experiments show improved robustness across models, datasets, and topologies, yet no mention is made of the number of independent runs, error bars, or statistical significance tests; without these the cross-condition claims rest on point estimates whose reliability cannot be assessed.
- §4 (Attack Construction): the three attacks are introduced at a high level; the manuscript should supply the precise prompt templates, optimization objectives, or hyper-parameters used to align malicious embeddings with benign ones so that the evasion results are reproducible by other researchers.
minor comments (2)
- Figure captions for the decay-over-rounds plots should explicitly state the communication topology and model used in each panel.
- Notation for confidence scores (e.g., whether raw logits, softmax probabilities, or normalized values) should be defined once in §3 and used consistently thereafter.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive assessment of our work. We address each major comment below and will incorporate the requested details in the revised manuscript.
read point-by-point responses
-
Referee: §5 (Experiments and Evaluation): the abstract states that experiments show improved robustness across models, datasets, and topologies, yet no mention is made of the number of independent runs, error bars, or statistical significance tests; without these the cross-condition claims rest on point estimates whose reliability cannot be assessed.
Authors: We agree that reporting the number of independent runs, error bars, and statistical significance tests is essential for evaluating the reliability of our cross-condition claims. The current version presents point estimates without these details. In the revised manuscript, we will explicitly state that all experiments were repeated over 5 independent runs using different random seeds, add error bars (standard error) to the figures in §5, and include statistical significance tests (e.g., paired t-tests with p-values) comparing the proposed confidence-based pruning against embedding-only baselines. These additions will be made to both the text and figures. revision: yes
-
Referee: §4 (Attack Construction): the three attacks are introduced at a high level; the manuscript should supply the precise prompt templates, optimization objectives, or hyper-parameters used to align malicious embeddings with benign ones so that the evasion results are reproducible by other researchers.
Authors: We acknowledge that the attack descriptions in §4 are currently at a conceptual level. To ensure full reproducibility, we will expand §4 (and add an appendix if needed) with the exact prompt templates used for each attack (Slow Drift, Benign Wrapper, and Chaos Seeding), the optimization objectives (e.g., the specific loss functions minimizing cosine distance between malicious and benign embeddings), and all hyper-parameters including embedding model, learning rate, number of optimization iterations, batch size, and temperature settings. This will allow other researchers to replicate the evasion results exactly. revision: yes
Circularity Check
No significant circularity
full rationale
The paper derives its central claim from a sequence of explicit attacks (Slow Drift, Benign Wrapper, Chaos Seeding) that are defined and validated independently of the proposed defense, followed by an empirical observation that token-level logits remain separable when embeddings are not, and then a straightforward experimental validation of confidence-based pruning. No equation or premise reduces to a self-definition, a fitted parameter relabeled as a prediction, or a load-bearing self-citation chain. The argument is self-contained against the stated attacks and results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Embedding-based defenses rely on clear separation between malicious and benign message embeddings in vector space.
Forward citations
Cited by 1 Pith paper
-
When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration
KV-cache sharing boosts multi-agent QA performance but enables undetectable tampering; HMAC manifests binding agent, session, and payload reliably detect changes.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.