
arXiv: 2605.28214 · v1 · pith:Y43O73ZEnew · submitted 2026-05-27 · 💻 cs.CR · cs.LG · cs.MA

Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems

Pith reviewed 2026-06-29 12:01 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.MA
keywords latent attacks · multi-agent systems · KV-cache handoffs · hidden states · adversarial robustness · latent space

The pith

Latent attacks degrade multi-agent performance even in clean executions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether hidden representations in latent-based multi-agent systems can carry attack effects that activate during normal operation without any adversarial text present. It proposes a latent attack framework that intervenes directly on hidden states to reactivate prior attack impacts. Experiments show clear task performance drops from these latent-only attacks, with greater impact when targeting inter-agent KV-cache handoffs than local hidden states. Control tests rule out explanations based on arbitrary noise or invalid outputs. The results indicate that latent coordination moves attack surfaces into less visible parts of execution.

Core claim

Latent-only attacks, which reactivate attack-induced effects through interventions on hidden representations without reusing adversarial text, substantially degrade task performance in clean executions of latent-based multi-agent systems, with stronger effects when applied to inter-agent KV-cache handoffs rather than local hidden states.

What carries the argument

Latent attack framework that reactivates attack-induced effects through targeted latent interventions on hidden states and KV-cache handoffs
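
As a rough illustration of what a targeted latent intervention could look like, the sketch below derives a direction from paired clean and attacked hidden states and adds it to a clean state. It is a toy under stated assumptions, not the paper's method: it uses a plain mean difference rather than the PCA carriers mentioned in the Figure 7 caption, the vectors are made-up stand-ins for hidden states, and the names `mean_difference_direction`, `inject`, and `alpha` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_difference_direction(clean: List[Vector], attacked: List[Vector]) -> Vector:
    """Average of (attacked - clean) over paired hidden states (hypothetical carrier)."""
    if len(clean) != len(attacked) or not clean:
        raise ValueError("need a non-empty list of paired states")
    dim = len(clean[0])
    direction = [0.0] * dim
    for c, a in zip(clean, attacked):
        for i in range(dim):
            direction[i] += (a[i] - c[i]) / len(clean)
    return direction


def inject(state: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled direction to a clean state; no adversarial text is involved."""
    return [s + alpha * d for s, d in zip(state, direction)]


# Toy paired states standing in for clean-correct and direct-attack-wrong runs.
clean_states = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
attacked_states = [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

carrier = mean_difference_direction(clean_states, attacked_states)
patched = inject([0.5, 0.5, 0.5], carrier, alpha=2.0)
```

In a real system the same addition would presumably be applied either to one agent's local hidden state or to the KV-cache entries handed to the next agent; the sketch does not model that distinction.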

Load-bearing premise

The observed performance degradation stems specifically from reactivating attack effects rather than from any generic disruption to the latent representations.

What would settle it

Apply random perturbations of similar magnitude to the same KV-cache handoffs and local states in clean runs; if performance degrades to the same degree as the attack-derived interventions, the claim that effects are reactivation-specific would not hold.
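
The control described above can be sketched as a norm-matched random perturbation. This is a minimal illustration, assuming "similar magnitude" means equal L2 norm and that latent states can be treated as flat vectors; `norm_matched_noise`, `perturb`, and the toy values are hypothetical and not taken from the paper.

```python
import math
import random
from typing import List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def norm_matched_noise(reference: Vector, rng: random.Random) -> Vector:
    """Random Gaussian direction rescaled to the L2 norm of `reference`."""
    noise = [rng.gauss(0.0, 1.0) for _ in reference]
    noise_norm = l2_norm(noise)
    if noise_norm == 0.0:
        raise ValueError("degenerate noise sample")
    scale = l2_norm(reference) / noise_norm
    return [scale * n for n in noise]


def perturb(state: Vector, delta: Vector) -> Vector:
    return [s + d for s, d in zip(state, delta)]


rng = random.Random(0)
attack_delta = [3.0, 0.0, -4.0]  # hypothetical attack-derived intervention
control_delta = norm_matched_noise(attack_delta, rng)

clean_state = [0.1, 0.2, 0.3]  # stand-in for a clean latent state
attacked_run_state = perturb(clean_state, attack_delta)
control_run_state = perturb(clean_state, control_delta)
```

Task accuracy would then be compared between runs using `attacked_run_state` and runs using `control_run_state` at the same intervention site. Comparable degradation in both arms would count against the reactivation-specific reading.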

Figures

Figures reproduced from arXiv: 2605.28214 by Chenxi Wang, Jiayan Sun, Lei Wei, Ruiyang Huang, Yifan Wu.

Figure 1: Attack surfaces in text-based and latent-based ...
Figure 2: Overview of our latent attack pipeline. Paired clean-correct and direct-attack-wrong executions are used ...
Figure 3: Node-versus-edge vulnerability patterns of ...
Figure 5: Accuracy drop across Transformer layers.
Figure 6: Invalid output rate versus accuracy change
Figure 7: Held-out transfer of PCA latent attack carriers
Original abstract

Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent-only attacks can substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent-based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible-text inspection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript introduces a latent attack framework for latent-based multi-agent systems that replaces explicit communication with hidden representations. It claims that latent states can carry attack-associated information effective during clean executions, demonstrated via latent interventions that reactivate attack effects without reusing adversarial text. Experiments show substantial task performance degradation, particularly when targeting inter-agent KV-cache handoffs rather than local hidden states, with control analyses indicating the effect cannot be reduced to arbitrary perturbations or invalid generation.

Significance. If the experimental outcomes hold, the work identifies a shifted attack surface in latent-based multi-agent collaboration, showing that moving coordination into latent space does not eliminate but relocates security risks to less observable states. This has implications for safeguards in emerging agent systems. The inclusion of control analyses to isolate reactivation effects from generic disruption is a methodological strength.

minor comments (1)
  1. The abstract would benefit from including specific quantitative results, effect sizes, dataset details, or statistical evidence to convey the magnitude and reliability of the reported performance degradation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript, the recognition of its significance, and the recommendation for minor revision. No specific major comments were provided for us to address.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical investigation of latent-only attacks in multi-agent systems, with its central claim resting on experimental performance degradation under clean executions and control analyses that rule out generic perturbations. No equations, fitted parameters, derivations, or self-citation chains appear in the abstract or description; the argument is supported by direct experimental outcomes rather than any reduction to inputs by construction. The control analyses explicitly address the key assumption about reactivation versus arbitrary disruption, rendering the derivation chain self-contained and independent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework is described at a conceptual level without mathematical or modeling details.

pith-pipeline@v0.9.1-grok · 5701 in / 1068 out tokens · 37461 ms · 2026-06-29T12:01:10.046078+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration

    cs.MA 2026-06 conditional novelty 7.0

    KV-cache sharing boosts multi-agent QA performance but enables undetectable tampering; HMAC manifests binding agent, session, and payload reliably detect changes.
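
    For orientation only, an HMAC manifest of the kind summarized above might look like the following sketch. The field layout, the `|` delimiter, and the names `make_manifest` and `verify_manifest` are assumptions for illustration; the cited paper's actual manifest format is not described here.

```python
import hashlib
import hmac


def make_manifest(key: bytes, agent_id: str, session_id: str, payload: bytes) -> str:
    """HMAC-SHA256 tag binding agent, session, and a digest of the payload.

    Assumes the identifiers never contain the '|' delimiter.
    """
    payload_digest = hashlib.sha256(payload).hexdigest()
    message = "|".join([agent_id, session_id, payload_digest]).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_manifest(key: bytes, agent_id: str, session_id: str,
                    payload: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = make_manifest(key, agent_id, session_id, payload)
    return hmac.compare_digest(expected, tag)


key = b"demo-key-not-for-production"
payload = b"serialized-kv-cache-bytes"  # stand-in for a KV-cache handoff
tag = make_manifest(key, "agent-A", "session-1", payload)

ok = verify_manifest(key, "agent-A", "session-1", payload, tag)
tampered = verify_manifest(key, "agent-A", "session-1", payload + b"\x00", tag)
replayed = verify_manifest(key, "agent-A", "session-2", payload, tag)
```

    A check like this only detects modification after the tag is computed; it says nothing about a payload that was already attack-influenced when it was signed.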
