The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

2025 · cs.LG · arXiv 2505.20435

1 Pith paper cites this work. Polarity classification is still being indexed.


abstract

Existing interpretability methods for Large Language Models (LLMs) predominantly capture linear directions or isolated features. This overlooks the high-dimensional, relational, and nonlinear geometry of model representations. We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of the internal representation spaces of LLMs, a phenomenon that remains poorly understood, especially across operationally different attack modes. We analyze six models (3.8B to 70B parameters) under two distinct attacks, indirect prompt injection and backdoor fine-tuning, and show that a consistent topological signature persists throughout. Adversarial inputs induce topological compression: the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, our framework reveals geometric invariants of representational change that complement existing linear interpretability methods.
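
To make the idea of quantifying the shape of a point cloud concrete, the following is a minimal sketch of the simplest case of persistent homology: dimension 0 (connected components) of a Vietoris-Rips filtration. It is not the authors' pipeline. The function name `h0_death_times` and the toy `cloud` are hypothetical, and real activation point clouds would be far higher-dimensional. The sketch relies on the standard fact that the scales at which components merge are the edge lengths of a minimum spanning tree.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Return the finite H0 death scales of a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0. A component dies
    when an edge first merges it into another component. These merge scales
    are the edge lengths chosen by Kruskal's minimum spanning tree algorithm.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Find the component root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge at this scale
            deaths.append(length)
    # n - 1 finite deaths; the final component persists forever.
    return deaths


# Hypothetical toy "activations": two tight clusters that are far apart.
cloud = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
deaths = h0_death_times(cloud)
print(deaths)
```

On this toy cloud the death scales are `[1.0, 1.0, 1.0, 9.0]`: three short-lived components within the clusters and one long-lived component that dies only when the two clusters merge. Higher-dimensional features (loops, voids) need a full persistence computation, for which a dedicated library would normally be used.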

representative citing papers

Patnaik-Pearson intrinsic dimension for internal representations of neural networks

math.ST · 2026-06-17 · unverdicted · novelty 6.0 · 2 refs

Introduces the Patnaik-Pearson intrinsic dimension estimator, proves some of its properties, relates it to HTSR/SETOL for Pareto spectra, and applies it to track embedding dimension evolution in BERT-base and DeepSeek-R1-Distill-Qwen-1.

