Toy models demonstrate that polysemanticity arises when neural networks store more sparse features than neurons via superposition, producing a phase transition tied to polytope geometry and increased adversarial vulnerability.
Adversarial robustness as a prior for learned representations
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
High-noise feature drift distinguishes adversarial from clean inputs in CLIP, allowing a plug-in gating mechanism to selectively trigger existing test-time defenses and raise mean clean+adversarial accuracy across 13 datasets.
CNN classifiers work by holographic superposition and destructive interference in pixel space rather than selecting cleaned features, as proven by a new adjoint inversion framework that also yields a covariance-volume channel selection algorithm.
Adversarial distillation improves student robustness when teachers show high uncertainty on robustly unlearnable samples, suppressing noise memorization and allowing reliance on learnable robust signals.
Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.
citing papers explorer
-
Toy Models of Superposition
Toy models demonstrate that polysemanticity arises when neural networks store more sparse features than neurons via superposition, producing a phase transition tied to polytope geometry and increased adversarial vulnerability.
-
Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models
High-noise feature drift distinguishes adversarial from clean inputs in CLIP, allowing a plug-in gating mechanism to selectively trigger existing test-time defenses and raise mean clean+adversarial accuracy across 13 datasets.
-
Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers
CNN classifiers work by holographic superposition and destructive interference in pixel space rather than selecting cleaned features, as proven by a new adjoint inversion framework that also yields a covariance-volume channel selection algorithm.
-
Toward Understanding Adversarial Distillation: Why Robust Teachers Fail
Adversarial distillation improves student robustness when teachers show high uncertainty on robustly unlearnable samples, suppressing noise memorization and allowing reliance on learnable robust signals.
-
Laundering AI Authority with Adversarial Examples
Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.