First adversarially robust data structure for c-approximate furthest neighbor search with query time matching the best known oblivious results for many parameter regimes.
super hub Canonical reference
Explaining and Harnessing Adversarial Examples
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving
authors
co-cited works
representative citing papers
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.
Facial reflections in video conferencing feeds can be processed to eavesdrop on on-screen application activities at 99.32% accuracy across real devices and environments.
FPR manipulation attack perturbs benign MQTT packets to flip labels to attacks in NIDS with 80-100% success, increasing SOC delays without gradient-based methods.
Introduces first PPH for ℓ1-distance predicate with O(t²) runtime to force significant noise in adversarial image attacks.
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
Dataset distillation creates a tiny synthetic training set that, when used with a fixed network initialization, produces models whose performance approximates that of models trained on the full original dataset.
A^4D detects adversarial attacks in an attack- and classifier-agnostic way by measuring non-arbitrary shifts in CLIP embedding space from prompt-based similarity scores.
Vision models show a phase transition where accuracy on a zero syntactic-distance global-semantics task collapses to chance beyond a critical scale and does not recover with larger models or data.
MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.
GOMA achieves optimal last-iterate O(1/k²) convergence in deterministic monotone Lipschitz VIs and O(1/√k) in stochastic unbounded-variance settings without variance reduction.
RogueMerge is a unified attack method that jointly optimizes task vectors to succeed after merging, using stochastic min-max simulation for unknown merging settings and a Taylor-approximated DRO for prompt generalization on generative LLMs.
RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.
Concept-level adversarial attacks exploit CBM interpretability on the CUB dataset, but SPECTRA raises required perturbation norm from 0.46 to over 4200 while keeping accuracy loss under 2.2%.
Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
IBAL framework constructs information-theoretic adversarial attacks on agent observations and actions to train MARL agents that remain robust to interaction disruptions and agent-missing scenarios.
MARS is a transfer-based black-box attack that uses bi-level optimization on semantic and artifact anchors to escape the linearity trap and improve attack success rates on SSL-SVDD by up to 36%.
AIM is a new saliency-guided adversarial feature replacement method to evaluate faithfulness of saliency maps and reliability of masking operators on image, audio, and EEG tasks.
α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.
QLL is a novel logic for neuro-symbolic learning that uses ML-native operations (sum, log-sum-exp) on logits to embed constraints, satisfying most linear logic properties and showing stronger correlation between empirical robustness and formal verification than prior approaches.
Presents first online L2D algorithm for multiclass classification with bandit feedback and varying experts, achieving O((n+n_e)T^{2/3}) regret generally and O((n+n_e)√T) under low noise.
A neurosymbolic model augments Swin Transformers with focal sets and fuzzy logic to produce calibrated hierarchical image classifications that respect logical constraints.
TARO builds a temporally guided score prior from high-noise and low-noise diffusion views to purify adversarial examples more robustly than uniform timestep methods.
citing papers explorer
-
Adversarially Robust Approximate Furthest Neighbor
First adversarially robust data structure for c-approximate furthest neighbor search with query time matching the best known oblivious results for many parameter regimes.
-
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
-
Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle
Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.
-
Turn Your Face Into An Attack Surface: Screen Attack Using Facial Reflections in Video Conferencing
Facial reflections in video conferencing feeds can be processed to eavesdrop on on-screen application activities at 99.32% accuracy across real devices and environments.
-
Uncovering and Understanding FPR Manipulation Attack in Industrial IoT Networks
FPR manipulation attack perturbs benign MQTT packets to flip labels to attacks in NIDS with 80-100% success, increasing SOC delays without gradient-based methods.
-
Property-Preserving Hashing for $\ell_1$-Distance Predicates: Applications to Countering Adversarial Input Attacks
Introduces first PPH for ℓ1-distance predicate with O(t²) runtime to force significant noise in adversarial image attacks.
-
Universal and Transferable Adversarial Attacks on Aligned Language Models
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
-
Dataset Distillation
Dataset distillation creates a tiny synthetic training set that, when used with a fixed network initialization, produces models whose performance approximates that of models trained on the full original dataset.
-
A Classifier-Agnostic Zero-Shot Adversarial Attack Detection via CLIP
A^4D detects adversarial attacks in an attack- and classifier-agnostic way by measuring non-arbitrary shifts in CLIP embedding space from prompt-based similarity scores.
-
Can Machines Really See Objects in Images? A Study Based on Syntactic Distance and Visual Self-Referential Instances
Vision models show a phase transition where accuracy on a zero syntactic-distance global-semantics task collapses to chance beyond a critical scale and does not recover with larger models or data.
-
MIRAGE: Protecting against Malicious Image Editing via False Moderation
MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.
-
Accelerated and Stable Convergence with Anchored Optimistic Method
GOMA achieves optimal last-iterate O(1/k²) convergence in deterministic monotone Lipschitz VIs and O(1/√k) in stochastic unbounded-variance settings without variance reduction.
-
RogueMerge: Robust and Unified Attacks against LLM Model Merging
RogueMerge is a unified attack method that jointly optimizes task vectors to succeed after merging, using stochastic min-max simulation for unknown merging settings and a Taylor-approximated DRO for prompt generalization on generative LLMs.
-
RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations
RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.
-
When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers
Concept-level adversarial attacks exploit CBM interpretability on the CUB dataset, but SPECTRA raises required perturbation norm from 0.46 to over 4200 while keeping accuracy loss under 2.2%.
-
Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach
Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
-
Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning
IBAL framework constructs information-theoretic adversarial attacks on agent observations and actions to train MARL agents that remain robust to interaction disruptions and agent-missing scenarios.
-
Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection
MARS is a transfer-based black-box attack that uses bi-level optimization on semantic and artifact anchors to escape the linearity trap and improve attack success rates on SSL-SVDD by up to 36%.
-
AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps
AIM is a new saliency-guided adversarial feature replacement method to evaluate faithfulness of saliency maps and reliability of masking operators on image, audio, and EEG tasks.
-
$\alpha$-TCAV: A Unified Framework for Testing with Concept Activation Vectors
α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.
-
Quantitative Linear Logic for Neuro-Symbolic Learning and Verification
QLL is a novel logic for neuro-symbolic learning that uses ML-native operations (sum, log-sum-exp) on logits to embed constraints, satisfying most linear logic properties and showing stronger correlation between empirical robustness and formal verification than prior approaches.
-
Online Learning-to-Defer with Varying Experts
Presents first online L2D algorithm for multiclass classification with bandit feedback and varying experts, achieving O((n+n_e)T^{2/3}) regret generally and O((n+n_e)√T) under low noise.
-
A neurosymbolic Approach with Epistemic Deep Learning for Hierarchical Image Classification
A neurosymbolic model augments Swin Transformers with focal sets and fuzzy logic to produce calibrated hierarchical image classifications that respect logical constraints.
-
TARO: Temporal Adversarial Rectification Optimization Using Diffusion Models as Purifiers
TARO builds a temporally guided score prior from high-noise and low-noise diffusion views to purify adversarial examples more robustly than uniform timestep methods.
-
Inference Time Causal Probing in LLMs
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
-
Streaming Adversarial Robustness in Fuzzy ARTMAP: Mechanism-Aligned Evaluation, Progressive Training, and Interpretable Diagnostics
Fuzzy ARTMAP models are highly vulnerable to a new white-box attack aligned with their category competition, but progressive selective training yields stronger replay-free robustness than offline adversarial training under adaptive evaluation.
-
Empirical Evidence for Simply Connected Decision Regions in Image Classifiers
Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.
-
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
TAGO performs sparse jailbreak optimization on audio LMs by retaining only high-gradient-energy tokens, preserving near-full ASR at 25% retention across three models.
-
Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference
MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.
-
Decision Boundary-aware Generation for Long-tailed Learning
DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail class accuracy and more separable decision spaces.
-
Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks
QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.
-
From Local to Global to Mechanistic: An iERF-Centered Unified Framework for Interpreting Vision Models
An iERF-centric framework unifies local, global, and mechanistic interpretability in vision models via SRD for saliency, CAFE for concept anchoring, and ICAT for interlayer attribution.
-
Low Rank Adaptation for Adversarial Perturbation
Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Benign Overfitting in Adversarial Training for Vision Transformers
Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
Duality for the Adversarial Total Variation
Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.
-
Feature-level analysis and adversarial transfer in rotationally equivariant quantum machine learning
Rotationally equivariant quantum models can rely on vulnerable invariant statistics such as ring-averaged intensities, leaving them susceptible to classical transfer attacks, but suppressing the associated symmetry sectors substantially improves robustness.
-
Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.
-
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves real LLM jailbreak robustness-utility tradeoff
-
Learning Robustness at Test-Time from a Non-Robust Teacher
A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
-
Efficient Unlearning through Maximizing Relearning Convergence Delay
The Influence Eliminating Unlearning framework maximizes relearning convergence delay via weight decay and noise injection to remove the influence of a forgetting set while preserving accuracy on retained data.
-
On the Decompositionality of Neural Networks
Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.
-
Can Drift-Adaptive Malware Detectors Be Made Robust? Attacks and Defenses Under White-Box and Black-Box Threats
A fine-tuning framework reduces PGD attack success on AdvDA detectors from 100% to 3.2% and MalGuise from 13% to 5.1%, but optimal training strategies differ by threat model and robustness does not transfer across them.
-
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
-
Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
-
A Speculative GLRT-Backed ApproachRobust Deep Learning-Based Array Processing
A speculative DL classifier validated by GLRT on spatially robust second-order statistics provides adversarially resilient array processing.
-
SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
SCOOTER supplies best-practice guidelines, open tools, and a 3K-image benchmark with 34K+ human ratings showing that six tested unrestricted attacks produce images humans can detect as fake.
-
Experimental robustness benchmarking of quantum neural networks on a superconducting quantum processor
Experimental runs on a superconducting quantum processor demonstrate that 20-qubit quantum neural networks are more resistant to adversarial attacks than classical networks, with adversarial training further improving robustness and empirical bounds closely matching theory.
-
LatentStealth: Unnoticeable and Efficient Adversarial Attacks on Expressive Human Pose and Shape Estimation
LatentStealth is a latent-space optimization method that produces imperceptible yet effective adversarial attacks on expressive human pose and shape estimation models using few output queries.