Explaining and Harnessing Adversarial Examples
88 Pith papers cite this work. Polarity classification is still in progress.
abstract
Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
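The "simple and fast method" referenced in the abstract is the fast gradient sign method (FGSM): under the linear view, the max-norm-bounded perturbation that most increases the loss is eta = epsilon * sign(grad_x J(theta, x, y)), because for a linear model with weights w the induced change w^T eta = epsilon * ||w||_1 grows with input dimensionality. Below is a minimal PyTorch sketch of FGSM and of the paper's mixed adversarial training objective; the names `model` and `epsilon`, the cross-entropy loss, and the [0, 1] pixel clamp are illustrative assumptions, not the authors' original code.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    # eta = epsilon * sign(grad_x J(theta, x, y)): one gradient, one step.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)  # model(x) assumed to return logits
    grad, = torch.autograd.grad(loss, x)
    # Clamp keeps the perturbed input a valid image (assumed [0, 1] pixels).
    return (x + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

def fgsm_adv_loss(model, x, y, epsilon, alpha=0.5):
    # The paper's adversarial training objective:
    #   alpha * J(theta, x, y) + (1 - alpha) * J(theta, x + eta, y)
    x_adv = fgsm(model, x, y, epsilon)
    return (alpha * F.cross_entropy(model(x), y)
            + (1.0 - alpha) * F.cross_entropy(model(x_adv), y))
```

Training on this mixed objective (the paper uses alpha = 0.5, with a maxout network on MNIST) is what produces the reported test-set error reduction.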
roles: background (1)
polarities: background (1)
representative citing papers
- Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) in general and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
- Image-to-3D models successfully generate harmful geometries in most cases, with under 0.3% caught by commercial filters; existing safeguards are weak, but a stacked defense cuts harmful outputs to under 1% at an 11% false-positive cost.
- Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.
- Facial reflections in video-conferencing feeds can be processed to eavesdrop on on-screen application activities at 99.32% accuracy across real devices and environments.
- TARO builds a temporally guided score prior from high-noise and low-noise diffusion views to purify adversarial examples more robustly than uniform-timestep methods.
- HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
- Fuzzy ARTMAP models are highly vulnerable to a new white-box attack aligned with their category competition, but progressive selective training yields stronger replay-free robustness than offline adversarial training under adaptive evaluation.
- Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.
- Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
- MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.
- DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail-class accuracy and more separable decision spaces.
- QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.
- An iERF-centric framework unifies local, global, and mechanistic interpretability in vision models via SRD for saliency, CAFE for concept anchoring, and ICAT for interlayer attribution.
- Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
- A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Adversarial training on simplified Vision Transformers achieves benign overfitting, with near-zero robust loss and generalization error when the signal-to-noise ratio and perturbation budget meet specific conditions.
- Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.
- Rotationally equivariant quantum models can rely on vulnerable invariant statistics such as ring-averaged intensities, leaving them susceptible to classical transfer attacks, but suppressing the associated symmetry sectors substantially improves robustness.
- FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote-sensing classification.
- Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to the singular values of the embedding matrix, and motivates a new regularizer that improves the real-LLM jailbreak robustness-utility tradeoff.
- A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
- The Influence Eliminating Unlearning framework maximizes relearning convergence delay via weight decay and noise injection to remove the influence of a forgetting set while preserving accuracy on retained data.
citing papers explorer
- Inference Time Causal Probing in LLMs
  HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
- Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
  PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
- Concrete Problems in AI Safety
  The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.
- Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
  An agentic red-teaming system automates the creation of adversarial testing workflows from natural-language goals, unifying ML and generative AI attacks and achieving an 85% success rate on Meta Llama Scout with no custom human code.
- When AI reviews science: Can we trust the referee?
  AI peer-review systems are vulnerable to prompt injections, prestige biases, assertion-strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
- Medical Model Synthesis Architectures: A Case Study
  The MedMSA framework retrieves knowledge via language models, then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.
- Real-Time Evaluation of Autonomous Systems under Adversarial Attacks
  A framework trains and compares MLP, transformer, and GAIL-based trajectory models on real driving data, finding that architectural differences cause large variations in robustness to PGD attacks despite similar nominal accuracy.