hub
Explaining and Harnessing Adversarial Examples
88 Pith papers cite this work.
abstract
Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
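As a concrete illustration of the "simple and fast method of generating adversarial examples" mentioned in the abstract, here is a minimal sketch of the fast gradient sign method and the mixed adversarial-training objective, assuming a generic PyTorch classifier; the model, tensors, and hyperparameter values are illustrative placeholders rather than the paper's original maxout/MNIST setup.

```python
# Minimal sketch of the fast gradient sign method and the mixed
# adversarial-training objective described in the abstract, assuming a
# generic PyTorch image classifier. `model`, `images`, `labels`, and the
# epsilon/alpha values are illustrative placeholders.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=0.25):
    """Return inputs shifted by epsilon in the sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    # Under the linear view, epsilon * sign(grad) is the worst-case
    # perturbation within an L-infinity ball of radius epsilon.
    return (images + epsilon * grad.sign()).detach()

def adversarial_training_loss(model, images, labels, epsilon=0.25, alpha=0.5):
    """Blend the clean loss with the loss on FGSM-perturbed copies."""
    adv_images = fgsm_perturb(model, images, labels, epsilon)
    clean_loss = F.cross_entropy(model(images), labels)
    adv_loss = F.cross_entropy(model(adv_images), labels)
    return alpha * clean_loss + (1 - alpha) * adv_loss
```

Because the perturbation is a single gradient-sign step, generating adversarial examples is cheap enough to fold into every training batch, which is what the abstract's adversarial-training result on MNIST relies on.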
hub tools
citation-role summary
citation-polarity summary
claims ledger
co-cited works
roles
background: 1
polarities
background: 1
representative citing papers
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.
Facial reflections in video conferencing feeds can be processed to eavesdrop on on-screen application activities at 99.32% accuracy across real devices and environments.
TARO builds a temporally guided score prior from high-noise and low-noise diffusion views to purify adversarial examples more robustly than uniform timestep methods.
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
Fuzzy ARTMAP models are highly vulnerable to a new white-box attack aligned with their category competition, but progressive selective training yields stronger replay-free robustness than offline adversarial training under adaptive evaluation.
Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.
DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail class accuracy and more separable decision spaces.
QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.
An iERF-centric framework unifies local, global, and mechanistic interpretability in vision models via SRD for saliency, CAFE for concept anchoring, and ICAT for interlayer attribution.
Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.
Rotationally equivariant quantum models can rely on vulnerable invariant statistics such as ring-averaged intensities, leaving them susceptible to classical transfer attacks, but suppressing the associated symmetry sectors substantially improves robustness.
FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.
Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motivates a new regularizer that improves the jailbreak robustness-utility tradeoff on real LLMs.
A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
The Influence Eliminating Unlearning framework maximizes relearning convergence delay via weight decay and noise injection to remove the influence of a forgetting set while preserving accuracy on retained data.
citing papers explorer
-
IPRU: Input-Perturbation-based Radio Frequency Fingerprinting Unlearning for LAWNs
IPRU erases target AAV radio fingerprints via an optimized input perturbation vector, delivering 1.41% unlearning accuracy, 99.41% remaining accuracy, full membership-inference resistance, and 5.79X speedup over retraining.
-
When AI reviews science: Can we trust the referee?
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
-
Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition
GTKA uses adversarial game training to generate privacy-safe sub-queries for external LLMs, then integrates answers locally, reducing intent leakage while preserving answer quality on new biomedical and legal benchmarks.
-
Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts
A broad empirical benchmark shows how 15 existing test selection metrics perform for fault detection, performance estimation, and retraining under corrupted, adversarial, temporal, natural, and label shifts across image, text, and Android data.
-
Ethics Testing: Proactive Identification of Generative AI System Harms
Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.
-
FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods
The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower training cost on CIFAR-10, CIFAR-100, and Tiny-ImageNet.
-
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
An LLM-guided framework simulates physiological trajectories to provide interpretable early warnings for sepsis, achieving AUC scores of 0.861-0.903 on MIMIC-IV and eICU data.
-
When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide
A new reliability score computed from the IoU difference between class-specific and class-agnostic heatmaps, boosted by adversarial enhancement, detects false negatives in binary industrial defect detectors with up to 100% recall.
-
Can AI Detect Life? Lessons from Artificial Life
Artificial life experiments demonstrate that machine learning models for extraterrestrial life detection produce near-100% false positives on out-of-distribution samples, rendering them unreliable.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
-
Detecting Diffusion-generated Images via Dynamic Assembly Forests
DAF is a novel deep forest-based detector for diffusion-generated images that uses fewer parameters and less computation than DNN methods while matching their performance.
-
Quantum Patches: Enhancing Robustness of Quantum Machine Learning Models
Random quantum circuits used as adversarial training data reduce successful attack rates on QML models for CIFAR-10 from 89.8% to 68.45% and for CINIC-10 from 94.23% to 78.68%.
-
SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
-
Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
RIA uses adversarial exploration of counterfactual graph environments via label-invariant augmentations to improve OoD generalization in graph classification tasks.
-
Compression as an Adversarial Amplifier Through Decision Space Reduction
Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same perturbation budget.
-
Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models
Introduces a text-guided backdoor attack using common textual words as triggers and visual perturbations for stealthy, adjustable control on multimodal pretrained models.
-
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
-
Measuring Representation Robustness in Large Language Models for Geometry
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capacity models.
-
Street-Legal Physical-World Adversarial Rim for License Plates
SPAR is a street-legal physical rim that cuts modern ALPR accuracy by 60% and reaches 18% targeted impersonation while costing under $100 and requiring no plate modification.
-
Safety, Security, and Cognitive Risks in World Models
World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs; a minimal sketch of this perturb-and-aggregate scheme appears after this list.
-
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
-
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.
-
Medical Model Synthesis Architectures: A Case Study
The MedMSA framework retrieves knowledge via language models, then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.
-
Laundering AI Authority with Adversarial Examples
Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.
-
Machine Learning Enhanced Laser Spectroscopy for Multi-Species Gas Detection in Complex and Harsh Environments
Machine learning methods including denoising autoencoders, unsupervised interference mitigation, blind source separation, and certifiable classification are developed and experimentally validated to improve multi-species laser spectroscopy under complex conditions.
-
Adversarial Flow Matching for Imperceptible Attacks on End-to-End Autonomous Driving
AFM is a novel gray-box adversarial attack using flow matching to create visually imperceptible perturbations that degrade performance of Vision-Language-Action and modular end-to-end autonomous driving models while showing strong cross-model transferability.
-
UniAda: Universal Adaptive Multi-objective Adversarial Attack for End-to-End Autonomous Driving Systems
UniAda introduces a white-box multi-objective attack using adaptive weighting to generate perturbations that jointly affect steering and speed in E2E ADS, outperforming benchmarks with average deviations of 3.54-29 degrees and 11-22 km/h.
-
Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp robustness gap.
-
QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits
Hybrid quantum-classical models using structured entanglement keep high accuracy on MNIST, OrganAMNIST and CIFAR-10 while lowering adversarial attack success rates and raising the computational cost of generating attacks.
-
Real-Time Evaluation of Autonomous Systems under Adversarial Attacks
A framework trains and compares MLP, transformer, and GAIL-based trajectory models on real driving data, finding that architectural differences cause large variations in robustness to PGD attacks despite similar nominal accuracy.
-
Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models
Perceptual quality metrics correlate strongly with each other but show minimal correlation with attack success rate across medical imaging models and datasets, making ASR alone inadequate for assessing adversarial robustness.
-
Security and Resilience in Autonomous Vehicles: A Proactive Design Approach
Presents an AV Resilient architecture with redundancy, diversity, adaptive reconfiguration, and anomaly- and hash-based intrusion detection, experimentally validated on the Quanser QCar platform for detecting depth camera blinding and perception module tampering.
-
Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection
The proposed steganography-based attribution system with CLIP multimodal fusion achieves robust watermarking under distortions and 0.99 AUC-ROC for harm detection, enabling traceable AI content accountability.
-
Adversarial Robustness Analysis of Cloud-Assisted Autonomous Driving Systems
Adversarial attacks on cloud perception models plus network impairments in a vehicle-cloud loop degrade object detection from 0.73/0.68 to 0.22/0.15 precision/recall and destabilize closed-loop vehicle control.
-
SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions
NTGA is the first clean-label generalization attack under black-box settings but is vulnerable to adversarial training and image transformations, with newer attacks outperforming it.
-
Enhancing Adversarial Robustness in Network Intrusion Detection: A Layer-wise Adaptive Regularization Approach
LARAR enhances adversarial robustness in network intrusion detection by using layer-wise adaptive regularization and auxiliary classifiers, achieving 95.01% clean accuracy and improved defense against FGSM, PGD, and transfer attacks on UNSW-NB15.
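Below is a minimal sketch of the perturb-and-aggregate scheme summarized in the SmoothLLM entry above, under stated assumptions: `query_model` and `looks_jailbroken` are hypothetical stand-ins for the target LLM call and a response check, and the character-level perturbation and majority vote follow the one-line summary rather than the paper's exact implementation.

```python
# Hypothetical sketch of SmoothLLM-style smoothing as summarized above:
# perturb several copies of the prompt at the character level, query the
# model on each, and aggregate by majority vote. `query_model` and
# `looks_jailbroken` are assumed helpers, not part of any real library.
import random
import string
from collections import Counter

def perturb_chars(prompt: str, rate: float = 0.1) -> str:
    """Randomly replace a fraction of characters in the prompt."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters + string.digits + " ")
    return "".join(chars)

def smoothed_response(prompt, query_model, looks_jailbroken, copies=10, rate=0.1):
    """Query perturbed copies and return a response consistent with the majority vote."""
    responses = [query_model(perturb_chars(prompt, rate)) for _ in range(copies)]
    majority = Counter(looks_jailbroken(r) for r in responses).most_common(1)[0][0]
    # Return any response whose jailbreak status agrees with the majority vote.
    for response in responses:
        if looks_jailbroken(response) == majority:
            return response
```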