pith. machine review for the scientific record.

arxiv: 1611.01236 · v2 · submitted 2016-11-04 · 💻 cs.CV · cs.CR · cs.LG · stat.ML


Adversarial Machine Learning at Scale

classification 💻 cs.CV · cs.CR · cs.LG · stat.ML
keywords adversarial · training · attack · examples · model · attacks · methods · models
read the original abstract

Adversarial examples are malicious inputs designed to fool machine learning models. They often transfer from one model to another, allowing attackers to mount black box attacks without knowledge of the target model's parameters. Adversarial training is the process of explicitly training a model on adversarial examples, in order to make it more robust to attack or to reduce its test error on clean inputs. So far, adversarial training has primarily been applied to small problems. In this research, we apply adversarial training to ImageNet. Our contributions include: (1) recommendations for how to successfully scale adversarial training to large models and datasets, (2) the observation that adversarial training confers robustness to single-step attack methods, (3) the finding that multi-step attack methods are somewhat less transferable than single-step attack methods, so single-step attacks are the best for mounting black-box attacks, and (4) resolution of a "label leaking" effect that causes adversarially trained models to perform better on adversarial examples than on clean examples, because the adversarial example construction process uses the true label and the model can learn to exploit regularities in the construction process.
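The single-step attack the abstract refers to is the fast gradient sign method (FGSM): perturb the input by a fixed step in the direction of the sign of the loss gradient with respect to the input, computed using the true label (which is also the source of the "label leaking" effect). A minimal sketch, using a toy logistic-regression classifier rather than the paper's ImageNet models; all model details below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Single-step FGSM attack on a logistic-regression model.

    Moves x by eps in the direction of the sign of the gradient of the
    logistic loss w.r.t. x. Note the true label y is used to build the
    example -- the regularity the paper's "label leaking" effect exploits.
    """
    p = sigmoid(w @ x + b)        # predicted probability of class 1
    grad_x = (p - y) * w          # d(logistic loss)/dx
    return x + eps * np.sign(grad_x)

# Toy example: a point the linear model classifies correctly before the attack.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])          # w @ x + b = 1.5 > 0  ->  class 1
x_adv = fgsm(x, y=1.0, w=w, b=b, eps=1.0)

print(sigmoid(w @ x + b) > 0.5)      # True: clean input classified as 1
print(sigmoid(w @ x_adv + b) > 0.5)  # False: one gradient-sign step flips it
```

Adversarial training, as described in the abstract, simply mixes such perturbed examples (with their true labels) into each training batch alongside clean ones.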

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Quantum Patches: Enhancing Robustness of Quantum Machine Learning Models

    quant-ph 2026-04 unverdicted novelty 6.0

    Random quantum circuits used as adversarial training data reduce successful attack rates on QML models for CIFAR-10 from 89.8% to 68.45% and for CINIC-10 from 94.23% to 78.68%.

  2. UniAda: Universal Adaptive Multi-objective Adversarial Attack for End-to-End Autonomous Driving Systems

    cs.SE 2026-04 unverdicted novelty 5.0

    UniAda introduces a white-box multi-objective attack using adaptive weighting to generate perturbations that jointly affect steering and speed in E2E ADS, outperforming benchmarks with average deviations of 3.54-29 de...

  3. Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models

    cs.CV 2026-04 unverdicted novelty 4.0

    Perceptual quality metrics correlate strongly with each other but show minimal correlation with attack success rate across medical imaging models and datasets, making ASR alone inadequate for assessing adversarial robustness.