Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples

Nicolas Papernot, Patrick McDaniel, Ian Goodfellow


classification cs.CR cs.LG

keywords model, learning, machine, substitute, victim, adversarial, attacks, training


Many machine learning models are vulnerable to adversarial examples: inputs that are specially crafted to cause a machine learning model to produce an incorrect output. Adversarial examples that affect one model often affect another model, even if the two models have different architectures or were trained on different training sets, so long as both models were trained to perform the same task. An attacker may therefore train their own substitute model, craft adversarial examples against the substitute, and transfer them to a victim model, with very little information about the victim. Recent work has further developed a technique that uses the victim model as an oracle to label a synthetic training set for the substitute, so the attacker need not even collect a training set to mount the attack. We extend these recent techniques using reservoir sampling to greatly enhance the efficiency of the training procedure for the substitute model. We introduce new transferability attacks between previously unexplored (substitute, victim) pairs of machine learning model classes, most notably SVMs and decision trees. We demonstrate our attacks on two commercial machine learning classification systems from Amazon (96.19% misclassification rate) and Google (88.94%) using only 800 queries of the victim model, thereby showing that existing machine learning approaches are in general vulnerable to systematic black-box attacks regardless of their structure.
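The abstract names reservoir sampling as the means of making substitute training more efficient, but it does not spell out the procedure. As a rough illustration only, the following is a minimal sketch of classic reservoir sampling (Algorithm R), which keeps a uniform random sample of fixed size `k` from a stream of unknown length. The function name `reservoir_sample`, its parameters, and the idea of applying it to a stream of candidate synthetic inputs are assumptions made for this example, not details taken from the paper.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Hypothetical helper for illustration: classic Algorithm R, not
    necessarily the exact variant used by the paper's authors.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by drawing
            # a slot index uniformly from 0..i (inclusive).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Assumed usage: select a fixed budget of candidate inputs to label.
    candidates = range(10_000)
    chosen = reservoir_sample(candidates, k=5, rng=random.Random(0))
    print(len(chosen))
```

Under this assumed usage, the memory cost stays at `k` items however many candidates are generated, which is one plausible reason such sampling would help keep the number of labeled inputs bounded.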



Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Local Hessian Spectral Filtering for Robust Intrinsic Dimension Estimation
cs.LG 2026-05 unverdicted novelty 7.0

LHSD uses spectral filtering on the log-density Hessian to isolate tangent directions from noise and estimate local intrinsic dimension scalably via Stochastic Lanczos Quadrature.
Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
cs.CV 2026-04 unverdicted novelty 7.0

FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.
Red Teaming Language Models with Language Models
cs.CL 2022-02 conditional novelty 7.0

One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.
Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning
cs.CR 2017-12 unverdicted novelty 7.0

Injecting around 50 poisoned samples with a stealthy trigger creates backdoors in deep learning models achieving over 90% attack success under a weak threat model with no model or data knowledge required.
Towards Deep Learning Models Resistant to Adversarial Attacks
stat.ML 2017-06 accept novelty 7.0

Adversarial training via projected gradient descent on the inner maximization problem produces neural networks with substantially improved resistance to a wide range of attacks and establishes security against first-o...
Content Fuzzing for Escaping Information Cocoons on Digital Social Media
cs.CL 2026-04 unverdicted novelty 6.0

ContentFuzz rewrites posts with LLM guidance from stance model confidence to flip machine labels without altering human intent, tested across four models and three datasets in two languages.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
cs.CL 2023-10 conditional novelty 6.0

AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than base...
Laundering AI Authority with Adversarial Examples
cs.CR 2026-05 unverdicted novelty 5.0

Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.
SoK: A Comprehensive Analysis of the Current Status of Neural Tangent Generalization Attacks with Research Directions
cs.LG 2026-05 accept novelty 3.0

NTGA is the first clean-label generalization attack under black-box settings but is vulnerable to adversarial training and image transformations, with newer attacks outperforming it.