pith. machine review for the scientific record.

arxiv: 2604.02780 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

A Unified Perspective on Adversarial Membership Manipulation in Vision Models

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords membership inference attacks · adversarial robustness · vision models · gradient norms · privacy · adversarial examples · model extraction

The pith

Adversarial perturbations can fabricate membership for non-training images in vision model attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that small, imperceptible changes to images can make membership inference attacks incorrectly classify non-training data as training data in vision models. It identifies a consistent geometric pattern where the norm of gradients collapses in a specific way for these fabricated cases. This pattern differs from true members even when the images look similar semantically. Using this, the authors create ways to detect the manipulation and make inference more robust. If true, it means current privacy assessments for AI models need to consider adversarial inputs.

Core claim

Adversarial membership fabrication pushes non-member images into the member region of MIAs across models and datasets, and a characteristic gradient-norm collapse trajectory separates these fabricated members from genuine ones despite nearly identical semantics, supporting detection and robust inference methods.
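The paper publishes no code here, but the fabrication step it describes can be sketched. A minimal PyTorch sketch, assuming a standard image classifier with inputs in [0, 1]: a PGD-style loop that lowers the loss (raising confidence) while staying inside an L∞ ball, following the reading of Figure 3 that MFA drives inputs into high-confidence regions. Function name, step size, and step count are illustrative, not the paper's.

```python
# Hedged sketch of a member-fabrication-style perturbation (assumptions:
# PyTorch classifier `model`, batched inputs x in [0,1], integer labels y).
import torch
import torch.nn.functional as F

def fabricate_member(model, x, y, eps=2/255, alpha=0.5/255, steps=20):
    """Push a non-member x toward the high-confidence ('member-like') region
    while staying inside the L-infinity ball B_eps[x]."""
    x0 = x.detach()
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)   # low loss ~ high confidence
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()            # descend, not ascend
            x_adv = x0 + (x_adv - x0).clamp(-eps, eps)     # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```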

What carries the argument

The gradient-norm collapse trajectory that distinguishes fabricated from true members in the attack process.
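One way to make that trajectory concrete is to record the input-gradient norm at every attack step. A sketch under the same assumptions as above, using the ε and initial step size quoted in the Figure 5 caption; the collapse the paper reports would show up as these per-step norms shrinking toward zero for fabricated members.

```python
# Hedged sketch: per-step input-gradient norms along a PGD-style trajectory.
import torch
import torch.nn.functional as F

def grad_norm_trajectory(model, x, y, eps=8/255, alpha=(8/255)/24, steps=20):
    """Record ||dL/dx||_2 at each step; the paper's signature is this norm
    progressively collapsing for fabricated members (cf. Figure 5)."""
    x0 = x.detach()
    x_adv = x0.clone()
    norms = []
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        norms.append(grad.flatten(1).norm(dim=1).detach())   # per-sample L2 norm
        with torch.no_grad():
            x_adv = (x0 + (x_adv - alpha * grad.sign() - x0).clamp(-eps, eps)).clamp(0, 1)
    return torch.stack(norms, dim=1)   # shape [batch, steps]
```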

If this is right

  • Adversarial fabrication works reliably on various vision architectures and datasets.
  • Gradient-geometry signals can detect the manipulation effectively.
  • Robust inference frameworks can reduce the success of such adversarial attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This vulnerability suggests that membership inference tools themselves require adversarial training or hardening.
  • Similar manipulation might apply to other types of privacy attacks beyond membership inference.
  • Defenses based on gradient norms could be integrated into standard model evaluation pipelines.

Load-bearing premise

The gradient-norm collapse trajectory reliably separates fabricated members from true members despite nearly identical semantic representations.
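The detection idea this premise supports can be sketched as a simple score-and-threshold rule: compute each queried input's gradient norm and flag unusually small values as possibly fabricated. The paper describes its detector only as "grounded in gradient-geometry signals," so the specific rule and the threshold τ below are assumptions, not the authors' method.

```python
# Hedged sketch of gradient-norm-based flagging (threshold tau is assumed).
import torch
import torch.nn.functional as F

def input_grad_norm(model, x, y):
    """Per-sample ||dL/dx||_2 for the queried inputs."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y, reduction="sum")  # sum keeps per-sample grads independent
    grad = torch.autograd.grad(loss, x)[0]
    return grad.flatten(1).norm(dim=1)

def flag_fabricated(model, x, y, tau):
    """Flag inputs whose gradient norm falls below tau; tau is the free
    parameter noted in the ledger further down this page."""
    return input_grad_norm(model, x, y) < tau
```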

What would settle it

Finding a set of fabricated and true members where the gradient-norm trajectories are statistically indistinguishable would disprove the separation method.
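Operationally, "statistically indistinguishable" could be checked with a two-sample test on the per-step norms. A sketch using SciPy's Kolmogorov-Smirnov test; the significance level and the per-step testing scheme are chosen here for illustration only.

```python
# Hedged sketch: are fabricated vs. true gradient-norm trajectories separable?
import numpy as np
from scipy.stats import ks_2samp

def trajectories_distinguishable(fab_norms, true_norms, alpha=0.01):
    """fab_norms, true_norms: arrays of shape [n_samples, steps] of per-step
    gradient norms. Run a KS test per step; if no step separates the groups,
    the premise above would be in trouble. (A real analysis would also
    correct for multiple comparisons.)"""
    fab_norms, true_norms = np.asarray(fab_norms), np.asarray(true_norms)
    pvals = [ks_2samp(fab_norms[:, t], true_norms[:, t]).pvalue
             for t in range(fab_norms.shape[1])]
    return any(p < alpha for p in pvals), pvals
```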

Figures

Figures reproduced from arXiv: 2604.02780 by Feng Liu, Kaiwen Zhou, Ruize Gao, Yongqiang Chen.

Figure 1
Figure 1: Overview of the Background and Our Proposed Research Problems. MEMGUARD defends the model, whereas we show that the auditing mechanism itself can be attacked. A unified geometric perspective. Through theoretical analysis and empirical validation, we find that these perturbed non-members, fabricated members, share a distinct optimization geometry: their input-gradient norms progressively collapse as pertu…
Figure 2
Figure 2: Imperceptible Adversarial Perturbations on ImageNet-100. The first row are the original non-members, and the second row are the corresponding perturbed fabricated members. We used ϵ = 2/255 for Bϵ[x] here. The perturbations are extremely imperceptible to the human eye, which demonstrates that the Member Fabrication Attack (MFA) can be successful with only the addition of very small perturbations. 2.2. Memb…
Figure 3
Figure 3: Objective of adversarial attacks (left) vs. MFA (right). The black and red dots denote the original input and the perturbed sample within the ϵ-ball (gray region). Adversarial attacks push inputs into the misclassification region (orange), where max_{i≠y} p_i > p_y. In contrast, MFA drives inputs into high-confidence regions. that enables reliable detection (§3.2), and finally incorporate this geometric insigh…
Figure 4
Figure 4: Visualization of the Distribution of Fabricated and True Members in Different Semantic Feature Spaces Using t-SNE [38]. The two subfigures represent the semantic features at the penultimate and antepenultimate layers, with perturbation constrained to ∥δ∥∞ ≤ 4/255. Red and blue dots denote true and fabricated members, respectively. The high degree of overlap of the red dots and blue dots suggests that seman…
Figure 5
Figure 5: Decay of Gradient Norm with Respect to Input Across Steps. As the steps increase, the gradient norm with respect to the input progressively diminishes. For clarity, a large epsilon ball (∥δ∥∞ ≤ 8/255) is selected, alongside a small initial step size of 1/24 × (8/255), across 20 steps. Empirical observation and geometric interpretation. Although fabricated and true members are visually and seman…
Figure 6
Figure 6: Distribution of fabricated and true members across different detection strategies. The first plot (left) shows the Mahalanobis distance, the second plot (middle) shows the Local Intrinsic Dimensionality (LID) values, and the third plot (right) presents the gradient norm with respect to the input. We use the relative frequency within each membership class as the vertical axis. The gradient norm in the third…
Figure 7
Figure 7: Representative Experimental Results. Subfigures (a-c) demonstrate the superior performance of MFA compared to baselines, highlighting its effectiveness across different MIAs and datasets. Subfigures (d-f) show the effectiveness of MFD across varying perturbation levels ϵ and datasets. Subfigures (g-i) illustrate the effectiveness of our Adversarially Robust Strategies in combination with different stro…
Figure 8
Figure 8: Visualization of the Distribution of Fabricated and True Members in Different Semantic Feature Spaces Using t-SNE [38]. The first row displays the semantic features at the penultimate layer, with perturbation constrained from ∥δ∥∞ ≤ 2.0/255 to ∥δ∥∞ ≤ 6.0/255, left to right. The second row shows the semantic features at the antepenultimate layer, with the same range of perturbations. Red dots represen…
Figure 9
Figure 9: Imperceptible Adversarial Perturbations on ImageNet-100. For each pair, the top image is the original non-member, and the bottom image is the corresponding perturbed fabricated member, demonstrating that the perturbations are imperceptible to the human eye. We used ϵ = 2/255 for Bϵ[x] here.
Figure 10
Figure 10: Imperceptible Adversarial Perturbations on ImageNet-100. For each pair, the top image is the original non-member, and the bottom image is the corresponding perturbed fabricated member, demonstrating that the perturbations are imperceptible to the human eye. We used ϵ = 2/255 for Bϵ[x] here.
Figure 11
Figure 11: Imperceptible Adversarial Perturbations on ImageNet-100. For each pair, the top image is the original non-member, and the bottom image is the corresponding perturbed fabricated member, demonstrating that the perturbations are imperceptible to the human eye. We used ϵ = 2/255 for Bϵ[x] here.
Figure 12
Figure 12: Comparison of the Error Area Between Our Member Fabrication Attack and Baselines Across Diverse MIAs (∥δ∥∞ ≤ 4.0/255).
Figure 13
Figure 13: Comparison of the Error Area Between Our Member Fabrication Attack and Baselines Across Diverse Perturbation Bounds on CIFAR-10.
Figure 14
Figure 14: Comparison of the Error Area Between Our Member Fabrication Attack and Baselines Across Diverse Perturbation Bounds on CIFAR-100.
Figure 15
Figure 15: Comparison of the Error Area Between Our Member Fabrication Attack and Baselines Across Diverse Perturbation Bounds on CINIC-10.
Figure 16
Figure 16: Comparison of the Error Area Between Our Member Fabrication Attack and Baselines Across Diverse Perturbation Bounds on SVHN.
Figure 17
Figure 17: Comparison of the Error Area Between Our Member Fabrication Attack and Baselines Across Diverse Perturbation Bounds on ImageNet-100.
Figure 18
Figure 18: Comparison of the ROC Curve for Our Member Fabrication Detection Across Diverse Perturbation Bounds on CIFAR-10.
Figure 19
Figure 19: Comparison of the ROC Curve for Our Member Fabrication Detection Across Diverse Perturbation Bounds on CIFAR-100.
Figure 20
Figure 20: Comparison of the ROC Curve for Our Member Fabrication Detection Across Diverse Perturbation Bounds on CINIC-10.
Figure 21
Figure 21: Comparison of the ROC Curve for Our Member Fabrication Detection Across Diverse Perturbation Bounds on SVHN.
Figure 22
Figure 22: Comparison of the ROC Curve for Our Member Fabrication Detection Across Diverse Perturbation Bounds on ImageNet-100.
Figure 23
Figure 23: Comparison of ROC Curves for Our Adversarially Robust MIAs and Baselines on CIFAR-10.
Figure 24
Figure 24: Comparison of ROC Curves for Our Adversarially Robust MIAs and Baselines on CIFAR-100.
Figure 25
Figure 25: Comparison of ROC Curves for Our Adversarially Robust MIAs and Baselines on SVHN.
Figure 26
Figure 26: Comparison of ROC Curves for Our Adversarially Robust MIAs and Baselines on CINIC-10.
read the original abstract

Membership inference attacks (MIAs) aim to determine whether a specific data point was part of a model's training set, serving as effective tools for evaluating privacy leakage of vision models. However, existing MIAs implicitly assume honest query inputs, and their adversarial robustness remains unexplored. We show that MIAs for vision models expose a previously overlooked adversarial surface: adversarial membership manipulation, where imperceptible perturbations can reliably push non-member images into the "member" region of state-of-the-art MIAs. In this paper, we provide the first unified perspective on this phenomenon by analyzing its mechanism and implications. We begin by demonstrating that adversarial membership fabrication is consistently effective across diverse architectures and datasets. We then reveal a distinctive geometric signature - a characteristic gradient-norm collapse trajectory - that reliably separates fabricated from true members despite their nearly identical semantic representations. Building on this insight, we introduce a principled detection strategy grounded in gradient-geometry signals and develop a robust inference framework that substantially mitigates adversarial manipulation. Extensive experiments show that fabrication is broadly effective, while our detection and robust inference strategies significantly enhance resilience. This work establishes the first comprehensive framework for adversarial membership manipulation in vision models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that membership inference attacks (MIAs) on vision models have an overlooked adversarial surface: imperceptible perturbations can fabricate membership for non-member images with high success across architectures and datasets. It identifies a characteristic gradient-norm collapse trajectory as a geometric signature that separates fabricated members from true members despite similar semantics. Building on this, the authors propose a gradient-geometry-based detection strategy and a robust inference framework that substantially mitigates fabrication success. Extensive experiments are reported to show consistent fabrication effectiveness and improved resilience under the proposed defenses.

Significance. If the gradient-norm separation and detection hold under realistic conditions, the work provides the first unified framework for adversarial membership manipulation in vision models. It highlights a new privacy leakage vector beyond standard MIAs and offers concrete detection and robust inference tools. Strengths include the broad experimental coverage across architectures/datasets and the insight that geometric signatures can distinguish fabricated from true members even when semantic representations are nearly identical.

major comments (2)
  1. [§5] §5 and abstract: The central claim that the robust inference framework 'substantially mitigates' adversarial fabrication rests on the gradient-norm collapse trajectory reliably separating classes under the paper's threat model. However, all reported experiments (§4) use standard PGD-style objectives without knowledge of the detector. No evaluation is provided for adaptive adversaries that explicitly optimize perturbations to match the gradient-norm trajectory of true members (e.g., by adding a matching term to the loss; a sketch of such an adaptive objective follows these comments). This is load-bearing for the resilience conclusion.
  2. [§4] §4 (experimental setup): The soundness of the fabrication effectiveness and detection claims cannot be fully verified without reported data splits, exact MIA implementations, statistical significance tests (e.g., confidence intervals on success rates), or ablation on the gradient-norm threshold choice. The abstract notes consistent effectiveness, but missing details prevent independent reproduction of the separation results.
minor comments (2)
  1. [§3] Notation for the gradient-norm trajectory and detection threshold should be formalized with an equation in §3 to improve clarity and reproducibility.
  2. [Figures in §4] Figure captions and axis labels in the geometric signature plots could be expanded to explicitly state what 'collapse trajectory' quantifies (e.g., norm over perturbation steps).
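For concreteness, the adaptive objective raised in major comment 1 might look like the following sketch: the usual confidence term plus a penalty that pulls the current input-gradient norm toward a reference trajectory measured on true members. The weight lam and the reference target_traj are illustrative; nothing of this kind appears in the paper as reviewed.

```python
# Hedged sketch of a detector-aware (adaptive) fabrication objective.
import torch
import torch.nn.functional as F

def adaptive_fabricate(model, x, y, target_traj, eps=4/255, alpha=0.5/255,
                       steps=20, lam=1.0):
    """PGD-style fabrication whose loss also penalizes deviation of the current
    input-gradient norm from target_traj[t], a per-step reference measured on
    true members (lam and target_traj are assumptions, not from the paper)."""
    x0 = x.detach()
    x_adv = x0.clone()
    for t in range(steps):
        x_adv.requires_grad_(True)
        loss_conf = F.cross_entropy(model(x_adv), y)
        # create_graph=True so the gradient norm itself stays differentiable
        g = torch.autograd.grad(loss_conf, x_adv, create_graph=True)[0]
        gnorm = g.flatten(1).norm(dim=1).mean()
        loss = loss_conf + lam * (gnorm - target_traj[t]) ** 2
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = (x0 + (x_adv - alpha * grad.sign() - x0).clamp(-eps, eps)).clamp(0, 1)
    return x_adv.detach()
```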

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate revisions to strengthen the reproducibility and robustness claims of the manuscript.

read point-by-point responses
  1. Referee: [§5] §5 and abstract: The central claim that the robust inference framework 'substantially mitigates' adversarial fabrication rests on the gradient-norm collapse trajectory reliably separating classes under the paper's threat model. However, all reported experiments (§4) use standard PGD-style objectives without knowledge of the detector. No evaluation is provided for adaptive adversaries that explicitly optimize perturbations to match the gradient-norm trajectory of true members (e.g., by adding a matching term to the loss). This is load-bearing for the resilience conclusion.

    Authors: We agree that evaluating resilience against adaptive adversaries who are aware of the detector is necessary to fully support the claim. In the revised manuscript, we will add experiments in which the adversary augments the PGD objective with an explicit term that penalizes deviation from the gradient-norm trajectory of true members. We will report fabrication success rates under this stronger threat model and discuss any degradation in the separation provided by the gradient-norm signal. revision: yes

  2. Referee: [§4] §4 (experimental setup): The soundness of the fabrication effectiveness and detection claims cannot be fully verified without reported data splits, exact MIA implementations, statistical significance tests (e.g., confidence intervals on success rates), or ablation on the gradient-norm threshold choice. The abstract notes consistent effectiveness, but missing details prevent independent reproduction of the separation results.

    Authors: We acknowledge the need for greater experimental transparency. In the revision we will (i) explicitly state the train/test splits and random seeds used for all datasets, (ii) provide precise pseudocode and hyper-parameter settings for each MIA implementation, (iii) report 95% confidence intervals on all success-rate figures, and (iv) include an ablation study varying the gradient-norm threshold to demonstrate the stability of the detection performance. revision: yes
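A minimal sketch of how the promised 95% confidence intervals on success rates could be produced with a percentile bootstrap; the resampling scheme is an assumption, since the rebuttal does not specify one.

```python
# Hedged sketch: percentile-bootstrap CI for an attack success rate.
import numpy as np

def bootstrap_ci(successes, n_boot=10_000, level=0.95, seed=0):
    """successes: 0/1 array, one entry per attacked sample.
    Returns (rate, lower, upper) from a simple percentile bootstrap."""
    rng = np.random.default_rng(seed)
    successes = np.asarray(successes)
    rates = rng.choice(successes, size=(n_boot, successes.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(rates, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return successes.mean(), lo, hi
```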

Circularity Check

0 steps flagged

No circularity; empirical observations and strategies are independent of fitted inputs or self-citations.

full rationale

The paper presents no mathematical derivations, equations, or parameter-fitting steps that reduce to self-definition or prior results by construction. Claims of fabrication effectiveness and the gradient-norm collapse trajectory as a separating signature are introduced as empirical findings from experiments across architectures and datasets. The detection strategy and robust inference framework are built directly on these observed geometric signals without invoking self-citations, uniqueness theorems, or ansatzes that loop back to the paper's own inputs. No load-bearing self-citation chains or renamed known results appear in the abstract or described content. The work is self-contained via external experimental validation, consistent with a score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard assumptions from membership inference literature plus an empirically observed geometric signature; no new free parameters or invented entities are explicitly introduced beyond experimental thresholds.

free parameters (1)
  • gradient-norm threshold
    Likely used to separate collapse trajectories in the detection strategy; value not stated in abstract (a selection sketch follows this ledger).
axioms (1)
  • domain assumption: MIAs can be evaluated via output or gradient signals from the target model
    Standard premise in membership inference research invoked throughout the abstract.
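A sketch of how that lone free parameter, the gradient-norm threshold, could be set on a held-out validation split: sweep an ROC curve over (negated) gradient norms and take the point maximizing Youden's J. This illustrates the parameter's role; it is not the paper's stated procedure.

```python
# Hedged sketch: picking the gradient-norm detection threshold on validation data.
import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(grad_norms, is_fabricated):
    """grad_norms: per-sample input-gradient norms on a validation split.
    is_fabricated: 1 for fabricated members, 0 for true members.
    Fabricated members are expected to have *smaller* norms, so score = -norm."""
    fpr, tpr, thresholds = roc_curve(is_fabricated, -np.asarray(grad_norms))
    j = tpr - fpr                      # Youden's J statistic
    # thresholds[0] is a sentinel in scikit-learn; fine for a sketch.
    return -thresholds[np.argmax(j)]   # flag samples with norm at or below this value
```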

pith-pipeline@v0.9.0 · 5501 in / 1110 out tokens · 39199 ms · 2026-05-13T20:01:40.294456+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 1 internal anchor

  1. [1]

    Quantifying membership inference vulnerability via generalization gap and other model metrics. arXiv preprint arXiv:2009.05669, 2020

    Jason W Bentley, Daniel Gibney, Gary Hoppenworth, and Sumit Kumar Jha. Quantifying membership inference vulnerability via generalization gap and other model metrics. arXiv preprint arXiv:2009.05669, 2020. 13

  2. [2]

    A survey of black-box adversarial attacks on computer vision models.arXiv preprint arXiv:1912.01667,

    Siddhant Bhambri, Sumanyu Muku, Avinash Tulasi, and Arun Balaji Buduru. A survey of black-box adversarial attacks on computer vision models.arXiv preprint arXiv:1912.01667,

  3. [3]

    Pattern recognition.Machine learning, 128(9), 2006

    Christopher M Bishop. Pattern recognition.Machine learning, 128(9), 2006. 13

  4. [4]

    Adversarial examples are not easily detected: Bypassing ten detection methods

    Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, 2017. 13

  5. [5]

    Towards evaluating the robustness of neural networks

    Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. InCVPR, 2017. 3, 14, 19, 20

  6. [6]

    The secret sharer: Evaluating and testing unintended memorization in neural networks

    Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security, 2019. 13

  7. [7]

    Membership inference attacks from first principles

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. InIEEE Symposium on Security and Privacy, 2022. 1, 2, 3, 4, 8, 12, 13, 18, 19, 20

  8. [8]

    Deepdriving: Learning affordance for direct perception in autonomous driving

    Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. InICCV, 2015. 13

  9. [9]

    GAN-leaks: A taxonomy of membership inference attacks against generative models

    Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. GAN-leaks: A taxonomy of membership inference attacks against generative models. In ACM SIGSAC Conference on Computer and Communications Security, 2020. 13

  10. [10]

    When machine unlearning jeopardizes privacy

    Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. When machine unlearning jeopardizes privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2021. 12

  11. [11]

    Adversarial robustness: From self-supervised pre-training to fine-tuning

    Tianlong Chen, Sijia Liu, Shiyu Chang, Yu Cheng, Lisa Amini, and Zhangyang Wang. Adversarial robustness: From self-supervised pre-training to fine-tuning. InCVPR, 2020. 14

  12. [12]

    Label-only membership inference attacks

    Christopher A Choquette-Choo, Florian Tramer, Nicholas Carlini, and Nicolas Papernot. Label-only membership inference attacks. In ICML, 2021. 12

  13. [13]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

    Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, 2020. 3, 4, 8, 14, 19, 20

  14. [14]

    Cinic-10 is not imagenet or cifar-10.arXiv preprint arXiv:1810.03505, 2018

    Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. Cinic-10 is not imagenet or cifar-10.arXiv preprint arXiv:1810.03505, 2018. 8, 19, 20

  15. [15]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009. 8, 19, 20

  16. [16]

    Model inversion attacks that exploit confidence information and basic countermeasures

    Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. InACM SIGSAC Conference on Computer and Communications Security, 2015. 12

  17. [17]

    Property inference attacks on fully connected neural networks using permutation invariant representations

    Karan Ganju, Qi Wang, Wei Yang, Carl A Gunter, and Nikita Borisov. Property inference attacks on fully connected neural networks using permutation invariant representations. In ACM SIGSAC Conference on Computer and Communications Security, 2018. 12

  18. [18]

    Maximum mean discrepancy test is aware of adversarial attacks

    Ruize Gao, Feng Liu, Jingfeng Zhang, Bo Han, Tongliang Liu, Gang Niu, and Masashi Sugiyama. Maximum mean discrepancy test is aware of adversarial attacks. InICML,

  19. [19]

    Fast and reliable evaluation of adversarial robustness with minimum- margin attack

    Ruize Gao, Jiongxiao Wang, Kaiwen Zhou, Feng Liu, Binghui Xie, Gang Niu, Bo Han, and James Cheng. Fast and reliable evaluation of adversarial robustness with minimum- margin attack. InICML, 2022. 3, 14, 19

  20. [20]

    Explaining and harnessing adversarial examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. InICLR,

  21. [21]

    On the (Statistical) Detection of Adversarial Examples

    Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv:1702.06280, 2017. 14

  22. [22]

    Simple black-box adversarial attacks

    Chuan Guo, Jacob Gardner, Yurong You, Andrew Gordon Wilson, and Kilian Weinberger. Simple black-box adversarial attacks. InICML, 2019. 4

  23. [23]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, 2016. 8, 12, 19

  24. [24]

    Sorami Hisamoto, Matt Post, and Kevin Duh. Membership inference attacks on sequence-to-sequence models: Is my data in your machine translation system?Transactions of the Association for Computational Linguistics, 8:49–63, 2020. 12

  25. [25]

    Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLOS Genetics, 4:1–9, 2008

    Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V Pearson, Dietrich A Stephan, Stanley F Nelson, and David W Craig. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLOS Genetics, 4:1–9, 2008. 1, 3

  26. [26]

    Scalable continuous-time diffusion framework for network inference and influence estimation

    Keke Huang, Ruize Gao, Bogdan Cautis, and Xiaokui Xiao. Scalable continuous-time diffusion framework for network inference and influence estimation. InWWW, 2024. 12

  27. [27]

    Black-box adversarial attacks with limited queries and information

    Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. InICML, 2018. 4

  28. [28]

    Memguard: Defending against black- box membership inference attacks via adversarial examples

    Jinyuan Jia, Ahmed Salem, Michael Backes, Yang Zhang, and Neil Zhenqiang Gong. Memguard: Defending against black- box membership inference attacks via adversarial examples. InProceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2019. 1

  29. [29]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 8, 18, 19, 20

  30. [30]

    Adversarial examples in the physical world

    Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world. In ICLR, 2017. 13, 19, 20

  31. [31]

    A simple unified framework for detecting out-of-distribution samples and adversarial attacks

    Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. InNeurIPS, 2018. 5, 14

  32. [32]

    Stolen memories: Leveraging model memorization for calibrated white-box membership inference

    Klas Leino and Matt Fredrikson. Stolen memories: Leveraging model memorization for calibrated white-box membership inference. In USENIX Security, 2020. 12, 13, 18

  33. [33]

    Membership inference attacks and defenses in classification models

    Jiacheng Li, Ninghui Li, and Bruno Ribeiro. Membership inference attacks and defenses in classification models. In ACM Conference on Data and Application Security and Privacy,

  34. [34]

    Adversarial examples detection in deep networks with convolutional filter statistics

    Xin Li and Fuxin Li. Adversarial examples detection in deep networks with convolutional filter statistics. InICCV, 2017. 14

  35. [35]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts.arXiv preprint arXiv:1608.03983,

  36. [36]

    Characterizing adversarial subspaces using local intrinsic dimensionality

    Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In ICLR, 2018. 5, 14

  37. [37]

    Understanding adversarial attacks on deep learning based medical image analysis systems

    Xingjun Ma, Yuhao Niu, Lin Gu, Yisen Wang, Yitian Zhao, James Bailey, and Feng Lu. Understanding adversarial attacks on deep learning based medical image analysis systems. Pattern Recognition, 2021. 13

  38. [38]

    Visualizing data using t-sne.Journal of machine learning research, 9 (Nov):2579–2605, 2008

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9 (Nov):2579–2605, 2008. 5, 17, 19

  39. [39]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. InICLR, 2018. 3, 4, 8, 13, 19, 20

  40. [40]

    Exploiting unintended feature leakage in collaborative learning

    Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unintended feature leakage in collaborative learning. InIEEE Symposium on Security and Privacy, 2019. 18

  41. [41]

    On detecting adversarial perturbations

    Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv:1702.04267, 2017. 14

  42. [42]

    Deep learning for healthcare: review, opportunities and challenges.Briefings in bioinformatics, 19 (6):1236–1246, 2018

    Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. Deep learning for healthcare: review, opportunities and challenges.Briefings in bioinformatics, 19 (6):1236–1246, 2018. 12

  43. [43]

    Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning.arXiv preprint arXiv:2007.09339, 2020

    Sasi Kumar Murakonda and Reza Shokri. Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning.arXiv preprint arXiv:2007.09339, 2020. 13, 14, 18

  44. [44]

    Machine learning with membership privacy using adversarial regularization

    Milad Nasr, Reza Shokri, and Amir Houmansadr. Machine learning with membership privacy using adversarial regularization. In ACM SIGSAC Conference on Computer and Communications Security, 2018. 12

  45. [45]

    Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning

    Milad Nasr, Reza Shokri, and Amir Houmansadr. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In IEEE Symposium on Security and Privacy, 2019. 7, 12

  46. [46]

    Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. 19

  47. [47]

    Jerzy Neyman and Egon Sharpe Pearson. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933. 12

  48. [48]

    Deep neural networks are easily fooled: High confidence predictions for unrecognizable images

    Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. InCVPR, 2015. 13

  49. [49]

    Curtail: Characterizing and thwarting adversarial deep learning.arXiv:1709.02538, 2017

    Bita Darvish Rouhani, Mohammad Samragh, Tara Javidi, and Farinaz Koushanfar. Curtail: Characterizing and thwarting adversarial deep learning.arXiv:1709.02538, 2017. 14

  50. [50]

    White-box vs black-box: Bayes optimal strategies for membership inference

    Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, and Hervé Jégou. White-box vs black-box: Bayes optimal strategies for membership inference. In ICML, 2019. 12

  51. [51]

    ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models

    Ahmed Salem, Yang Zhang, Mathias Humbert, Mario Fritz, and Michael Backes. ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models. In Annual Network and Distributed System Security Symposium, 2019. 12, 13, 18

  52. [52]

    Hats: Hardness-aware trajectory synthesis for GUI agents

    Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, and Gongwei Chen. Hats: Hardness-aware trajectory synthesis for GUI agents. In CVPR, 2026. 12

  53. [53]

    Membership inference attacks against machine learning models

    Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. InIEEE Symposium on Security and Privacy,

  54. [54]

    Machine learning models that remember too much

    Congzheng Song, Thomas Ristenpart, and Vitaly Shmatikov. Machine learning models that remember too much. In ACM SIGSAC Conference on Computer and Communications Security, 2017. 13

  55. [55]

    Systematic evaluation of privacy risks of machine learning models

    Liwei Song and Prateek Mittal. Systematic evaluation of privacy risks of machine learning models. In USENIX Security,

  56. [56]

    Introducing a new privacy testing library in tensorflow, 2020

    Shuang Song and David Marn. Introducing a new privacy testing library in tensorflow, 2020. 14

  57. [57]

    Intriguing properties of neural networks

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. InICLR, 2014. 3, 13

  58. [58]

    Stealing machine learning models via prediction apis

    Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security, 2016. 12

  59. [59]

    The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline

    Haonan Wang, Qianli Shen, Yao Tong, Yang Zhang, and Kenji Kawaguchi. The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline. InICML, 2024. 14

  60. [60]

    On the convergence and robustness of adversarial training

    Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. InICML, 2019. 13, 14

  61. [61]

    On the importance of difficulty calibration in membership inference attacks. arXiv preprint arXiv:2111.08440, 2021

    Lauren Watson, Chuan Guo, Graham Cormode, and Alex Sablayrolles. On the importance of difficulty calibration in membership inference attacks. arXiv preprint arXiv:2111.08440, 2021. 1, 3, 12, 18

  62. [62]

    Adversarial weight perturbation helps robust generalization

    Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. InNeurIPS,

  63. [63]

    The human splicing code reveals new insights into the genetic determinants of disease.Science, 347(6218), 2015

    Hui Y Xiong, Babak Alipanahi, Leo J Lee, Hannes Bretschneider, Daniele Merico, Ryan KC Yuen, Yimin Hua, Serge Gueroussov, Hamed S Najafabadi, Timothy R Hughes, et al. The human splicing code reveals new insights into the genetic determinants of disease.Science, 347(6218), 2015. 12

  64. [64]

    Enhanced membership inference attacks against machine learning models

    Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and Reza Shokri. Enhanced membership inference attacks against machine learning models. In ACM SIGSAC Conference on Computer and Communications Security, 2022. 1, 2, 3, 4, 8, 12, 18, 19

  65. [65]

    Privacy risk in machine learning: Analyzing the connection to overfitting

    Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. InIEEE computer security foundations symposium, 2018. 1, 3, 4, 8, 12, 13, 19

  66. [66]

    Wide residual networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. InBMVC, 2016. 8, 19

  67. [67]

    Low-cost high-power membership inference attacks

    Sajjad Zarifzadeh, Philippe Liu, and Reza Shokri. Low-cost high-power membership inference attacks. InICML, 2024. 1, 3, 4, 8, 12, 19, 20

  68. [68]

    Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64:107–115, 2021

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64:107–115, 2021. 13

  69. [69]

    Geometry-aware instance-reweighted adversarial training

    Jingfeng Zhang, Jianing Zhu, Gang Niu, Bo Han, Masashi Sugiyama, and Mohan Kankanhalli. Geometry-aware instance-reweighted adversarial training. InICLR, 2021. 14

  70. [70]

    Dual-path distillation: A unified framework to improve black-box attacks

    Yonggang Zhang, Ya Li, Tongliang Liu, and Xinmei Tian. Dual-path distillation: A unified framework to improve black-box attacks. In ICML, 2020. 13

  71. [71]

    Instead of taking a single step of size ϵ in the direction of the gradient sign, multiple smaller steps are taken in PGD (the result is clipped by the same ϵ)

    introduced a simple refinement of FGSM, which is the projected gradient descent (PGD) attack. Instead of taking a single step of size ϵ in the direction of the gradient sign, multiple smaller steps are taken in PGD (the result is clipped by the same ϵ). Specifically, we start with setting x(0) = x, and then in each iteration: x′(t+1) = Π_{Bϵ[x(0)]}(x′(t) + αs…

  72. [72]

    Motivated by this, [5] replaced the CE loss with several possible choices

    observed the phenomenon of gradient vanishing in the widely used CE loss for potential failure. Motivated by this, [5] replaced the CE loss with several possible choices. Among these choices, the widely used one for the untargeted attack is CW(x, y) = −z_y(x′) + max_{i≠y} z_i(x′) (21), where z is the logits of the model outputs. AutoAttack and Minimum Margin at…

  73. [73]

    gradient-norm collapse

    argued that the high computational cost of AutoAttack is unnecessary for identifying the most adversarial example: Definition 6 (The most adversarial example). Given a natural example x with its true label y, the most adversarial example x∗ within Bϵ[x] is defined as: ∀x′ ∈ Bϵ[x], x∗ = arg max_{x′} −(z_y(x′) − max_{i≠y} z_i(x′)), (22) where Bϵ[x] = {x′ | d_∞(x,…

  74. [74]

    Shadow Models. For Attack R [64], we train 100 reference models (OUT-Models)

    The learning rate is initialized to τ = 0.1 and follows a cosine annealing schedule, gradually decaying to zero over 100 epochs. Shadow Models. For Attack R [64], we train 100 reference models (OUT-Models). For LiRA [7] and RMIA [67], we train 100 IN-Models and 100 OUT-Models for modeling Q̃_in/out, and ensure that the same shadow models are used across differe…