Pith. Machine review for the scientific record.

arxiv: 2605.02196 · v2 · submitted 2026-05-04 · 💻 cs.LG

Recognition: no theorem link

DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords machine unlearning · quantization · privacy · large language models · model robustness · INT4 · forgetting attack

The pith

INT4 quantization restores data that machine unlearning removed at higher precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that machine unlearning evaluations conducted at bfloat16 precision do not guarantee privacy once models are deployed at the lower INT4 precision common in production. It shows that this quantization step can recover up to 22 times more of the supposedly forgotten content across multiple unlearning methods and datasets. The work identifies a three-way tension where strong forgetting, retained utility, and robustness to quantization cannot be achieved simultaneously with existing techniques. It introduces a new training objective that produces models with stable durability across BF16, INT8, and INT4 precisions.

Core claim

Quantization to INT4 induces a recovery attack (QRA) that revives training data removed by unlearning, even when the model passes audits at BF16. No existing method satisfies the FA-RA-Q-INT4 trilemma of simultaneous forgetting, utility, and quantization robustness. A new sharpness-aware objective using straight-through estimator gradients yields the first method with a stable (0.047, {BF16, INT8, INT4}) durability certificate.

What carries the argument

The quantization recovery attack (QRA) observed under adapter-space INT4 quantization in the NF4+LoRA regime, together with the sharpness-aware forgetting objective that propagates gradients through the INT4 rounding operation.
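The gradient mechanism can be sketched in a few lines. The quantizer below is a generic symmetric INT4 rounder, not the paper's NF4 scheme, and all names are illustrative rather than taken from the authors' code.

```python
def fake_quant_int4(w: float, scale: float) -> float:
    """Forward pass: scale, round to the nearest of 16 signed levels
    ([-8, 7]), then rescale -- a generic symmetric INT4 quantizer,
    assumed here for illustration in place of NF4."""
    q = max(-8, min(7, round(w / scale)))
    return q * scale

def ste_backward(upstream_grad: float) -> float:
    """Backward pass under the straight-through estimator: the
    non-differentiable rounding is treated as the identity, so the
    gradient passes through unchanged."""
    return upstream_grad
```

In a real training loop this pair would be wrapped in a custom autograd function (e.g. `torch.autograd.Function`) so the forgetting loss is evaluated on quantized weights while gradients update the full-precision copies.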

Load-bearing premise

The measured recovery of forgotten content is caused by the quantization step itself rather than by interactions with the chosen unlearning algorithms, metrics, or datasets.

What would settle it

Running the same seven unlearning methods on LLaMA-3-8B-Instruct with the TOFU, MUSE-News, and WikiBio-WPU datasets and finding that INT4 outputs show no increase in membership or extraction metrics over the BF16 baseline.

Figures

Figures reproduced from arXiv: 2605.02196 by Abdullah Ahmad Khan, Ferdous Sohel.

Figure 2. INT4 recovery attack across all baselines. Left: Q-INT8 = 0 (grey) for every method, while Q-INT4 (colored) is catastrophic for methods that actually forget; the certification threshold is shown as a dashed line. Right: Q-INT4/FA recovery ratio for methods with FA < 0.05. GradDiff achieves the best forgetting quality yet has the worst INT4 fragility (18.9×): state-of-the-art forgetting does not imply quantization robustness. view at source ↗
Original abstract

Machine unlearning aims to remove specified training data to satisfy privacy regulations such as GDPR. However, existing evaluations assume identical precision at unlearning and deployment, overlooking that production LLMs are deployed at low-bit precision. We show that INT4 quantization systematically restores forgotten content even when models pass compliance audits at bfloat16 (BF16); we term this the quantization recovery attack (QRA). We conduct the first systematic study of unlearning robustness under adapter-space INT4 quantization in the NF4+LoRA regime, evaluating seven methods on LLaMA-3-8B-Instruct across TOFU, MUSE-News, and WikiBio-WPU. INT8 is benign; INT4 induces recovery of up to 22x, worsening with dataset difficulty. We identify the FA-RA-Q-INT4 trilemma: no method simultaneously achieves strong forgetting, high utility, and quantization robustness. A dense Pareto sweep reveals a sharp phase transition: once robustness is achieved, retain accuracy collapses regardless of further tuning. To address this, we propose DURABLEUN-SAF (Sharpness-Aware Forgetting), a quantization-aware objective using Straight-Through Estimator gradients through INT4 rounding. DURABLEUN-SAF is the only method to achieve a stable empirical (0.047, {BF16, INT8, INT4})-durability certificate: Q-INT4 = 0.043 ± 0.002, cert rate = 3/3, versus SalUn's cert rate of 1/3 at its own published hyperparameters. We call for Q-INT4 to be adopted as a standard evaluation metric alongside FA and RA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that INT4 quantization systematically restores forgotten content in machine-unlearned LLMs (up to 22x recovery) even when models pass BF16 compliance audits, terming this the quantization recovery attack (QRA). It presents the first systematic evaluation of seven unlearning methods via LoRA adapters on LLaMA-3-8B-Instruct across TOFU, MUSE-News, and WikiBio-WPU datasets, finds INT8 benign but INT4 problematic, identifies a FA-RA-Q-INT4 trilemma with a sharp phase transition in Pareto fronts, and proposes DURABLEUN-SAF (a sharpness-aware forgetting objective using straight-through estimator gradients through INT4 rounding) that alone achieves a stable empirical durability certificate across BF16/INT8/INT4. The work advocates adopting Q-INT4 as a standard metric alongside forgetting and utility.

Significance. If the central results hold after addressing isolation concerns, the work is significant for highlighting a practical gap in unlearning evaluations that ignore deployment quantization. The empirical measurements across multiple methods and datasets, the identification of the trilemma and phase transition, and the proposal of DURABLEUN-SAF with its (0.047, {BF16, INT8, INT4})-durability certificate provide concrete, falsifiable contributions that could influence both research and production practices. The call for Q-INT4 as a new metric is a useful normative suggestion grounded in the observed recovery factors.

major comments (2)
  1. [Experimental evaluation; also abstract] The claim that recovery is 'quantization-induced' and due to the INT4 step itself is load-bearing for the QRA and trilemma results, yet the protocol applies INT4 only to LoRA-unlearned models in the NF4 regime. No controls are described for quantizing the base model, standard fine-tuned models, or non-LoRA unlearning to isolate whether recovery arises from quantization per se versus interactions with the specific unlearning adapters and rounding scheme. This leaves the weakest assumption unaddressed and weakens attribution; adding such ablations would directly test the central claim.
  2. [§5.2, Pareto sweep and phase transition] The dense Pareto sweep is used to support the trilemma and the observation of a 'sharp phase transition' where robustness causes accuracy collapse. However, the manuscript does not specify the exact hyperparameters varied, the sampling density, or how the collapse is quantified (e.g., a threshold on the utility drop), making the phase-transition claim difficult to verify or reproduce independently.
minor comments (3)
  1. [Abstract] The durability certificate is reported with specific numbers (Q-INT4 = 0.043 ± 0.002, cert rate = 3/3), but the precise definition of 'cert rate' and the thresholds used to declare stability across the three precisions are not stated in the summary; a compact formal definition should appear in the abstract or early introduction for immediate clarity.
  2. [Notation and metrics] The paper introduces FA, RA, and Q-INT4 but does not explicitly restate their formulas or the exact forgetting/utility definitions used to compute the 22x recovery factor in the main text; adding a short 'Notation' subsection or table would prevent ambiguity when readers compare to prior unlearning work.
  3. [SAF objective] The free parameter 'sharpness radius in SAF objective' is listed as tunable; an ablation or sensitivity analysis on this radius (and its interaction with the STE) would strengthen the claim that DURABLEUN-SAF is robust rather than tuned to the reported certificate.
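For readers weighing the same ambiguity, one plausible reading of the certificate (an assumption, since the paper's summary does not define it) is that a seed certifies when its recovery metric stays below the 0.047 threshold at every precision, and the cert rate counts certifying seeds. The metric values below are hypothetical.

```python
THRESHOLD = 0.047                    # epsilon in the (0.047, {BF16, INT8, INT4}) certificate
PRECISIONS = ("BF16", "INT8", "INT4")

def certifies(metrics: dict) -> bool:
    """Assumed reading: a seed certifies if its recovery metric stays
    below the threshold at every deployment precision."""
    return all(metrics[p] < THRESHOLD for p in PRECISIONS)

def cert_rate(seeds: list) -> str:
    """Certified seeds over total seeds, e.g. the abstract's '3/3'
    versus SalUn's '1/3'."""
    hits = sum(certifies(m) for m in seeds)
    return f"{hits}/{len(seeds)}"
```

Under this reading, three seeds with Q-INT4 of 0.041-0.045 would yield "3/3", while a single INT4 excursion above 0.047 in two of three seeds would yield "1/3".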

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the attribution of the quantization recovery attack and improving the reproducibility of the Pareto analysis. We address each major comment below and will incorporate revisions to the manuscript.

Point-by-point responses
  1. Referee: [Experimental evaluation; also abstract] The claim that recovery is 'quantization-induced' and due to the INT4 step itself is load-bearing for the QRA and trilemma results, yet the protocol applies INT4 only to LoRA-unlearned models in the NF4 regime. No controls are described for quantizing the base model, standard fine-tuned models, or non-LoRA unlearning to isolate whether recovery arises from quantization per se versus interactions with the specific unlearning adapters and rounding scheme. This leaves the weakest assumption unaddressed and weakens attribution; adding such ablations would directly test the central claim.

    Authors: We agree that additional controls would strengthen the isolation of the quantization effect. Our current experiments focus on the practical deployment scenario of LoRA-unlearned models quantized to NF4, which is the regime where we observe up to 22x recovery. In the revised manuscript, we will add the requested ablations: (i) INT4 quantization of the base LLaMA-3-8B-Instruct model, (ii) quantization of standard fine-tuned (non-unlearned) models, and (iii) at least one non-LoRA unlearning baseline. These will be reported alongside the existing results to demonstrate that the pronounced recovery is tied to the combination of unlearning adapters and INT4 rounding, thereby supporting the QRA claim more rigorously. revision: yes

  2. Referee: [§5.2, Pareto sweep and phase transition] The dense Pareto sweep is used to support the trilemma and the observation of a 'sharp phase transition' where robustness causes accuracy collapse. However, the manuscript does not specify the exact hyperparameters varied, the sampling density, or how the collapse is quantified (e.g., a threshold on the utility drop), making the phase-transition claim difficult to verify or reproduce independently.

    Authors: We acknowledge the need for greater specificity to enable reproduction. In the revised §5.2, we will explicitly state: the hyperparameters varied (forgetting strength coefficient in [0.1, 10], LoRA rank in {8, 16, 32}, learning rate in {1e-5, 5e-5, 1e-4}), the sampling density (15 uniformly spaced values of the forgetting coefficient crossed with the three ranks and three learning rates, yielding 135 configurations), and the collapse quantification (utility drop defined as ROUGE-L falling below 75% of the pre-unlearning baseline while Q-INT4 remains below 0.05). This will make the observed sharp phase transition verifiable and allow readers to reproduce the Pareto fronts. revision: yes
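The sweep and the collapse criterion in this response can be written out directly. The grid values are those listed in the response (their product is 135 configurations); `is_collapsed` is an illustrative helper under the stated criterion, not the authors' code.

```python
from itertools import product

# Grid from the rebuttal: 15 forgetting-strength values in [0.1, 10],
# three LoRA ranks, three learning rates.
strengths = [0.1 + i * (10.0 - 0.1) / 14 for i in range(15)]
ranks = [8, 16, 32]
learning_rates = [1e-5, 5e-5, 1e-4]
configs = list(product(strengths, ranks, learning_rates))  # 15 * 3 * 3 = 135

def is_collapsed(rouge_l: float, baseline_rouge_l: float, q_int4: float) -> bool:
    """Collapse per the stated criterion: utility (ROUGE-L) falls below
    75% of the pre-unlearning baseline while Q-INT4 stays under 0.05."""
    return rouge_l < 0.75 * baseline_rouge_l and q_int4 < 0.05
```

Sweeping `configs` and flagging each run with `is_collapsed` would reproduce the claimed phase-transition plot, assuming the evaluation harness exposes per-config ROUGE-L and Q-INT4.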

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurements

full rationale

The paper's central claims rest on direct experimental measurements of recovery rates under INT4 quantization across seven unlearning methods, three datasets, and multiple precision regimes. No derivation chain, prediction, or first-principles result is presented that reduces, via the paper's own equations, to a fitted parameter or self-defined quantity. The proposed DURABLEUN-SAF objective employs the standard Straight-Through Estimator for quantization-aware training, which is an external, non-circular technique. No load-bearing self-citations or uniqueness theorems are invoked. All reported durability certificates and Pareto observations are independent empirical outcomes, not tautological renamings or constructions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard machine-learning assumptions about gradient flow through non-differentiable operations and the validity of the chosen evaluation metrics; the main additions are the new objective and the empirical observations.

free parameters (1)
  • sharpness radius in SAF objective
    Hyperparameter controlling the sharpness-aware term in the proposed training objective; its value is not stated in the abstract.
axioms (1)
  • domain assumption Straight-Through Estimator provides a usable gradient approximation through INT4 rounding
    Invoked to enable end-to-end training of the quantization-aware unlearning objective.
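The sharpness radius plays the standard role from sharpness-aware minimization (Foret et al., 2021): perturb the weights along the normalized gradient before re-evaluating the loss, so that training favors flat minima. A minimal sketch of that perturbation step (generic SAM, not the paper's exact SAF update) looks like:

```python
def sam_perturbation(grad: list, rho: float) -> list:
    """Ascent step of sharpness-aware minimization: move the weights by
    rho * g / ||g|| before recomputing the (forgetting) loss gradient.
    rho is the 'sharpness radius' free parameter; a larger rho enforces
    a flatter loss basin, which is what quantization robustness needs."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [rho * g / norm for g in grad]
```

In the paper's setting this perturbation would presumably be composed with the STE-quantized forward pass, so flatness is measured around the INT4-rounded weights rather than the full-precision ones.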

pith-pipeline@v0.9.0 · 5595 in / 1351 out tokens · 62281 ms · 2026-05-11T01:46:19.641219+00:00 · methodology

discussion (0)

