pith. sign in

arxiv: 2606.02920 · v1 · pith:D53WJLUYnew · submitted 2026-06-01 · 💻 cs.LG

Fast Unlearning at Scale via Margin Self-Correction

Pith reviewed 2026-06-28 15:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords machine unlearninglanguage model unlearninglogit marginonline stopping criterionforget-retain trade-offTOFU benchmarkMUSE benchmarkefficient fine-tuning
0
0 comments X

The pith

MASC reduces logit gaps on forget sequences until an average threshold is met, providing an online stop for unlearning without downstream checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MASC as a method to update language models so they behave as if they had never seen certain training sequences, while keeping performance on other data. It works by shrinking the logit difference between the original next token and its closest alternatives during fine-tuning on the sequences to forget. Training halts automatically once this gap falls below a level across most token positions in the forget set, removing the need to save checkpoints and test them later. Experiments on TOFU, MUSE News, and MUSE Books show this yields forget-retain balances comparable to prior methods but with far less total computation. The same approach also shows better retain utility as model size grows.

Core claim

Given a text sequence to be forgotten, MASC actively reduces the logit gap between the original next token and the most likely alternatives. It outputs a final model once this gap is small on average over a sufficiently large proportion of token positions across all forget sequences.

What carries the argument

MArgin Self-Correction (MASC), an online stopping rule based on shrinking the average logit gap over forget-sequence token positions.

If this is right

  • Unlearning runs can end without storing multiple checkpoints or running repeated validation evaluations.
  • The forget-retain trade-off improves as the number of model parameters increases, with forget metrics holding steady and retain utility rising.
  • The method applies across text unlearning benchmarks including TOFU, MUSE News, and MUSE Books while using only a fraction of baseline compute.
  • Training stops at the desired trade-off point rather than continuing past it and then selecting afterward.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Internal signals such as logit gaps may serve as sufficient proxies for unlearning success in other sequence models if similar margin definitions can be constructed.
  • The scaling observation suggests that parameter count itself may help separate forget and retain behavior under margin-based updates.
  • The stopping rule could be adapted to streaming or continual unlearning settings where new forget requests arrive over time.

Load-bearing premise

That shrinking the average logit gap on forget sequences by itself produces successful unlearning that keeps retain utility high without any separate downstream checks.

What would settle it

Run MASC to its stopping point on a forget set and then measure whether the model can still generate the original forgotten sequences at high probability or whether retain-task accuracy has fallen below the level of fixed-budget baselines.

Figures

Figures reproduced from arXiv: 2606.02920 by Alexander Shevchenko, Fanny Yang, Federico Di Gennaro.

Figure 1
Figure 1. Figure 1: Top: Wall-clock runtimes (in seconds) of methods with similar retain–forget trade-off. Bottom: Forget–retain trade-offs of timed methods. Each metric is in [0, 1] (the higher the better), and MASC is competitive (i.e. not Pareto dominated) with the others. predefined and fixed compute budget, which is both inefficient (cf [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (Left:) Correlation between stopping statistic Vˆ ρ and forget–retain trade-off. (Middle:) Pareto frontier (top-right is the ideal) of MASC training dynamics. (Right:) Vˆ ρ versus number of forget tokens seen during training for three different unlearning methods. [Plots refer to TOFU]. where Ty is the length of sequence y. In particular, Dval is drawn at random as a small off-batch subset of Dfg (of size … view at source ↗
Figure 3
Figure 3. Figure 3: Learning stage. Fitted scaling laws (log-log plot). Improving the efficiency of unlearning methods is a neces￾sary step toward scaling these procedures to increasingly large models. However, to the best of our knowledge, there is no systematic study on how model size affects unlearning efficacy. Some related recent studies suggest that memorization increases with scale during training [6, 31, 36] so that l… view at source ↗
Figure 4
Figure 4. Figure 4: Unlearning stage. Cross-scale behavior after unlearning for MASC and SimNPO. All the metrics are plotted such that the higher the better. Interestingly, the picture changes after un￾learning. Despite different starting memo￾rization profiles (cf [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pareto Frontier computed from the metrics of Tables 1 and 2 where top-right is better. For [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Effect of the learning rate on the MASC stopping statistic [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Fitted scaling trends for MASC and SimNPO after unlearning. [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
read the original abstract

Language-model unlearning updates a trained model to behave as if it had not seen selected training examples, while preserving utility and avoiding costly retraining. Existing approaches typically fine-tune the pretrained model with a fixed training budget and select the final model afterwards by evaluating several saved checkpoints on downstream validation data. Two sources of unnecessary computation limit scalability: training beyond the desired forget-retain trade-off, and checkpoint selection that requires extra storage and repeated evaluations. To address these limitations, we introduce MArgin Self-Correction (MASC), an efficient unlearning method with an online stopping rule that does not require downstream evaluation. Given a text sequence to be forgotten, MASC actively reduces the logit gap between the original next token and the most likely alternatives. It outputs a final model once this gap is small on average over a sufficiently large proportion of token positions across all forget sequences. On TOFU, MUSE News, and MUSE Books, MASC achieves a competitive forget-retain trade-off at a fraction of the computational cost of existing baselines. We further observe that as we increase model size (a.k.a. number of parameters), the trade-offs improve for both MASC and SimNPO -- the forget metrics remain comparable while retain utility increases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Margin Self-Correction (MASC), an unlearning algorithm for language models that iteratively reduces the logit gap between the original next token and the most likely alternatives on forget sequences. It introduces an online stopping rule that halts training once this gap falls below a threshold on a large fraction of token positions across forget data, without requiring downstream retain-set evaluations or checkpoint storage. The central empirical claim is that MASC achieves a competitive forget-retain trade-off on TOFU, MUSE News, and MUSE Books at substantially lower computational cost than existing baselines, with both MASC and SimNPO showing improved trade-offs as model scale increases.

Significance. If the online stopping rule can be shown to reliably preserve retain utility without post-hoc downstream checks, MASC would represent a meaningful advance in scalable unlearning by removing the need for repeated evaluations and extra storage. The reported scaling trend (stable forget metrics with rising retain utility) is also potentially important if substantiated. The absence of quantitative metrics, error bars, ablations, and exact implementation details in the abstract, however, prevents assessment of whether these benefits are realized.

major comments (2)
  1. [Abstract] Abstract: the claim that MASC 'achieves a competitive forget-retain trade-off' supplies no numerical values, error bars, exact implementation details, or ablation studies, so the data-to-claim link cannot be evaluated.
  2. [Abstract / §3] Stopping rule (described in Abstract and §3): the criterion is defined solely from the average logit gap on forget sequences and does not incorporate any retain-set signal; because the update direction is derived exclusively from forget data, nothing in the construction prevents incidental degradation of retain logits, yet the manuscript reports only post-hoc retain numbers rather than demonstrating that the chosen threshold lands before retain degradation begins.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that MASC 'achieves a competitive forget-retain trade-off' supplies no numerical values, error bars, exact implementation details, or ablation studies, so the data-to-claim link cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by quantitative details. In the revised version we will report specific forget and retain metrics (e.g., exact scores on TOFU, MUSE News, and MUSE Books), computational cost reductions relative to baselines, and the observed scaling trend with model size. Key implementation parameters and a brief note on multiple-run variability will also be added. revision: yes

  2. Referee: [Abstract / §3] Stopping rule (described in Abstract and §3): the criterion is defined solely from the average logit gap on forget sequences and does not incorporate any retain-set signal; because the update direction is derived exclusively from forget data, nothing in the construction prevents incidental degradation of retain logits, yet the manuscript reports only post-hoc retain numbers rather than demonstrating that the chosen threshold lands before retain degradation begins.

    Authors: The stopping rule is deliberately constructed from forget data alone to remove the need for retain-set evaluations and checkpoint storage. While the construction itself provides no theoretical guarantee against retain degradation, our experiments show that the chosen threshold yields competitive retain utility. We will add an analysis (new figure or table) that plots retain performance as a function of the stopping threshold to explicitly demonstrate that the selected operating point precedes measurable retain degradation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in claimed method

full rationale

The paper presents MASC as an algorithmic procedure that reduces logit gaps on forget sequences and stops when the average gap falls below threshold on a sufficient fraction of positions; the forget-retain trade-off is then reported via separate post-hoc evaluation on retain benchmarks. No derivation chain reduces a claimed prediction or result to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing self-citation or imported uniqueness theorem appears in the provided text. The stopping rule is a heuristic defined directly on forget data, with retain utility assessed independently afterward, rendering the method self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or additional axioms are stated beyond the core domain assumption of the method.

axioms (1)
  • domain assumption Reducing the logit gap between the original next token and alternatives on forget sequences is sufficient to achieve unlearning while preserving utility.
    This premise underpins the online stopping rule and is invoked to replace downstream validation.

pith-pipeline@v0.9.1-grok · 5747 in / 1141 out tokens · 27528 ms · 2026-06-28T15:19:24.816833+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 12 canonical work pages · 4 internal anchors

  1. [1]

    Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

  2. [2]

    Machine unlearning

    Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In2021 IEEE symposium on security and privacy (SP), pages 141–159. IEEE, 2021

  3. [3]

    California consumer privacy act of 2018, AB 375, 2018

    California Legislature. California consumer privacy act of 2018, AB 375, 2018. URL https: //ca.gov. Cal. Civ. Code§1798.100 - 1798.199

  4. [4]

    Rwku: Benchmarking real-world knowledge unlearning for large language models.Advances in Neural Information Processing Systems, 37:98213–98263, 2024

    Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, Jun Zhao, et al. Rwku: Benchmarking real-world knowledge unlearning for large language models.Advances in Neural Information Processing Systems, 37:98213–98263, 2024

  5. [5]

    Towards making systems forget with machine unlearning

    Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pages 463–480. IEEE, 2015

  6. [6]

    Quantifying memorization across neural language models

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. InThe Eleventh International Conference on Learning Representations, 2022

  7. [7]

    Unlearn what you want to forget: Efficient unlearning for llms

    Jiaao Chen and Diyi Yang. Unlearn what you want to forget: Efficient unlearning for llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12041–12052, 2023

  8. [8]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  9. [9]

    On effects of steering latent representation for large language model unlearning

    Huu-Tien Dang, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue. On effects of steering latent representation for large language model unlearning. InProceedings of the AAAI Conference on Artificial Intelligence, pages 23733–23742, 2025

  10. [10]

    Undial: Self-distillation with adjusted logits for robust unlearning in large language models

    Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, and Ivan Vuli´c. Undial: Self-distillation with adjusted logits for robust unlearning in large language models. InPro- ceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), ...

  11. [11]

    Openunlearning: Accelerating LLM unlearning via unified benchmarking of methods and metrics

    Vineeth Dorna, Anmol Reddy Mekala, Wenlong Zhao, Andrew McCallum, J Zico Kolter, Zachary Chase Lipton, and Pratyush Maini. Openunlearning: Accelerating LLM unlearning via unified benchmarking of methods and metrics. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net...

  12. [12]

    arXiv preprint arXiv:2310.02238 , year=

    Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning in llms, 2023. URLhttps://arxiv.org/abs/2310.02238

  13. [13]

    Con- strained entropic unlearning: A primal-dual framework for large language models

    Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, and Mahyar Fazlyab. Con- strained entropic unlearning: A primal-dual framework for large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=ZtB34bQI54

  14. [14]

    Towards LLM unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond

    Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, and Sijia Liu. Towards LLM unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=zZjLv6F0Ks

  15. [15]

    Simplicity prevails: Rethinking negative preference optimization for LLM unlearning

    Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=JbvSQm5h1l. 11

  16. [16]

    Dissecting fine-tuning unlearning in large language models

    Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, and Haiqin Yang. Dissecting fine-tuning unlearning in large language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3933–3941, 2024

  17. [17]

    The euro- pean union general data protection regulation: what it is and what it means.Information & communications technology law, 28(1):65–98, 2019

    Chris Jay Hoofnagle, Bart Van Der Sloot, and Frederik Zuiderveen Borgesius. The euro- pean union general data protection regulation: what it is and what it means.Information & communications technology law, 28(1):65–98, 2019

  18. [18]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  19. [19]

    Editing Models with Task Arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

  20. [20]

    Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference.Advances in Neural Information Processing Systems, 37:12581–12611, 2024

    Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana R Kompella, Sijia Liu, and Shiyu Chang. Reversing the forget-retain objectives: An efficient llm unlearning framework from logit difference.Advances in Neural Information Processing Systems, 37:12581–12611, 2024

  21. [21]

    Copyright violations and large language models

    Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. Copyright violations and large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7403–7412, 2023

  22. [22]

    Preserving privacy through dememorization: An unlearning technique for mitigating memorization risks in language models

    Aly Kassem, Omar Mahmoud, and Sherif Saad. Preserving privacy through dememorization: An unlearning technique for mitigating memorization risks in language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4360–4379, 2023

  23. [23]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  24. [24]

    Lee, Addie Foote, Alex Infanger, Leni Shor, Harish K Kamath, Jacob Goldman- Wetzler, Bryce Woodworth, Alex Cloud, and Alexander Matt Turner

    Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish K Kamath, Jacob Goldman- Wetzler, Bryce Woodworth, Alex Cloud, and Alexander Matt Turner. Distillation robustifies unlearning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  25. [25]

    URLhttps://openreview.net/forum?id=UTGjik64IK

  26. [26]

    Biobert: a pre-trained biomedical language representation model for biomedical text mining.Bioinformatics, 36(4):1234–1240, 2020

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining.Bioinformatics, 36(4):1234–1240, 2020

  27. [27]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun ...

  28. [28]

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-V oss, Cort B Breuer, Andy Z...

  29. [29]

    Rethinking machine unlearning for large language models.Nature Machine Intelligence, 7(2):181–194, 2025

    Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models.Nature Machine Intelligence, 7(2):181–194, 2025

  30. [30]

    Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge

    Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge. arXiv preprint arXiv:2404.05880, 2024

  31. [31]

    Quark: Controllable text generation with reinforced unlearning

    Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. Quark: Controllable text generation with reinforced unlearning. Advances in neural information processing systems, 35:27591–27609, 2022

  32. [32]

    Scaling laws for fact memorization of large language models

    Xingyu Lu, Xiaonan Li, Qinyuan Cheng, Kai Ding, Xuan-Jing Huang, and Xipeng Qiu. Scaling laws for fact memorization of large language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 11263–11282, 2024

  33. [33]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

  34. [34]

    TOFU: A Task of Fictitious Unlearning for LLMs

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms.arXiv preprint arXiv:2401.06121, 2024

  35. [35]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. InPsychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989

  36. [36]

    Did the neurons read your book? document-level membership inference for large language models

    Matthieu Meeus, Shubham Jain, Marek Rei, and Yves-Alexandre de Montjoye. Did the neurons read your book? document-level membership inference for large language models. In33rd USENIX Security Symposium (USENIX Security 24), pages 2369–2385, 2024

  37. [37]

    How much do language models memorize? arXiv preprint arXiv:2505.24832, 2025

    John X Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G Edward Suh, Alexander M Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize? arXiv preprint arXiv:2505.24832, 2025

  38. [38]

    Using an llm to help with code understanding

    Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an llm to help with code understanding. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024

  39. [39]

    In-context unlearning: Language models as few-shot unlearners

    Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few-shot unlearners. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceed- ings of the 41st International Conference on Machine Learning, volume 235 ofProceed- ings of Mach...

  40. [40]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  41. [41]

    RL’s razor: Why online reinforcement learning forgets less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s razor: Why online reinforcement learning forgets less. InThe Fourteenth International Conference on Learning Representations,

  42. [42]

    URLhttps://openreview.net/forum?id=7HNRYT4V44

  43. [43]

    Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

    Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. Latent adver- sarial training improves robustness to persistent harmful behaviors in llms.arXiv preprint arXiv:2407.15549, 2024

  44. [44]

    Muse: Machine unlearning six- way evaluation for language models.arXiv preprint arXiv:2407.06460, 2024

    Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A Smith, and Chiyuan Zhang. Muse: Machine unlearning six-way evaluation for language models.arXiv preprint arXiv:2407.06460, 2024. 13

  45. [45]

    Beyond memorization: Violating privacy via inference with large language models

    Robin Staab, Mark Vero, Mislav Balunovic, and Martin Vechev. Beyond memorization: Violating privacy via inference with large language models. InThe Twelfth International Conference on Learning Representations, 2024

  46. [46]

    Tamper-resistant safeguards for open-weight LLMs

    Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika. Tamper-resistant safeguards for open-weight LLMs. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id...

  47. [47]

    Guardrail baselines for unlearning in llms.arXiv preprint arXiv:2403.03329, 2024

    Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith. Guardrail baselines for unlearning in llms.arXiv preprint arXiv:2403.03329, 2024

  48. [48]

    Unilogit: Robust Machine Unlearning for LLM s Using Uniform-Target Self-Distillation

    Stefan Vasilev, Christian Herold, Baohao Liao, Seyyed Hadi Hashemi, Shahram Khadivi, and Christof Monz. Unilogit: Robust machine unlearning for LLMs using uniform-target self- distillation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 2245...

  49. [49]

    Rkld: Reverse kl-divergence- based knowledge distillation for unlearning personal information in large language models

    Bichen Wang, Yuzhe Zi, Yixin Sun, Yanyan Zhao, and Bing Qin. Rkld: Reverse kl-divergence- based knowledge distillation for unlearning personal information in large language models. arXiv preprint arXiv:2406.01983, 2024

  50. [50]

    GRU: Mitigating the trade-off between unlearning and retention for LLMs

    Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, and Bo Han. GRU: Mitigating the trade-off between unlearning and retention for LLMs. InForty-second Inter- national Conference on Machine Learning, 2025. URLhttps://openreview.net/forum? id=EAjhGr1Oeo

  51. [51]

    BloombergGPT: A Large Language Model for Finance

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance.arXiv preprint arXiv:2303.17564, 2023

  52. [52]

    Large language model unlearning.Advances in Neural Information Processing Systems, 37:105425–105475, 2024

    Yuanshun Yao and Xiaojun Xu. Large language model unlearning.Advances in Neural Information Processing Systems, 37:105425–105475, 2024

  53. [53]

    Negative preference optimization: From catastrophic collapse to effective unlearning

    Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. InFirst Conference on Language Modeling, 2024. URLhttps://openreview.net/forum?id=MXLBXjQkmb

  54. [54]

    Catastrophic failure of LLM unlearning via quantization

    Zhiwei Zhang, Fali Wang, Xiaomin Li, Zongyu Wu, Xianfeng Tang, Hui Liu, Qi He, Wenpeng Yin, and Suhang Wang. Catastrophic failure of LLM unlearning via quantization. InThe Thirteenth International Conference on Learning Representations, 2025. URL https:// openreview.net/forum?id=lHSeDYamnz

  55. [55]

    target versus nearest rivals

    Yisheng Zhong, Zhengbang Yang, and Zhuangdi Zhu. DUET: Distilled LLM unlearning from an efficiently contextualized teacher. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=Xa6QRrXrKX. 14 A Additional related work Beyond the likelihood-reversal and preference-optimization baselines considered ...