Measuring the Depth of LLM Unlearning via Activation Patching

Dohyun Kim; Jaemin Jo; Jaeung Lee

arXiv: 2605.24614 · v1 · pith:TGXTKYSZnew · submitted 2026-05-23 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-30 13:30 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.LG

keywords: LLM unlearning, activation patching, evaluation metrics, mechanistic depth, residual knowledge, privacy protection, AI safety, white-box evaluation

The pith

The Unlearning Depth Score quantifies how completely target knowledge has been erased from an LLM's internal layers using activation patching against a retain baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can retain target knowledge in their internal states even after unlearning makes that knowledge appear erased from their outputs. Existing output-based metrics miss this residual knowledge, while many white-box alternatives require extra training or dataset-specific adaptations. The paper introduces UDS, which first locates the layers that encode the target knowledge using a retain model baseline, then scores how much of that knowledge was removed on a 0-1 scale. A meta-evaluation on 150 unlearned models from eight unlearning methods shows UDS ranking highest in faithfulness and robustness among the twenty metrics evaluated. If the approach holds, it supplies a general tool for verifying that privacy or safety goals have been met at the mechanistic level.

Core claim

UDS identifies the layers that encode target knowledge by patching hidden states from a retain model into the full model, then repeats the patching with the unlearned model as the source to measure what fraction of that knowledge was erased, producing a 0-1 depth score. Across 150 unlearned models spanning eight methods, UDS shows the highest faithfulness and robustness among twenty evaluated metrics. Case studies indicate that white-box metrics can disagree at the layer level and that erasure depth varies by example.

What carries the argument

The Unlearning Depth Score (UDS), a metric that locates layers holding target knowledge via activation patching against a retain model and then measures the fraction erased on a 0-1 scale.
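The page describes the scoring only in outline, so the arithmetic can be illustrated with a minimal sketch under stated assumptions. Here `s1_delta` holds the per-layer degradation from patching the retain model into the full model, and `s2_delta` the degradation from patching the unlearned model into the full model. The clipping of the per-layer ratio to [0, 1], the weighting by the Stage 1 delta, the threshold `tau`, and the function name `uds_sketch` are all hypothetical choices made for illustration, not the paper's definition.

```python
def uds_sketch(s1_delta, s2_delta, tau=0.05):
    """Aggregate per-layer patching deltas into a 0-1 score (illustrative only).

    s1_delta[l]: degradation when layer l is patched from the retain model.
    s2_delta[l]: degradation when layer l is patched from the unlearned model.
    tau: hypothetical threshold below which a layer is treated as not
         encoding the target knowledge and is skipped.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for d1, d2 in zip(s1_delta, s2_delta):
        if d1 <= tau:
            continue  # assumption: this layer carries no measurable knowledge
        # Assumed erasure ratio: share of the retain-level degradation that
        # the unlearned model reproduces, clipped to [0, 1].
        ratio = min(max(d2 / d1, 0.0), 1.0)
        weighted_sum += d1 * ratio  # assumption: weight layers by Stage 1 delta
        weight_total += d1
    if weight_total == 0.0:
        return None  # no layer passed the threshold, so the score is undefined
    return weighted_sum / weight_total


# Toy deltas: layer 0 encodes nothing, layer 1 is fully erased, layer 2 is untouched.
score = uds_sketch([0.0, 0.4, 0.6], [0.0, 0.4, 0.0])
```

In this toy case only the lighter-weighted of the two knowledge-bearing layers is erased, so the score falls below one half. A different weighting scheme would give a different number, which is one reason the aggregation rule should be read from the paper itself.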

If this is right

White-box metrics can produce conflicting layer-level diagnoses of the same unlearned model.
Erasure depth differs across individual examples even within one unlearning run.
UDS supplies a single scalar that can be added to existing unlearning benchmarks.
Evaluation pipelines can be streamlined by replacing multiple output metrics with one mechanistic score.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could target specific layers identified by UDS to make future unlearning methods more efficient.
The same patching technique might expose residual knowledge in other model-editing tasks such as fact correction.
If UDS consistently reports shallow erasure, current methods may need redesign to reach deeper layers.
Benchmarks could track UDS trends over successive model releases to measure progress in unlearning reliability.

Load-bearing premise

Activation patching against a retain model baseline can locate the exact layers that encode the target knowledge and measure its removal without introducing artifacts from the unlearning process itself.

What would settle it

Run UDS on an unlearned model where output behavior matches a fully retained model yet internal activations still differ from the retain baseline at the identified layers or where outputs appear erased but UDS reports near-zero depth.

Figures

Figures reproduced from arXiv: 2605.24614 by Dohyun Kim, Jaemin Jo, Jaeung Lee.

**Figure 1.** Overview of UDS for a single forget set example. (A) Stage 1 patches hidden states from Mret into Mfull at each layer to measure how deeply the forget set knowledge is encoded. (B) Stage 2 repeats this with Munl as source to quantify how much encoded knowledge remains recoverable. (C) Stage 2 degradation is compared against Stage 1 at each layer to compute erasure ratios, which are weighted and aggregated …
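The patching step in the caption can be sketched on a toy model. This is an illustration under loud assumptions: each "model" is a list of scalar functions standing in for transformer layers, the helper names `run_with_cache` and `patched_output` are hypothetical, and the final scalar stands in for whatever output score the paper actually measures.

```python
def run_with_cache(layers, x):
    """Run a stack of layers and return the hidden state after each one."""
    hidden = []
    h = x
    for layer in layers:
        h = layer(h)
        hidden.append(h)
    return hidden


def patched_output(target_layers, source_hidden, layer_idx, x):
    """Run the target model, overwriting the hidden state after `layer_idx`
    with the state the source model produced at the same layer."""
    h = x
    for i, layer in enumerate(target_layers):
        h = layer(h)
        if i == layer_idx:
            h = source_hidden[i]  # the patch
    return h


# Toy stand-ins (assumption): the full model adds "knowledge" at layer 1,
# while the retain model never saw it and adds nothing there.
full = [lambda h: h + 1.0, lambda h: h + 2.0, lambda h: h]
retain = [lambda h: h + 1.0, lambda h: h, lambda h: h]

x = 0.0
clean = run_with_cache(full, x)[-1]
retain_hidden = run_with_cache(retain, x)

# Stage 1 style sweep: degradation of the full model's output per patched layer.
s1_delta = [clean - patched_output(full, retain_hidden, i, x) for i in range(len(full))]
```

Patching before the knowledge is written leaves the output unchanged, while patching at or after that layer removes it. Repeating the sweep with an unlearned model as the source gives the Stage 2 deltas that the caption compares against Stage 1.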

**Figure 2.** Quantization test for Truth Ratio and ROUGE.

**Figure 4.** Mean S1 patching delta per layer for four …

**Figure 5.** Per-example UDS vs. entity token count (lr=2e-5, epoch 10; UNDIAL uses lr=1e-4). RMU variants differ by the target layer l at which the steering loss is applied (L5, L10, L15). All methods show |ρ| < 0.24 with mixed sign, indicating no consistent directional bias.
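A correlation check of this kind can be sketched in a few lines. The caption does not say which correlation ρ denotes, so this example assumes Spearman rank correlation, and the helper names and toy data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-example data: entity token counts and UDS values.
token_counts = [3, 1, 2, 5]
uds_values = [0.30, 0.10, 0.20, 0.50]
rho = spearman_rho(token_counts, uds_values)
```

The toy data are perfectly monotone, so they show the opposite of the figure's finding. A value of |ρ| near zero on real per-example scores is what would support the caption's claim of no consistent bias toward longer or shorter entities.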

**Figure 6.** Quantization robustness (Q) for all 20 metrics after utility and faithfulness filtering. Each subplot plots the metric value before (x) vs. after (y) NF4 4-bit quantization, with the number of models showing recovery (rec) or destruction (des). n is the number of models that passed both filters for that metric. The background gradient indicates deviation from the y = x reference: white = stable, red = unst…

**Figure 7.** Quantization robustness (Q) for all 20 metrics after utility filtering only. Each subplot plots the metric value before (x) vs. after (y) NF4 4-bit quantization, with the number of models showing recovery (rec) or destruction (des). n is the number of models that passed the utility filter for that metric. The background gradient indicates deviation from the y = x reference: white = stable, red = unstable. …

**Figure 8.** Relearning robustness (R) for all 20 metrics after utility and faithfulness filtering. Each subplot plots the metric value before (x) vs. after (y) one epoch of relearning, with the number of models showing over-recovery (over) or under-recovery (under). n is the number of models that passed both filters for that metric. The dashed line shows y = x + ∆ret (expected behavior given the retain model’s shift); …
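The comparison against the dashed line can be illustrated as a residual computation. This is a sketch under assumptions: the tolerance `tol`, the function names, and the toy values are hypothetical. The caption does not state how the sign of the residual maps to "over" and "under", so the sketch only reports which side of the line each model falls on.

```python
def relearning_residuals(before, after, delta_ret):
    """Signed vertical distance of each model from the line y = x + delta_ret."""
    return [y - (x + delta_ret) for x, y in zip(before, after)]


def count_sides(residuals, tol=0.05):
    """Count models above, on, and below the reference line.

    tol is a hypothetical tolerance band; the source does not give one.
    """
    above = sum(1 for r in residuals if r > tol)
    below = sum(1 for r in residuals if r < -tol)
    return {"above": above, "on_line": len(residuals) - above - below, "below": below}


# Hypothetical metric values for three models before and after relearning.
before = [0.25, 0.5, 0.75]
after = [0.5, 0.625, 0.75]
residuals = relearning_residuals(before, after, delta_ret=0.125)
sides = count_sides(residuals)
```

A metric whose points cluster on the line moves after relearning only as much as the retain model's shift predicts, which is the behavior the caption labels as expected.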

Original abstract

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UDS is a new patching-based score for unlearning depth that moves past output metrics, but the meta-evaluation superiority claim needs more transparent validation of the retain baseline.

The paper's main contribution is the Unlearning Depth Score, which locates the layers encoding target knowledge by patching hidden states from a retain-model baseline into the full model, then repeats the patching with the unlearned model as the source and scores erasure on a 0-1 scale. This is genuinely new as a general metric; prior work either stayed at outputs or required extra training.

It does a few things cleanly. The authors run a large meta-evaluation across 150 models and eight unlearning methods, compare 20 metrics, and release code plus data. The case studies showing that white-box metrics can disagree at specific layers and that erasure depth varies by example are practical and worth having. They also give concrete guidelines for adding UDS to existing benchmarks.

The soft spots are mostly around the central claim. UDS depends on the retain model to define which layers matter, and the stress-test note is right that any distribution shift or patching artifact from the unlearning process itself could bias the layer selection and therefore the whole faithfulness ranking. The abstract does not spell out how layers are chosen or whether the baseline is run on the same prompts, so it is difficult to judge how much the reported superiority over other metrics is robust versus sensitive to those choices. The 150-model scale is good, but without seeing the exact selection procedure or ablation on the baseline, the evidence for UDS being the most reliable remains provisional.

This is for groups already running unlearning experiments who want an internal-representation check beyond accuracy or membership inference. It is not reshaping the field but it adds a usable tool. The work shows clear engagement with the evaluation problem and ships reproducible artifacts, so it deserves a serious referee even if the validation section will likely need expansion.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of LLM unlearning via activation patching. UDS identifies layers encoding target knowledge using a retain-model baseline, then measures erasure on a 0-1 scale. The central claim, based on a meta-evaluation across 20 metrics and 150 unlearned models spanning 8 methods, is that UDS exhibits the highest faithfulness and robustness, making it the most reliable evaluation approach; case studies also note layer-level disagreements among white-box metrics.

Significance. If the claims hold after addressing baseline assumptions, this would supply a generalizable causal metric for unlearning evaluation that improves on output-level approaches and auxiliary-training-dependent white-box methods. The scale of the meta-evaluation (150 models, 8 methods) and public code/data release are positive features that could support adoption in AI safety benchmarks.

major comments (2)

Abstract: the superiority claim in the meta-evaluation treats UDS as the reference for faithfulness, yet UDS layer identification depends on the retain-model baseline; this assumption is load-bearing because any unlearning-induced distribution shift or patching artifact would propagate into the faithfulness ranking, and the manuscript provides no explicit test isolating this risk.
Abstract: the meta-evaluation reports UDS as highest in faithfulness and robustness, but without details on the exact definition of 'faithfulness' (e.g., correlation with ground-truth erasure or recovery experiments) or how post-hoc layer selection was validated, the claim that the causal approach is 'most reliable' cannot be assessed from the given description.

minor comments (2)

The abstract states that 'erasure depth varies across examples'; including quantitative variation statistics or a table of per-example UDS scores would clarify this observation.
Clarify whether the retain model is the original pretrained model or a separately trained model, and state any hyperparameters used in the patching procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and note planned revisions to improve clarity and address the identified gaps.

Referee: Abstract: the superiority claim in the meta-evaluation treats UDS as the reference for faithfulness, yet UDS layer identification depends on the retain-model baseline; this assumption is load-bearing because any unlearning-induced distribution shift or patching artifact would propagate into the faithfulness ranking, and the manuscript provides no explicit test isolating this risk.

Authors: We acknowledge that the retain-model baseline is a core assumption for layer identification in UDS, and that the manuscript does not include an explicit isolation experiment for potential distribution shifts or patching artifacts. While the scale of the meta-evaluation (150 models across 8 methods) provides supporting evidence of robustness, we agree this is a substantive concern. In revision we will add a targeted analysis testing UDS sensitivity to baseline perturbations and discuss the implications for the faithfulness ranking. revision: yes
Referee: Abstract: the meta-evaluation reports UDS as highest in faithfulness and robustness, but without details on the exact definition of 'faithfulness' (e.g., correlation with ground-truth erasure or recovery experiments) or how post-hoc layer selection was validated, the claim that the causal approach is 'most reliable' cannot be assessed from the given description.

Authors: The abstract summarizes results concisely, but the full manuscript defines faithfulness via correlation with ground-truth erasure (verified through recovery experiments) and validates layer selection through consistency checks across methods. We agree the abstract claim would benefit from a brief definition or cross-reference. We will revise the abstract to include a short clarification of these terms while preserving length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines UDS via activation patching against a retain-model baseline and reports its superiority via a meta-evaluation on 150 models and 20 metrics. No quoted step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology. The derivation remains self-contained against the external benchmark of the meta-evaluation; no load-bearing self-citation or ansatz smuggling is exhibited in the text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on activation patching revealing causal knowledge location and the retain model providing an unbiased layer baseline; no free parameters or invented entities beyond the metric itself are specified in the abstract.

axioms (1)

Domain assumption: Activation patching can causally identify layers encoding target knowledge using a retain model baseline.
Invoked in the abstract's description of how UDS first identifies layers.

invented entities (1)

Unlearning Depth Score (UDS): no independent evidence
purpose: Quantify mechanistic depth of unlearning on a 0-1 scale.
Newly proposed metric without external falsifiable evidence beyond the paper's evaluation.

[1] [1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[3] [3]

Samyadeep Basu, Phillip Pope, and Soheil Feizi. 2021. https://openreview.net/forum?id=xHKVVHGDOEk Influence functions in deep learning are fragile . In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net

2021

[4] [4]

Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, Hoda Heidari, Anson Ho, Sayash Kapoor, Leila Khalatbari, Shayne Longpre, Sam Manning, Vasilios Mavroudis, Mantas Mazeika, Julian Michael, and 77 others. 2025. https://arxiv.org/abs/2501.17805 Intern...

work page arXiv 2025

[5] [5]

Choquette - Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot

Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette - Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. 2021. https://doi.org/10.1109/SP40001.2021.00019 Machine unlearning . In 42nd IEEE Symposium on Security and Privacy, SP 2021, San Francisco, CA, USA, 24-27 May 2021 , pages 141--159. IEEE

work page doi:10.1109/sp40001.2021.00019 2021

[6] [6]

Brown, Dawn Song, \' U lfar Erlingsson, Alina Oprea, and Colin Raffel

Nicholas Carlini, Florian Tram \` e r, Eric Wallace, Matthew Jagielski, Ariel Herbert - Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, \' U lfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting Extracting training data from large language models . In 30th USENI...

[7]

Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramón Huerta, and Ivan Vulic. 2025. https://doi.org/10.18653/V1/2025.NAACL-LONG.444 UNDIAL: Self-distillation with adjusted logits for robust unlearning in large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguis...

[8]

Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary Chase Lipton, J. Zico Kolter, and Pratyush Maini. 2025. https://arxiv.org/abs/2506.12618 OpenUnlearning: Accelerating LLM unlearning via unified benchmarking of methods and metrics. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processin...

[9]

Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, and Sijia Liu. 2025a. https://arxiv.org/abs/2502.05374 Towards LLM unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025

[10]

Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. 2025b. https://openreview.net/forum?id=JbvSQm5h1l Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, NeurIPS 2025

[11]

Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. 2024. https://proceedings.mlr.press/v235/ghandeharioun24a.html Patchscopes: A unifying framework for inspecting hidden representations of language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, volume 235 of Proc...

[12]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The Llama 3...

[13]

Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite. 2025. https://proceedings.mlr.press/v267/guo25k.html Mechanistic unlearning: Robust knowledge unlearning and editing via mechanistic localization. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, volume...

[14]

Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, and Mor Geva. 2025. https://doi.org/10.18653/V1/2025.EMNLP-MAIN.985 Intrinsic test of unlearning using parametric knowledge traces. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 19513–19535. Association fo...

[15]

Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, and Haiqin Yang. 2024. https://doi.org/10.18653/V1/2024.EMNLP-MAIN.228 Dissecting fine-tuning unlearning in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 3933–3941. Associat...

[16]

Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. 2023. https://doi.org/10.18653/V1/2023.ACL-LONG.805 Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toron...

[17]

Yurim Jang, Jaeung Lee, Dohyun Kim, Jaemin Jo, and Simon S. Woo. 2026. https://arxiv.org/abs/2602.18505 Suppression or deletion: A restoration-based representation-level analysis of machine unlearning. CoRR, abs/2602.18505

[18]

Zhuoran Jin, Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, and Jun Zhao. 2024. https://proceedings.neurips.cc/paper_files/paper/2024/hash/b1f78dfc9ca0156498241012aec4efa0-Abstract-Datasets_and_Benchmarks_Track.html RWKU: Benchmarking real-world knowledge unlearning for large language models. In Advances in Neural ...

[19]

James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. https://doi.org/10.1073/pnas.1611835114 Overcoming catastrophic forgetting in neural networks. Proceedings...

[20]

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey E. Hinton. 2019. http://proceedings.mlr.press/v97/kornblith19a.html Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learn...

[21]

Jaeung Lee, Suhyeon Yu, Yurim Jang, Simon S. Woo, and Jaemin Jo. 2026. https://doi.org/10.1109/TVCG.2026.3658325 Unlearning comparator: A visual analytics system for comparative evaluation of machine unlearning methods. IEEE Transactions on Visualization and Computer Graphics, 32(3):2852–2867

[22]

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, and 27 others. 2024. https://proceedings.mlr.press/v235...

[23]

Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. 2024. https://doi.org/10.48550/ARXIV.2402.16835 Eight methods to evaluate robust unlearning in LLMs. CoRR, abs/2402.16835

[24]

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J. Zico Kolter. 2024. https://openreview.net/forum?id=B41hNBoWLo TOFU: A task of fictitious unlearning for LLMs. In First Conference on Language Modeling, COLM 2024

[25]

Anmol Reddy Mekala, Vineeth Dorna, Shreya Dubey, Abhishek Lalwani, David Koleczek, Mukund Rungta, Sadid A. Hasan, and Elita A. Lobo. 2025. https://aclanthology.org/2025.coling-main.252/ Alternate preference optimization for unlearning factual knowledge in large language models. In Proceedings of the 31st International Conference on Computational Linguist...

[26]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. http://papers.nips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 20...

[27]

nostalgebraist. 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens Interpreting GPT: the logit lens. LessWrong

[28]

Vaidehi Patil, Peter Hase, and Mohit Bansal. 2024. https://openreview.net/forum?id=7erlRDoaV8 Can sensitive information be deleted from LLMs? Objectives for defending against extraction attacks. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net

[29]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Syst...

[30]

Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, and Stephan Günnemann. 2024. https://arxiv.org/abs/2411.02631 Extracting unlearned information from LLMs with activation steering. In NeurIPS 2024 Workshop on Safe Generative AI

[31]

Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. 2024. https://openreview.net/forum?id=zWqr3MQuNs Detecting pretraining data from large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net

[32]

Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, and Chiyuan Zhang. 2025. https://openreview.net/forum?id=TArmA033BU MUSE: Machine unlearning six-way evaluation for language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025

[33]

Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. 2022. https://papers.nips.cc/paper_files/paper/2022/hash/fa0509f4dab6807e2cb465715bf2d249-Abstract-Conference.html Memorization without overfitting: Analyzing the training dynamics of large language models. In Advances in Neural Information Processing Systems 35: Annual Conferenc...

[34]

Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, and Haibo Hu. 2025. https://arxiv.org/abs/2505.16831 Unlearning isn't deletion: Investigating reversibility of machine unlearning in LLMs. CoRR, abs/2505.16831

[35]

Yuanshun Yao, Xiaojun Xu, and Yang Liu. 2024. http://papers.nips.cc/paper_files/paper/2024/hash/be52acf6bccf4a8c0a90fe2f5cfcead3-Abstract-Conference.html Large language model unlearning. In Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 1...

[36]

Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. 2018. https://doi.org/10.1109/CSF.2018.00027 Privacy risk in machine learning: Analyzing the connection to overfitting. In 31st IEEE Computer Security Foundations Symposium, CSF 2018, Oxford, United Kingdom, July 9-12, 2018, pages 268–282. IEEE Computer Society

[37]

Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Frank Yang, and Hai Li. 2025. https://openreview.net/forum?id=ZGkfoufDaU Min-K\ In The Thirteenth International Conference on Learning Representations, ICLR 2025

[38]

Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. 2024. https://openreview.net/forum?id=MXLBXjQkmb Negative preference optimization: From catastrophic collapse to effective unlearning. In First Conference on Language Modeling, COLM 2024
