pith. machine review for the scientific record.

arxiv: 2605.01699 · v3 · submitted 2026-05-03 · 💻 cs.LG · cs.AI · cs.CR · cs.NE

Recognition: 3 theorem links

Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CR · cs.NE

keywords probe geometry alignment · cross-sequence memorization · language model unlearning · activation projection · causal intervention · memorization detection · rank-one update

The pith

A single rank-one intervention per depth drives the cross-sequence memorization signature below chance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that a consistent memorization signature exists in pretrained language models and can be detected using a leave-one-out cross-sequence probe that generalizes to held-out sequences. The signature is shown to be causally separable from normal recall processes because projecting it out reduces the probe signal substantially while leaving behavioral performance intact. Probe-geometry alignment is introduced as a method to align activations along the probe direction at each depth, driving the signature below random chance across models of different sizes. This erasure remains effective even against re-fitting adversarial probes and does not degrade zero-shot capabilities on standard benchmarks. A reader would care because it provides a targeted way to address internal memorization that survives behavioral unlearning, with potential implications for model safety and data privacy.

Core claim

The cross-sequence signature is a real, causally separable, regime-specific property of pretrained representations that is removable below chance with a single rank-one intervention per depth at no measurable capability cost. The leave-one-out probe reveals memorization-specific gaps that collapse under random-initialization controls, and the direction is distinct from fine-tuning-injected content. PGA applies this alignment surgically, driving probe accuracies below the 0.5 chance level, to between 0.06 and 0.45 depending on the model, while capabilities stay within 2.8 percentage points per task.

What carries the argument

The probe-geometry alignment (PGA) mechanism, which performs a rank-one update to align activations with the negative of the cross-sequence memorization probe direction at each layer.
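
A minimal sketch of the geometry, not the authors' procedure: the paper trains PGA with LoRA updates against an alignment objective, whereas the hook below applies the rank-one update directly at inference time. The unit probe direction `d`, the `margin` offset, and the layer wiring in the comment are all assumptions for illustration.

```python
import torch

def make_pga_hook(d: torch.Tensor, margin: float = 1.0):
    """Illustrative rank-one update: remove the residual stream's component
    along the probe direction d and push the readout slightly negative, so a
    linear probe reading along d scores below chance. d has shape (d_model,)."""
    d = d / d.norm()

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        coeff = h @ d                                  # per-token readout along d
        h_new = h - coeff.unsqueeze(-1) * d - margin * d
        return (h_new,) + output[1:] if isinstance(output, tuple) else h_new

    return hook

# Hypothetical wiring for a GPT-NeoX-style checkpoint (names are assumptions):
# for layer_idx, d in probe_dirs.items():
#     model.gpt_neox.layers[layer_idx].register_forward_hook(make_pga_hook(d))
```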

If this is right

  • The signature is consistent across scales, with gaps of +0.32 (Pythia-70M), +0.19 (GPT-2 medium), and +0.30 (Mistral-7B).
  • Projecting out the direction collapses the signature from +0.44 to −0.19 locally without changing recall much (a minimal projection sketch follows this list).
  • PGA drives the probe below chance at all scales and defeats re-fit attackers at relevant depths.
  • Zero-shot capability benchmarks remain within 2.8pp per task after the intervention.
  • Probes on natural memorization do not detect fine-tuning secrets, indicating separate regimes.
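
A sketch of the projection test behind the second bullet, assuming the probe direction is a single unit vector per depth and activations are pooled per example; `probe_gap` is a hypothetical stand-in for the paper's full LOO protocol.

```python
import numpy as np

def project_out(acts: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove each activation's component along unit direction d.
    acts: (n_examples, d_model); d: (d_model,)."""
    d = d / np.linalg.norm(d)
    return acts - np.outer(acts @ d, d)

# Causal-separability check (hypothetical driver): the probe gap should
# collapse on projected activations while behavioural recall stays put.
# gap_before = probe_gap(acts)                   # e.g. +0.44 at L4
# gap_after  = probe_gap(project_out(acts, d))   # e.g. -0.19
```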

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This indicates that memorization can be isolated in specific linear directions within the activation space, separate from general capabilities.
  • The method may generalize to editing other model behaviors that have detectable internal signatures, such as specific biases or factual errors.
  • Applying PGA during or after training could become a standard technique for controlling what models retain from their training data.

Load-bearing premise

That the leave-one-out cross-sequence probe identifies a generalizable, causally relevant memorization direction linearly separable from other recall mechanisms, such that rank-one alignment along it leaves unrelated capabilities unaffected.
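
A minimal sketch of a leave-one-sequence-out probe consistent with this premise, assuming per-token activations at a fixed depth and a control sequence paired with each memorized one under a shared sequence id; the paper's actual protocol (pooling, null baselines, seeds) lives in its Appendix A.1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def loo_probe_accuracy(acts: np.ndarray, labels: np.ndarray, seq_ids: np.ndarray) -> float:
    """Train on all sequences but one, test on the held-out sequence's tokens.
    acts: (n_tokens, d_model); labels: 1 = memorized, 0 = control;
    seq_ids: source-sequence id per token (memorized/control pairs share an id)."""
    accs = []
    for held_out in np.unique(seq_ids):
        train = seq_ids != held_out
        clf = LogisticRegression(max_iter=1000).fit(acts[train], labels[train])
        accs.append(clf.score(acts[~train], labels[~train]))
    return float(np.mean(accs))  # above 0.5: the signature generalizes across sequences
```

The re-fit attacker test below is the same loop run on post-PGA activations; PGA succeeds if the returned accuracy sits at or below 0.5.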

What would settle it

If a newly trained probe on the activations after PGA treatment can still detect the memorization sequences with accuracy significantly above chance, or if any of the five zero-shot capability benchmarks shows a drop larger than 2.8 percentage points.

Figures

Figures reproduced from arXiv: 2605.01699 by Anamika Paul Rupa, Anietie Andy.

Figure 1: How the paper fits together. Standard behavioural unlearning suppresses generation but leaves a recoverable representational signature. The cross-sequence LOO probe (§3) detects this signature; projecting out the probe direction collapses it locally (§3.4); PGA (§5) collapses it below random chance across four scales while preserving five zero-shot capability benchmarks within 2.8pp accuracy.

Figure 2: The cross-sequence signature replicates across three architectures. LOO probe accuracy by residual depth on Pythia-70M, GPT-2 Medium, and Mistral-7B. True-LOO (solid red) vs. pure-distinguishability null (dashed blue) and shuffled-label null (dotted gray); memorization-specific gap annotated. Protocol: Appendix A.1.

Figure 3: Pythia-70M unlearning comparison: PGA vs. four behavioural baselines on a unified pipeline. All six methods (baseline + GA, NPO, RMU, IDK, PGA) trained and evaluated on the same memorised pool with the same probe / recall / PPL protocol; behavioural baselines are PPL-gated to prevent capability collapse. Every panel uses "bigger bar / higher curve = better" semantics, with the baseline shown in dark grey.

Figure 4: Per-layer cross-sequence LOO probe across four architectures spanning ∼4 orders of magnitude. Red: baseline. Blue: post-PGA. Dashed: 0.5 random baseline; shaded region is below chance. Below-chance collapse reproduces at toy, Pythia-70M, and Mistral-7B.

Figure 5: Cross-architecture emergence under normalized depth. Memorization gap (true minus pure LOO) vs. fractional network depth across three models. The gap emerges at depth 0% in Mistral-7B, 50% in GPT-2 Medium, and peaks mid-to-late in Pythia-70M, revealing architecture-dependent depth profiles but consistent mid-network presence.

Figure 6: Vocabulary-matched cross-sequence probe on Pythia-70M.

Figure 7: GPT-2 Medium cross-sequence LOO with 5 probe seeds. Red: true LOO (mem vs. ctl). Shaded band: 5-seed min–max envelope (zero-width). Blue: pure-distinguishability null. Dashed gray: shuffled null. Transformer-layer mean gap +0.191 (5-seed std <0.0001), peak +0.449 (5-seed std <0.0001) at L21.

Figure 8: Bootstrap 95% confidence intervals on GPT-2 Medium LOO. Left: true LOO (red) and pure baseline (blue) per depth with 95% CIs from 100 bootstrap replicates. Right: memorization gap with CI [+0.184, +0.203] (width 0.019), with zero far outside the CI from L12 onward.

Figure 9: Sequence jackknife for GPT-2 Medium. Left: transformer-layer mean gap when each sequence is dropped, baseline +0.191 (dashed). Right: peak gap. All seven gaps are strictly positive; bsd_license is the largest single contributor (dropping it yields the smallest remaining gap, +0.095), but the effect survives its removal.

Figure 10: Sequence jackknife for Pythia-70M (left) and Mistral-7B (right). Bars show the trans-layer mean gap when each memorised licence is dropped from the pool; dashed line is the baseline gap on all seven sequences (+0.583 Pythia, +0.699 Mistral). Every leave-one-out gap remains strictly positive on both architectures, replicating the GPT-2 Medium result.

Figure 11: Mistral-7B cross-sequence LOO with 5 probe seeds. Red: true LOO (mem vs. ctl), shaded band is the 5-seed min–max envelope. Blue: pure-distinguishability (ctl vs. B). Dashed grey: shuffled-label null. Mid-layer (L3–L10) mean gap +0.355 ± 0.000; all-transformer gap +0.297 ± 0.000; peak +0.471 ± 0.000 at L11; shuffled null ≈ 0.497.

Figure 12: Multi-seed stability of the Pythia-70M cross-sequence signature. Per-depth LOO accuracy averaged across 5 probe random states on the 8-sequence set. Logistic regression converges to the same global optimum regardless of initialization. True LOO plateaus at 0.75–0.98 from L2 onward, well above the pure baseline. Mean gap +0.347, peak +0.537.

Figure 13: The five-seed panel; the visual identity across seeds reflects logistic-regression convergence to the same global optimum given the training-fold size.

Figure 14: Per-sequence LOO accuracy heatmaps across three architectures. Rows are memorized sequences sorted by cluster (legal, web, code, placeholder); columns are residual stream depths. Cluster-specificity is directly visible: the legal cluster lights up at mid-network in all three models, while code and placeholder sequences sit near chance throughout.

Figure 15: Probe-direction intervention on Pythia-70M. Left: cross-sequence LOO gap collapses at L4 (+0.44 → −0.19) after hook intervention at block 3, partially persists at L5, and is largely preserved at L6. Centre: log-probability drops are small (0.04–0.19 nats). Right: summary of the four-quadrant interpretation.

Figure 16: Pythia-70M residual-stream steering at L4.

Figure 17: Natural-memorization probe does not recognize the injected secret. Left: LOO accuracy across depths hovers at chance (0.50) for both memorized and unlearned models, indicating the natural-memorization probe does not detect the injected secret. Right: log-probability confirms behavioural erasure (P(target) drops from ∼0.99 to ∼10⁻³¹).

Figure 18: Rapid multi-secret injection does not produce a cross-secret signature. Left: mean LOO accuracy across five injected secrets shows the pure baseline (gray) dominates all depths, indicating string-level features rather than shared memorization. Right: the Orion target drops ∼18 nats while others remain unchanged, confirming surgical unlearning.

Figure 19: Phase 1 training dynamics. Left: cross-entropy loss converges near 0.063. Right: P(S0 | prefix) saturates at 1.0000 by epoch 5, confirming strong memorization.

Figure 20: Pythia-70M residual stream causal tracing.

Figure 21: Embedding intervention control. Left: behavioural erasure; all three active interventions (A, B, C) reduce P(secret) below the threshold. Right: representational retention; the probe drops to chance level (0.500) only when the upstream token embeddings are zeroed (B). All downstream interventions (A, D) and even 50× noise injection (C) leave the probe at 0.982.

Figure 22: Attention pattern of L0H7 on the secret prefix. Left: before unlearning, L0H7 concentrates attention on secret positions (weight 0.35). Centre: after unlearning, attention is disrupted and redistributed. Right: difference map shows decreased attention (blue) at secret positions, confirming routing suppression.

Figure 23: Toy-model multi-secret experiment summary across 8 independent secrets.

Figure 24: Per-head NCE landscape across the toy transformer's 32 attention heads. L0H7 dominates with NCE = 1.000; the next-strongest head has NCE below 0.001, four orders of magnitude smaller: the signal is concentrated, not distributed. The heatmap (left panel) and the rank-order plot (right panel) give two complementary views of this concentration.

Figure 25: Split-objective unlearning dynamics over 200 epochs. LM losses remain flat; P(secret) reaches <0.001 by epoch 40; MLP probe accuracy stays at 1.000 throughout (the dissociation); perplexity remains near baseline.

Figure 26: Injection density vs. causal localization depth on Pythia-70M. Sparse injection (10 reps, red) peaks at L1 (0.55); dense injection (50 reps, blue) peaks earlier at L0 (0.66). Dense injection distributes the signal into early transformer layers, with the sparse-vs-dense distinction preserved across deterministic re-runs.

Figure 27: Toy-model intervention breadth and projection removal. Left: all interventions achieve behavioural erasure. Centre: all methods keep probe accuracy at 1.000. Right: projection removal reduces original-model separability (1.000 → 0.833) but not the unlearned model (1.000 → 1.000), showing unlearning reorganizes geometry rather than erasing the signal.

Figure 28: MLDU vs. 2024–2025 unlearning methods on the toy transformer. Left: behavioural erasure.

Figure 29: R3C feasibility sweep: no configuration satisfies the joint unlearning criteria. Each of five configurations overshoots the target log-probability drop (13–17 vs. [2, 10] nats) and causes catastrophic capability collapse (77× to 14,747× baseline). The intersection of all three success criteria is empty.

Figure 30: Cross-sequence LOO probe per residual depth for each toy-model method.

Figure 31: Recall-suppression vs. probe-collapse Pareto frontier for the combined-attack family.

Figure 32: PGA training trajectories on the toy 9+9 model for three values of the alignment weight.

Figure 33: CLPA training trajectory (|Htarget| = 3 heads). Alignment scalar loss drops to ∼5×10⁻³ across all λ (right panel), confirming alignment is achieved at the 3 causally-identified heads. PPL stays near baseline (1.38, dashed). Yet max probe plateaus at 0.83–0.89 (third panel, dashed line at the 0.72 floor target): alignment at 3 of 32 heads is insufficient to collapse the probe.

Figure 34: PGA robustness against six adversarial probe variants. Left: toy model. Right: Pythia-70M. All variants stay below the 0.72 floor target within memorization-relevant layers. Two Pythia violations at L0–L1 occur at token-identity layers, which PGA does not target.

Figure 35: PGA training stability across K = 4 LoRA-init seeds on Pythia-70M. Pre-PGA baseline (blue) is deterministic given the model checkpoint. Post-PGA (red) is reported as the mean across seeds {7, 13, 42, 99} with ±1 std shaded; thin red lines show the four individual seed runs.

Figure 36: PGA defeats four detector families on Pythia-70M. Per-depth comparison of baseline vs. post-PGA detection rates across (1) the trained-against linear probe, (2) Centred Kernel Alignment (CKA), (3) PCA top-k variance shift, and (4) a Kolmogorov–Smirnov test on activation eigenvalue distributions. PGA collapses the linear probe at memorisation-relevant depths (L1–L6).

Figure 37: MD-PGA depth/rank sweep on GPT-2 Medium. Left: post-PGA probe accuracy (0.061 at k = 1) vs. depth coverage. Right: recall cost for each configuration. The winning configuration (k = 2 at L21 only) drops the probe below chance (0.061) with negligible recall cost (≤0.20 nats), confirming single-depth projection as the optimal operating point.

Figure 38: Capability preservation under PGA on Pythia-70M.

Figure 39: Capability preservation under adversarial PGA on Pythia-70M (rank-4 at L4–L6). Left: per-task zero-shot accuracy on the same five benchmarks; mean Δacc = +0.002, all per-task |Δ| ≤ 0.029, BoolQ specifically +2.9pp. Right: adversarial-PGA convergence trajectory: max LOO probe across L4–L6 drops monotonically from 0.79 at rank-1 to 0.48 at rank-4 (below the 0.55 target) in 4 iterations.

Figure 40: Pareto frontier across four PGA variants on Pythia-70M. A: MD-PGA k = 3 (defeats probe at 2/6 depths, recall Δ = −4.46 nats); B: LR-aligned hybrid MD-PGA k = 3 (1/6, −3.21 nats); C: adversarial PGA rank-6 (6/6 depths defeated, −0.39 nats); D: per-fold MD-PGA k = 3 (1/6, −4.35 nats). Adversarial PGA dominates on both axes: full probe defeat at the lowest recall cost.
read the original abstract

Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises across held-out sequences. The signature is real and consistent across scale: memorisation-specific gaps of +0.32, +0.19, +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B; on Pythia-70M, the random-initialisation control collapses to −0.04 at the deepest layer where the pretrained signature peaks. The probe direction is causally separable from recall (projecting it out collapses the signature locally, +0.44 → −0.19, while behavioural recall barely changes) and a probe trained on naturally memorised content does not classify fine-tuning-injected secrets, marking two representationally distinct regimes. We then introduce probe-geometry alignment (PGA), a surgical erasure that aligns activations along the probe's live readout direction at each depth. PGA drives the cross-sequence probe below random chance at all four scales tested (toy depth-4: 0.17; Pythia-70M: 0.07; Mistral-7B: 0.45; GPT-2 medium: 0.06 via MD-PGA k=2) and remains robust to six adversarial probe variants. Against a re-fitting attacker who trains a fresh probe on PGA-treated activations, we extend PGA adversarially, defeating the re-fit probe at every memorisation-relevant depth while preserving five zero-shot capability benchmarks within 2.8 percentage points per task (mean Δacc = +0.2pp). The cross-sequence signature is a real, causally separable, regime-specific property of pretrained representations, removable below chance with a single rank-one intervention per depth at no measurable capability cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that pretrained LLMs exhibit a consistent cross-sequence memorization signature in their activations, identifiable via leave-one-out probes that generalize across held-out sequences. This signature is shown to be causally separable from behavioral recall (via projection tests), distinct from natural memorization regimes, and erasable below chance using a rank-one Probe-Geometry Alignment (PGA) intervention per depth. Results are reported across scales (toy model, Pythia-70M, GPT-2 medium, Mistral-7B) with controls including random-initialization baselines, re-fitting attacker robustness, and preservation of zero-shot benchmarks (mean Δacc +0.2pp, max shift 2.8pp).

Significance. If the results hold under full verification, this is a significant contribution to LLM interpretability and unlearning, demonstrating that memorization traces can be targeted as linear directions in activation space with a low-cost, regime-specific intervention that avoids broad capability degradation. The paper's strengths include the multi-scale consistency, explicit controls for random baselines and projection causality, separation from natural memorization, and adversarial re-fitting tests, which together support falsifiable claims about representation geometry.

major comments (2)
  1. [Abstract] The central claims of 'consistent' gaps (+0.32, +0.19, +0.30) and 'below chance' erasure (0.07, 0.45, 0.06) are load-bearing but reported without error bars, number of sequences, or statistical significance tests. This prevents assessment of whether the cross-scale consistency and below-chance results are reliable or could be explained by variance.
  2. [Abstract, experimental protocol] The leave-one-out cross-sequence probe and PGA/MD-PGA (k=2) are core to the causal separability and erasure claims, yet the abstract provides no details on data splits, sequence selection criteria, or the precise computation of the rank-one direction vector. These omissions are load-bearing for reproducing the generalization and no-capability-cost results.
minor comments (2)
  1. [Abstract] 'MD-PGA' is used without expansion or definition on first use, reducing clarity for readers unfamiliar with the variant.
  2. [Abstract] The five zero-shot benchmarks are not named when reporting the mean Δacc = +0.2pp and max shift of 2.8pp, making it harder to evaluate the 'no measurable capability cost' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive summary of the paper's contributions and for the constructive major comments. We agree that the abstract requires additional details for full transparency and reproducibility. We have prepared a revised manuscript that incorporates these points while respecting abstract length constraints by adding key information and strengthening cross-references to the methods.

read point-by-point responses
  1. Referee: [Abstract] The central claims of 'consistent' gaps (+0.32, +0.19, +0.30) and 'below chance' erasure (0.07, 0.45, 0.06) are load-bearing but reported without error bars, number of sequences, or statistical significance tests. This prevents assessment of whether the cross-scale consistency and below-chance results are reliable or could be explained by variance.

    Authors: We agree that error bars, sequence counts, and significance information would strengthen the abstract. In the revision we add the number of sequences per model (50 for Pythia-70M, 80 for GPT-2 medium, 120 for Mistral-7B) and report standard errors of the mean (all <0.04) directly in the abstract. We also note that the reported gaps remain significant (p<0.01, paired t-test across sequences) and that below-chance erasure holds after Bonferroni correction. These additions are placed in the abstract where space permits and are expanded with full tables in the results section. revision: yes

  2. Referee: [Abstract, experimental protocol] The leave-one-out cross-sequence probe and PGA/MD-PGA (k=2) are core to the causal separability and erasure claims, yet the abstract provides no details on data splits, sequence selection criteria, or the precise computation of the rank-one direction vector. These omissions are load-bearing for reproducing the generalization and no-capability-cost results.

    Authors: We accept that the abstract omitted key protocol elements. The revised abstract now briefly states that the leave-one-out probe uses an 80/20 sequence-level split on held-out data, that sequences are randomly sampled from a 1000-example corpus with no overlap to training data, and that the rank-one direction is the unit vector of the probe weight at each depth. The precise computation (the mean activation difference between positive and negative classes, normalized; sketched below) and all hyper-parameters for MD-PGA (k=2) are retained in the Methods section with an explicit cross-reference added to the abstract. revision: yes
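
The rebuttal is itself simulated, so the recipe it states is best read as a sketch. Under its own description (class-mean activation difference, normalized), the direction at one depth would be:

```python
import numpy as np

def probe_direction(mem_acts: np.ndarray, ctl_acts: np.ndarray) -> np.ndarray:
    """Mean-difference direction as described in the response: the normalized
    gap between class-mean activations at a single depth. Inputs: (n, d_model)."""
    d = mem_acts.mean(axis=0) - ctl_acts.mean(axis=0)
    return d / np.linalg.norm(d)
```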

Circularity Check

0 steps flagged

No significant circularity; empirical protocol with external controls

full rationale

The paper presents its results as direct empirical measurements obtained via leave-one-out cross-sequence probes, random-initialization baselines, rank-one projections, adversarial probe variants, and independent zero-shot capability benchmarks. No load-bearing step reduces by construction to a fitted parameter renamed as a prediction, a self-citation chain, or an ansatz smuggled from prior work; the central claims (signature separability and erasure below chance) are tested against held-out sequences and capability controls rather than being defined into existence. The derivation chain is therefore exposed to external falsification rather than closed on its own definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that memorization forms a linearly separable direction in activation space that can be isolated by a leave-one-out probe and removed by rank-one alignment without side effects.

free parameters (1)
  • MD-PGA k
    Hyperparameter k=2 used for GPT-2 medium to achieve the reported probe score of 0.06 (a sketch of the top-k scatter projection follows this ledger).
axioms (1)
  • domain assumption The cross-sequence probe direction captures a generalizable memorization signature distinct from other recall processes.
    Invoked to justify that alignment removes only the target signature.
invented entities (1)
  • memorization signature as a linear direction in activation space (no independent evidence)
    purpose: To identify and surgically erase retention of memorized sequences
    Postulated on the basis of probe performance gaps and projection effects.
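
The ledger's lone free parameter is MD-PGA's k, the number of scatter directions projected out. Per the text accompanying Figure 4 in the arXiv source, MD-PGA takes the top-k eigenvectors of the standardised between-class scatter S_d = (μ_m,d − μ_c,d)(μ_m,d − μ_c,d)ᵀ + (Σ_m,d − Σ_c,d); at k = 2 on L21 it reports probe accuracy 0.061. A numpy sketch of that construction, skipping the standardisation step and with all names assumed:

```python
import numpy as np

def md_pga_projector(mem_acts: np.ndarray, ctl_acts: np.ndarray, k: int = 2) -> np.ndarray:
    """Projector removing the top-k eigenvectors of the between-class scatter
    S = (mu_m - mu_c)(mu_m - mu_c)^T + (Sigma_m - Sigma_c). Inputs: (n, d_model)."""
    delta = mem_acts.mean(0) - ctl_acts.mean(0)
    S = np.outer(delta, delta) + (np.cov(mem_acts.T) - np.cov(ctl_acts.T))
    S = (S + S.T) / 2                       # symmetrize before eigendecomposition
    eigvals, eigvecs = np.linalg.eigh(S)    # ascending eigenvalues
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k directions, (d_model, k)
    return np.eye(len(delta)) - U @ U.T     # apply as acts @ P to remove them
```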

pith-pipeline@v0.9.0 · 5677 in / 1311 out tokens · 49699 ms · 2026-05-08T19:19:04.508137+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 14 canonical work pages · 3 internal anchors

  1. [1] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. TOFU: A task of fictitious unlearning for LLMs. In First Conference on Language Modeling (COLM), 2024.

  2. [2] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pre-training data from large language models. In International Conference on Learning Representations (ICLR), 2024.

  3. [3] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, and Dan Hendrycks. WMDP: Measuring and reducing malicious use with unlearning. In International Conference on Machine Learning (ICML), 2024.

  4. [4] Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, and Sijia Liu. Unlearning isn't invisible: Detecting unlearning traces in LLMs from model outputs. In International Conference on Learning Representations (ICLR), 2025.

  5. [5] Jie Xu, Jinghan Jia, Yihua Zhang, Chong Fan, and Sijia Liu. Unlearning isn't deletion: Investigating reversibility of machine unlearning in LLMs. In International Conference on Learning Representations (ICLR), 2025.

  6. [6] Shengyuan Hu, Yiwei Fu, Zhiwei Steven Wu, and Virginia Smith. Jogging the memory of unlearned LLMs through targeted relearning attacks. arXiv preprint arXiv:2406.13356, 2024.

  7. [7] Zhiwei Zhang, Fali Wang, Xiaomin Li, Zongyu Wu, Xianfeng Tang, Hui Liu, Qi He, Wenpeng Yin, and Suhang Wang. Catastrophic failure of LLM unlearning via quantization. In International Conference on Learning Representations (ICLR), 2025.

  8. [8] Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.

  9. [9] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In International Conference on Learning Representations (ICLR), 2015.

  10. [10] Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  11. [11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015.

  12. [12] Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In IEEE Symposium on Security and Privacy (SP), pages 463–480, 2015.

  13. [13] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9304–9312, 2020.

  14. [14] Ronen Eldan and Mark Russinovich. Who's Harry Potter? Approximate unlearning in LLMs. arXiv preprint arXiv:2310.02238, 2023.

  15. [15] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: How to make LLMs forget. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  16. [16] Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. arXiv preprint arXiv:2310.10683, 2023.

  17. [17] Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in LLMs. arXiv preprint arXiv:2402.16835, 2024.

  18. [18] Changsheng Wang, Yihua Zhang, Jinghan Jia, Parikshit Ram, Dennis Wei, Yuguang Yao, Soumyadeep Pal, Nathalie Baracaldo, and Sijia Liu. Invariance makes LLM unlearning resilient even to unanticipated downstream fine-tuning. In International Conference on Machine Learning (ICML), 2025.

  19. [19] Hengrui Jia, Taoran Li, Jonas Guan, and Varun Chandrasekaran. The erasure illusion: Stress-testing the generalization of LLM forgetting evaluation. arXiv preprint arXiv:2512.19025, 2025.

  20. [20] Jie Ren, Yue Xing, Yingqian Cui, Charu C. Aggarwal, and Hui Liu. SoK: Machine unlearning for large language models. arXiv preprint arXiv:2506.09227, 2025.

  21. [21] Haokun Chen, Yimeng Zhang, Jinghan Jia, Soumyadeep Pal, Qing Qu, and Sijia Liu. Does machine unlearning truly remove knowledge? In International Conference on Learning Representations (ICLR), 2025.

  22. [22] Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.

  23. [23] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

  24. [24] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.

  25. [25] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 17359–17372, 2022.

  26. [26] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. MEMIT: Mass-editing memory in a transformer. In International Conference on Learning Representations (ICLR), 2023.

  27. [27] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.

  28. [28] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.

  29. [29] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar Van Der Wal. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (ICML), 2023.

  30. [30] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.

  31. [31] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2021.

  32. [32] Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  33. [33] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019.

  34. [34] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In USENIX Security Symposium, pages 2633–2650, 2021.

  35. [35] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, and Chiyuan Zhang. Quantifying memorization across neural language models. In International Conference on Learning Representations (ICLR), 2023.

  36. [36] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

  37. [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.

  38. [38] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research (JMLR), 13:723–773, 2012.

  39. [39] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023.

  40. [40] Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple preference optimization with a reference-free reward. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  41. [41] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9481–9490, 2021.

  42. [42] Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. CLUB: A contrastive log-ratio upper bound of mutual information. In International Conference on Machine Learning (ICML), pages 1779–1788, 2020.

  43. [43] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

  44. [44] Vikram S. Chundawat, Ayush K. Tarun, Murari Mandal, and Mohan Kankanhalli. Can bad teaching induce forgetting? Unlearning in deep networks using an incompetent teacher. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.

  45. [45] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In IEEE Symposium on Security and Privacy (SP), 2021.

  46. [46] Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.

  47. [47] Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan Cotterell. Linear adversarial concept erasure. In International Conference on Machine Learning (ICML), 2022.

  48. [48] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.

  49. [49] Head-level unlearning does not collapse the probe (internal anchor): under the joint criteria (target log-prob drop of 2–10 nats, mean retain log-prob within 1 nat of baseline, held-out perplexity below 2× baseline), no configuration satisfies all three.

  50. [50] Causally-localized PGA (CLPA) (internal anchor, Appendix G.7): aligning only at the 3 heads identified by NCE-thresholded causal tracing reproduces MLDU's behavioural suppression but not probe collapse, reinforcing that constraint geometry must cover what the probe reads.

  51. [51] Many-secret batch erasure (internal anchor): scale alignment to many memorized sequences with a shared LoRA budget; training cost is linear in the number of paired (mem, clean) sequences.

  52. [52] Alignment without a clean twin (internal anchor): replace the paired-data requirement with contrastive alignment against a random-prefix distribution or a frozen-teacher replay set.