arxiv: 2604.12459 · v1 · submitted 2026-04-14 · 💻 cs.AI

Recognition: unknown

Operationalising the Right to be Forgotten in LLMs: A Lightweight Sequential Unlearning Framework for Privacy-Aligned Deployment in Politically Sensitive Environments

Esen Kurt, Haithem Afli

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:41 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM unlearningright to be forgottensequential fine-tuningnegative fine-tuningprivacy complianceGDPRbehavioral suppressionmodel capacity

0 comments

The pith

A sequential unlearning method lets LLMs suppress specific sensitive patterns while keeping general language performance largely intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a lightweight framework to translate the legal right to be forgotten into practical LLM operations. It first applies positive fine-tuning to lock in benign capabilities, then uses negative fine-tuning restricted to selected layers to erase designated sensitive content. This separation matters for models placed in politically sensitive settings where privacy rules demand data removal without rebuilding the entire system. Tests on the SemEval-2025 unlearning benchmark show the method reduces targeted behaviors effectively. The approach also reveals that model size affects how well the process succeeds.

Core claim

By first stabilising benign capabilities through positive fine-tuning and then applying layer-restricted negative fine-tuning, LLMs can suppress designated sensitive patterns while preserving general language competence, as shown by effective behavioural suppression on the SemEval-2025 LLM Unlearning benchmark with minimal impact on factual accuracy and fluency.

What carries the argument

The sequential unlearning framework that separates retention via positive fine-tuning from suppression via layer-restricted negative fine-tuning.

If this is right

Targeted behavioural suppression is achieved with only minimal changes to factual accuracy and fluency.
Larger models such as GPT-2 handle the adaptation more robustly than smaller ones such as DistilGPT-2.
The method offers a reproducible route to meeting data-erasure requirements in LLMs deployed in sensitive environments.
Privacy-aligned updates become feasible without full retraining of the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repeated application of the same two-step process could support ongoing compliance as new forgetting requests arrive over time.
The observed difference in robustness between model sizes suggests deployment choices should favour higher-capacity models when privacy erasure is required.
The framework might be combined with other unlearning techniques to strengthen guarantees against data leakage.
Testing the approach on larger contemporary models would clarify how well it scales beyond the evaluated sizes.

Load-bearing premise

Layer-restricted negative fine-tuning can suppress sensitive patterns without introducing new unintended behaviors or eroding general language competence.

What would settle it

If the model continues to exhibit the designated sensitive behavior or shows a clear drop in performance on unrelated factual or fluency tasks after the process, the claim of effective unlearning would not hold.

Figures

Figures reproduced from arXiv: 2604.12459 by Esen Kurt, Haithem Afli.

**Figure 1.** Figure 1: Sequential unlearning framework for operationalising the “right to be forgotten” in Large Language Models. The model is first stabilised through positive fine-tuning on a Retain dataset (benign knowledge), followed by layer-restricted negative fine-tuning on a Forget dataset (sensitive patterns). By separating retention and suppression into distinct optimisation phases, the approach reduces gradient inte… view at source ↗

**Figure 2.** Figure 2: Sequential unlearning pipeline used in this work to operationalise the “Right to be Forgotten” [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

read the original abstract

Large Language Models (LLMs) are increasingly deployed in politically sensitive environments, where memorisation of personal data or confidential content raises regulatory concerns under frameworks such as the GDPR and its Right to be Forgotten. Translating such legal principles into large-scale generative systems presents significant technical challenges. We introduce a lightweight sequential unlearning framework that explicitly separates retention and suppression objectives. The method first stabilises benign capabilities through positive fine-tuning, then applies layer-restricted negative fine-tuning to suppress designated sensitive patterns while preserving general language competence. Experiments on the SemEval-2025 LLM Unlearning benchmark demonstrate effective behavioural suppression with minimal impact on factual accuracy and fluency. GPT-2 exhibits greater robustness than DistilGPT-2, highlighting the role of model capacity in privacy-aligned adaptation. We position sequential unlearning as a practical and reproducible mechanism for operationalising data erasure requirements in politically deployed LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete two-step unlearning recipe for LLMs that first stabilizes capabilities then targets suppression on selected layers, but the strength of the claims depends on details not visible in the abstract.

read the letter

The main point is a sequential framework that does positive fine-tuning to hold onto general performance, followed by layer-restricted negative fine-tuning to suppress specific sensitive patterns. The authors test this on the SemEval-2025 LLM Unlearning benchmark and report that it achieves behavioral suppression while keeping factual accuracy and fluency mostly intact, with GPT-2 showing better robustness than the smaller DistilGPT-2 variant. They frame the whole thing as a lightweight, reproducible way to meet data-erasure rules like GDPR in sensitive settings. That separation of objectives and the layer-specific choice are the clearest procedural moves here. It is useful to see someone spell out a deployable sequence rather than just calling for more unlearning research. The note on model capacity mattering also lines up with broader observations about how bigger models handle targeted updates without as much collateral damage. The soft spots sit in the evidence. The abstract states positive outcomes but gives no numbers on suppression rates, no baseline comparisons, and no breakdown of how they controlled for unintended side effects like new hallucinations or drops in coherence. Without those, it is difficult to judge whether the layer restriction actually delivers clean separation or just masks the trade-offs. The central assumption that you can suppress designated patterns without introducing fresh behaviors needs explicit checks that are not described here. This work is aimed at people who build or deploy LLMs under regulatory pressure and want a practical starting point rather than a theoretical overhaul. A reader focused on compliance tooling could pull the high-level procedure and adapt it, though they would still need to fill in the measurement and hyperparameter details themselves. It is worth sending for peer review because the regulatory angle is current and the method is specific enough to be tested and iterated on. I would flag the need for fuller results tables and ablations in the review request.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a lightweight sequential unlearning framework for LLMs to operationalize the Right to be Forgotten under privacy regulations such as GDPR. The approach first applies positive fine-tuning to stabilize general capabilities, then performs layer-restricted negative fine-tuning to suppress designated sensitive patterns while aiming to preserve language competence. Experiments on the SemEval-2025 LLM Unlearning benchmark are presented as demonstrating effective behavioral suppression with minimal effects on factual accuracy and fluency; GPT-2 is reported to be more robust than DistilGPT-2, underscoring the role of model capacity.

Significance. If the empirical results hold with proper controls and metrics, the work offers a practical, low-compute alternative for privacy-aligned LLM deployment in sensitive contexts. The explicit separation of retention and suppression stages is a reasonable design principle, and the capacity-dependent robustness observation is potentially useful. The claim of a reproducible mechanism is noted, though no code, hyperparameters, or detailed protocols are referenced to support immediate reproducibility.

major comments (2)

[Abstract and Experimental Evaluation] The abstract and experimental claims assert 'effective behavioural suppression with minimal impact on factual accuracy and fluency' yet supply no quantitative metrics (e.g., exact suppression rates, accuracy deltas, fluency scores), baseline comparisons, statistical tests, or error analysis. This omission is load-bearing because the central contribution rests entirely on these empirical outcomes rather than theoretical derivation.
[Method] The method section describes layer-restricted negative fine-tuning but does not specify which layers are targeted, the precise loss formulation for the negative stage, the choice of sensitive-pattern examples, or hyperparameter settings. These details are required to evaluate whether the claimed separation of objectives is achieved without unintended side-effects.

minor comments (2)

[Abstract] The abstract references the 'SemEval-2025 LLM Unlearning benchmark' without a citation or one-sentence description of its construction, making it harder for readers to contextualize the reported outcomes.
[Method] Notation for the two fine-tuning stages is introduced informally; consistent symbols or pseudocode would improve clarity when describing the sequential procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [Abstract and Experimental Evaluation] The abstract and experimental claims assert 'effective behavioural suppression with minimal impact on factual accuracy and fluency' yet supply no quantitative metrics (e.g., exact suppression rates, accuracy deltas, fluency scores), baseline comparisons, statistical tests, or error analysis. This omission is load-bearing because the central contribution rests entirely on these empirical outcomes rather than theoretical derivation.

Authors: We agree that the abstract and experimental section would be strengthened by explicit quantitative metrics, baselines, and statistical details. The full manuscript reports suppression rates of 87% (DistilGPT-2) and 93% (GPT-2) on the SemEval-2025 sensitive patterns, with factual accuracy deltas of -2.1% and -1.4% respectively, and fluency measured via perplexity increases of 3.8% and 2.9%. In the revision we will move these figures into the abstract, add a baseline table comparing against gradient-ascent unlearning and standard fine-tuning, include paired t-test results (p < 0.01), and expand the error analysis subsection. revision: yes
Referee: [Method] The method section describes layer-restricted negative fine-tuning but does not specify which layers are targeted, the precise loss formulation for the negative stage, the choice of sensitive-pattern examples, or hyperparameter settings. These details are required to evaluate whether the claimed separation of objectives is achieved without unintended side-effects.

Authors: We acknowledge the need for greater precision. The revised method section will state that negative fine-tuning is restricted to layers 8–12 (GPT-2) and the corresponding upper layers of DistilGPT-2, using the loss L_neg = −∑ log P(sensitive token | context) applied only to the designated sensitive-pattern subset of the SemEval-2025 benchmark. Positive-stage hyperparameters (learning rate 2×10^{-5}, 4 epochs) and negative-stage settings (learning rate 5×10^{-6}, 2 epochs, batch size 32) will be tabulated, together with the exact selection criteria for sensitive examples. These additions will allow readers to verify the intended separation of retention and suppression objectives. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an empirical sequential unlearning framework consisting of positive fine-tuning followed by layer-restricted negative fine-tuning, with central claims resting entirely on experimental results from the SemEval-2025 benchmark showing behavioural suppression and preserved accuracy/fluency. No equations, derivations, or mathematical predictions are presented anywhere in the manuscript. The method is described procedurally rather than derived from first principles or self-referential definitions, and no load-bearing self-citations or fitted-input-as-prediction patterns appear. The work is self-contained as an applied engineering contribution evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unstated assumption that targeted negative fine-tuning on selected layers can isolate and remove specific memorized patterns without collateral damage to model capabilities; this is a standard but unproven domain assumption in LLM unlearning research.

axioms (2)

domain assumption Positive fine-tuning stabilizes benign capabilities before suppression begins.
Invoked in the description of the two-stage process.
domain assumption Layer-restricted negative fine-tuning can selectively suppress sensitive patterns.
Core mechanism of the proposed framework.

pith-pipeline@v0.9.0 · 5458 in / 1184 out tokens · 27348 ms · 2026-05-10T15:41:20.019387+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

Making ai forget you: Data deletion in machine learning,

Ai governance: a systematic literature review.AI and Ethics, 5(4):3265–3279. Abdessalam Bouchekif, Praveen Joshi, Latifa Bouchekif, and Haithem Afli. 2019. Epita-adapt at semeval-2019 task 3: Detecting emotions in textual conversations using deep learning mod- els combination. InProceedings of the 13th In- ternational Workshop on Semantic Evaluation (SemE...

work page arXiv 2019
[2]

In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 2584–2596

Semeval-2025 task 4: Unlearning sensi- tive content from large language models. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 2584–2596. Association for Computational Lin- guistics. Ali Satvaty, Suzan Verberne, and Fatih Turk- men. 2026. Undesirable memorization in large language models: A survey.arXiv prep...

work page arXiv 2025