hub

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Ruiqi Zhang, Licong Lin, Yu Bai, Song Mei · 2024 · cs.LG · arXiv 2404.05868

27 Pith papers cite this work. Polarity classification is still indexing.

27 Pith papers citing it

open full Pith review browse 27 citing papers arXiv PDF

abstract

Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model while preserving the model's utilities on other tasks. Several practical methods have recently been proposed for LLM unlearning, mostly based on gradient ascent (GA) on the loss of undesirable data. However, on certain unlearning tasks, these methods either fail to effectively unlearn the target data or suffer from catastrophic collapse -- a drastic degradation of the model's utilities. In this paper, we propose Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. We theoretically show that the progression toward catastrophic collapse by minimizing the NPO loss is exponentially slower than GA. Through experiments on synthetic data and the benchmark TOFU dataset, we demonstrate that NPO-based methods achieve a better balance between unlearning the undesirable data and maintaining the model's utilities. We also observe that NPO-based methods generate more sensible outputs than GA-based methods, whose outputs are often gibberish. Remarkably, on TOFU, NPO-based methods are the first to achieve reasonable unlearning results in forgetting 50% (or more) of the training data, whereas existing methods already struggle with forgetting 10% of training data.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

cs.LG · 2026-05-04 · conditional · novelty 8.0 · 2 refs

INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.

Machine Unlearning for Masked Diffusion Language Models

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

Inducing Artificial Uncertainty in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

PPU-Bench:Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.

ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.

Is your algorithm unlearning or untraining?

cs.LG · 2026-04-09 · conditional · novelty 7.0

Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).

Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty

cs.AI · 2026-06-23 · accept · novelty 6.0

Heuresis evaluates six search strategies for LLM research agents and shows they steer ideas along quality-diversity-novelty axes but fail to generate novel ideas that match or exceed known high-performing recipes.

ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

ASRU combines activation redirection and reward-optimized fine-tuning to unlearn cross-modal sensitive knowledge in MLLMs, reporting +24.6% better unlearning effectiveness and 5.8x higher generation quality on Qwen3-VL while preserving utility with limited retained data.

Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning

cs.LG · 2026-05-09 · conditional · novelty 6.0

Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.

Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.

Threshold-Guided Optimization for Visual Generative Models

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

PrivUn shows privacy unlearning in LLMs produces gradient-driven ripple effects and only shallow forgetting across layers, with new strategies proposed for deeper removal.

Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

cs.LG · 2026-02-27 · unverdicted · novelty 6.0

MPU is a framework that achieves privacy-preserving unlearning for LLMs by distributing perturbed model copies for local client-side unlearning followed by server-side aggregation with harmonic denoising.

Sparse Concept Anchoring for Interpretable and Controllable Neural Representations

cs.LG · 2025-12-13 · unverdicted · novelty 6.0

Sparse Concept Anchoring biases neural latent spaces toward targeted concepts using under 0.1% labels per concept, enabling reversible steering via projection and permanent removal via weight ablation with minimal side effects on other features.

Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

cs.LG · 2025-10-01 · conditional · novelty 6.0

Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

cs.CL · 2025-05-22 · unverdicted · novelty 6.0

Machine unlearning in LLMs is often reversible via fine-tuning, indicating suppression not deletion, and a new representation-level framework identifies four forgetting regimes based on reversibility and catastrophicity.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 5.0

Unlearned language models retain low calibration error but show increased shortcut reliance on the TOFU benchmark, extending the reliability paradox to machine unlearning.

Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training

cs.LG · 2026-04-21 · unverdicted · novelty 5.0

TabGRAA applies group-relative advantage alignment in an iterative reward-guided post-training loop to improve tabular language model generators on fidelity, utility, and privacy trade-offs across five benchmarks.

Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

cs.LG · 2025-12-17 · unverdicted · novelty 5.0

A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.

citing papers explorer

Showing 27 of 27 citing papers.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models cs.CR · 2026-05-14 · conditional · none · ref 54 · internal anchor
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning cs.LG · 2026-05-04 · conditional · none · ref 8 · 2 links · internal anchor
INT4 quantization recovers up to 22 times more forgotten training data in unlearned LLMs, and the proposed DURABLEUN-SAF method is the first to maintain forgetting across BF16, INT8, and INT4 precisions.
Machine Unlearning for Masked Diffusion Language Models cs.CL · 2026-05-18 · unverdicted · none · ref 7 · internal anchor
MDU minimizes forward KL divergence from prompt-conditional to prompt-masked unconditional predictions at masked positions to unlearn knowledge in MDLMs while trading off privacy and utility via temperature scaling.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation cs.CL · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
Inducing Artificial Uncertainty in Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 41 · internal anchor
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
PPU-Bench:Real World Benchmark for Personalized Partial Unlearning in Vision Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 26 · internal anchor
PPU-Bench is a real-world benchmark exposing forget-retain trade-offs in MLLM unlearning and motivating Boundary-Aware Optimization to enforce intra-subject factual boundaries.
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory cs.AI · 2026-05-08 · unverdicted · none · ref 20 · internal anchor
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models cs.AI · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
Is your algorithm unlearning or untraining? cs.LG · 2026-04-09 · conditional · none · ref 35 · internal anchor
Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty cs.AI · 2026-06-23 · accept · none · ref 72 · internal anchor
Heuresis evaluates six search strategies for LLM research agents and shows they steer ideas along quality-diversity-novelty axes but fail to generate novel ideas that match or exceed known high-performing recipes.
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models cs.CL · 2026-05-15 · unverdicted · none · ref 26 · internal anchor
ASRU combines activation redirection and reward-optimized fine-tuning to unlearn cross-modal sensitive knowledge in MLLMs, reporting +24.6% better unlearning effectiveness and 5.8x higher generation quality on Qwen3-VL while preserving utility with limited retained data.
Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning cs.LG · 2026-05-09 · conditional · none · ref 8 · internal anchor
Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.
Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning cs.AI · 2026-05-07 · unverdicted · none · ref 15 · internal anchor
A contrastive visual forgetting technique constrained to the null space of retained knowledge enables targeted unlearning of visual concepts in MLLMs while preserving non-target visual and all textual knowledge.
Threshold-Guided Optimization for Visual Generative Models cs.LG · 2026-05-06 · unverdicted · none · ref 28 · internal anchor
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning cs.LG · 2026-04-23 · unverdicted · none · ref 21 · internal anchor
PrivUn shows privacy unlearning in LLMs produces gradient-driven ripple effects and only shallow forgetting across layers, with new strategies proposed for deeper removal.
Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies cs.AI · 2026-04-23 · unverdicted · none · ref 8 · internal anchor
A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models cs.LG · 2026-02-27 · unverdicted · none · ref 15 · internal anchor
MPU is a framework that achieves privacy-preserving unlearning for LLMs by distributing perturbed model copies for local client-side unlearning followed by server-side aggregation with harmonic denoising.
Sparse Concept Anchoring for Interpretable and Controllable Neural Representations cs.LG · 2025-12-13 · unverdicted · none · ref 34 · internal anchor
Sparse Concept Anchoring biases neural latent spaces toward targeted concepts using under 0.1% labels per concept, enabling reversible steering via projection and permanent removal via weight ablation with minimal side effects on other features.
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning cs.LG · 2025-10-01 · conditional · none · ref 10 · internal anchor
Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs cs.CL · 2025-05-22 · unverdicted · none · ref 43 · internal anchor
Machine unlearning in LLMs is often reversible via fine-tuning, indicating suppression not deletion, and a new representation-level framework identifies four forgetting regimes based on reversibility and catastrophicity.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 89 · internal anchor
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 69 · internal anchor
Unlearned language models retain low calibration error but show increased shortcut reliance on the TOFU benchmark, extending the reliability paradox to machine unlearning.
Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training cs.LG · 2026-04-21 · unverdicted · none · ref 8 · internal anchor
TabGRAA applies group-relative advantage alignment in an iterative reward-guided post-training loop to improve tabular language model generators on fidelity, utility, and privacy trade-offs across five benchmarks.
Hard Negative Sample-Augmented DPO Post-Training for Small Language Models cs.LG · 2025-12-17 · unverdicted · none · ref 11 · internal anchor
A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.
What Is Preference Optimization Doing, and Why? cs.LG · 2025-11-30 · unverdicted · none · ref 25 · internal anchor
Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.
Revisiting the Past: Data Unlearning with Model State History cs.LG · 2025-06-26 · unverdicted · none · ref 37 · internal anchor
MSA performs data unlearning in LLMs by arithmetic operations on prior model checkpoints to remove targeted datapoint influence, with experiments showing competitive or better results than existing unlearning methods.
Revisiting Privacy Leakage in Machine Unlearning: Membership Inference Beyond the Forgotten Set cs.CR · 2026-05-01 · unreviewed · ref 54 · internal anchor

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer