pith. machine review for the scientific record.

arxiv: 2604.14808 · v1 · submitted 2026-04-16 · 💻 cs.CL

Recognition: unknown

Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM unlearning · machine unlearning · gradient synthesis · gradient conflict resolution · asymmetric multi-task learning · retention preservation · forgetting mechanisms

The pith

Reshaping gradient geometry, rather than re-balancing losses, mitigates the trade-off in LLM unlearning between forgetting specific knowledge and retaining general capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models LLM unlearning as an asymmetric two-task learning problem in which retaining general model capabilities is the primary goal and forgetting targeted knowledge is secondary. It proposes a retention-prioritized gradient synthesis framework that decouples extraction of task-specific gradients from their conflict-aware combination. Instantiations, including an adaptation of PCGrad and a new method called SAGO, ensure that the combined update has non-negative cosine similarity with the retention gradient, with SAGO achieving strictly tighter alignment through constructive sign-constrained synthesis. Experiments on the WMDP Bio/Cyber and RWKU benchmarks show these methods improve retention metrics, for example raising MMLU recovery from 44.6 percent to 96 percent in one setting, while preserving comparable forgetting strength. This matters because effective unlearning is required to remove harmful or private data from deployed models without destroying their broader utility.

Core claim

By recasting LLM unlearning as an asymmetric two-task problem with retention as the primary objective, the authors demonstrate that a gradient synthesis framework can resolve conflicts between retention and forgetting gradients. Both PCGrad and the new SAGO method guarantee non-negative cosine similarity between the synthesized gradient and the retain gradient, while SAGO further tightens alignment via sign-constrained synthesis. On WMDP benchmarks this produces superior Pareto fronts, for instance recovering 96 percent of target model MMLU performance under SimNPO+GD while maintaining forgetting strength, showing that gradient geometry reshaping outperforms simple loss re-balancing.

What carries the argument

The retention-prioritized gradient synthesis framework, which separates task-specific gradient extraction from conflict-aware combination and enforces non-negative cosine similarity with the retain gradient. SAGO is the novel sign-constrained instantiation that achieves strictly tighter alignment.
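
To make the synthesis step concrete, the sketch below shows what a retention-prioritized combination rule can look like, assuming the per-step gradients are flattened into vectors g_r (retain) and g_f (forget). The pcgrad_combine function follows the standard PCGrad projection; sago_combine_sketch is only one plausible reading of "sign-constrained synthesis" (per-coordinate sign agreement), not the paper's exact SAGO construction. Both keep the synthesized update at non-negative cosine similarity with g_r, the invariant the theoretical guarantees rest on.

```python
# Minimal sketch of retention-prioritized gradient synthesis, assuming the
# per-step gradients are flattened into vectors g_r (retain) and g_f (forget).
# pcgrad_combine follows the standard PCGrad projection; sago_combine_sketch is
# a hypothetical per-coordinate sign-agreement rule, not the paper's exact SAGO.
import numpy as np

def pcgrad_combine(g_r: np.ndarray, g_f: np.ndarray) -> np.ndarray:
    """If g_f conflicts with g_r (negative dot product), project g_f onto the
    subspace orthogonal to g_r before summing with g_r."""
    dot = float(g_f @ g_r)
    if dot < 0.0:
        g_f = g_f - (dot / (float(g_r @ g_r) + 1e-12)) * g_r
    # The synthesized update g satisfies g @ g_r = ||g_r||^2 + max(dot, 0) >= 0.
    return g_r + g_f

def sago_combine_sketch(g_r: np.ndarray, g_f: np.ndarray) -> np.ndarray:
    """Hypothetical sign-constrained synthesis: keep only the forget-gradient
    coordinates whose sign agrees with the retain gradient, then sum."""
    agree = np.sign(g_f) == np.sign(g_r)
    # Every kept coordinate of g_f pushes the same way as g_r, so the update g
    # satisfies g @ g_r >= ||g_r||^2 >= 0.
    return g_r + np.where(agree, g_f, 0.0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
g_r, g_f = rng.normal(size=1000), rng.normal(size=1000)
print(cosine(pcgrad_combine(g_r, g_f), g_r))       # non-negative by construction
print(cosine(sago_combine_sketch(g_r, g_f), g_r))  # non-negative by construction
```

In either variant the retain direction is never traded away, which is the operational meaning of treating retention as the primary task.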

If this is right

  • Unlearning procedures can be made to recover substantially more original model capability while still achieving the required forgetting level.
  • Multi-task gradient conflict techniques can be directly transferred to unlearning settings once the retention task is designated primary.
  • Future unlearning algorithms should prioritize geometric alignment of gradients over scalar loss weighting.
  • SAGO offers a drop-in replacement for existing gradient-combination steps that yields measurable retention gains on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gradient-synthesis logic could extend to other asymmetric objectives in LLMs, such as safety alignment where a primary capability must be preserved alongside an auxiliary constraint.
  • If the non-negative similarity condition proves sufficient across model scales, it may enable more parameter-efficient unlearning that avoids full retraining or heavy fine-tuning.
  • The approach invites direct comparison with continual-learning methods that also seek to avoid catastrophic interference between old and new objectives.

Load-bearing premise

The assumption that gradient conflicts between retention and forgetting objectives are the dominant source of the performance trade-off and that non-negative cosine similarity with the retain gradient is sufficient to preserve capability.

What would settle it

A controlled experiment on the same WMDP Bio or Cyber benchmarks in which a loss-rebalancing baseline, with no gradient synthesis, achieves retention-forgetting Pareto performance equivalent to or better than SAGO's would falsify the claim that reshaping gradient geometry is the decisive factor.

Figures

Figures reproduced from arXiv: 2604.14808 by Guanhua Chen, Jian Yang, Siqing Li, Xuetao Wei, Yong Wang, Yun Chen, Zeguan Xiao.

Figure 1
Figure 1: Visualization of loss dynamics in LLM unlearning and retention-prioritized frameworks. Panels (a) and (b) show retain and forget losses on the WMDP Biosecurity benchmark using GradDiff, PCGrad, and SAGO. SAGO outperforms in maintaining low retain loss while achieving high forget performance, indicating reduced gradient conflicts and improved retention. Panel (c) shows that while GradDiff (Original) struggl… view at source ↗
Figure 2
Figure 2: Illustration of final update gradients (red) in PCGrad (a) and SAGO (b). For PCGrad, the forget gradient (gf) is projected orthogonally onto the retain gradient (gr), and the resulting projected vector is then combined with gr. For SAGO, when the two gradients conflict, gr is used, and when the gradients align, gf is applied. The updates produced by SAGO demonstrate a higher degree of alignment with gr.… view at source ↗
Figure 3
Figure 3: Performance comparison across different methods on WMDP benchmark. The plot provides a visualization of the trade-offs between forget and retain performance on WMDP Bio (left) and Cyber (right). A smaller forget metric indicates better forgetting effectiveness, while a larger MMLU value reflects better retention performance. Dashed lines connect base methods to their enhanced variants within the same famil… view at source ↗
Figure 4
Figure 4: Performance comparison across different methods on RWKU benchmark. The plot provides a visualization of the trade-offs between forget and retain performance on RWKU. A smaller forgetting ROUGE-L indicates better forgetting effectiveness, while a larger retention ROUGE-L reflects better retention performance. The reported results highlight that SAGO produces a better frontier than baselines and PCGrad. … view at source ↗
read the original abstract

Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is an auxiliary. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt established PCGrad to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of target model MMLU performance progresses from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper recasts LLM unlearning as an asymmetric two-task problem (retention primary, forgetting auxiliary) and proposes a retention-prioritized gradient synthesis framework. It adapts PCGrad for conflict resolution and introduces SAGO, a sign-constrained synthesis method. Both ensure non-negative cosine similarity with the retain gradient (SAGO strictly tighter), with empirical results on WMDP Bio/Cyber and RWKU showing MMLU recovery gains (e.g., 44.6% naive to 96.0% +SAGO on WMDP Bio with SimNPO+GD) while preserving forgetting strength. The central claim is that reshaping gradient geometry, not loss re-balancing, mitigates the trade-off.

Significance. If the results and attribution hold, the work offers a useful reframing of unlearning via gradient geometry with concrete theoretical cosine-similarity guarantees and consistent Pareto-frontier improvements on standard benchmarks. Credit is due for the explicit non-negative alignment bounds and the reproducible-style empirical protocol on public datasets. The significance is reduced by the missing isolation of geometry from balancing, which leaves the key mechanistic claim under-supported.

major comments (2)
  1. [Abstract and §5] Abstract and §5 (Experiments): The claim that 're-shaping gradient geometry, rather than re-balancing losses, is the key' is load-bearing but unsupported by direct evidence. No ablation holds the gradient-combination rule fixed while varying only the scalar loss weight λ in a baseline L = L_retain + λ L_forget (with a grid search on λ) to produce a comparable Pareto curve. The reported gains (WMDP Bio row: naive 44.6% → +PCGrad 94.0% → +SAGO 96.0%) are measured against a naive joint-optimization baseline that may already embed implicit balancing, so the attribution to geometry remains unisolated. A sketch of such a λ-sweep check appears after the minor comments below.
  2. [§3 and §4] §3 (Method) and §4 (Theory): The framework assumes gradient conflicts (negative cosine similarity between retain and forget gradients) are the dominant source of the trade-off and that enforcing non-negative cosine similarity is sufficient to preserve capability. While Theorem 1 and the PCGrad/SAGO constructions guarantee non-negative alignment, the paper provides no diagnostic (e.g., correlation between achieved cosine similarity and downstream MMLU recovery across runs) to confirm this is the primary mechanism rather than magnitude or higher-order effects.
minor comments (2)
  1. [§5.2] §5.2: Table captions should explicitly state the exact loss formulation and hyper-parameters used for the 'naive' baseline to allow direct reproduction.
  2. [§3] Notation: The distinction between g_r (retain gradient) and g_f (forget gradient) is introduced late; early standardization in §3 would improve readability.
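
To make major comment 1 concrete, the following is a hypothetical sketch of the requested ablation, not code from the paper: hold the gradient-combination rule fixed at a plain weighted sum L = L_retain + λ·L_forget, grid-search λ, and check whether any resulting operating point Pareto-dominates a SAGO point at matched forgetting strength. The run_rebalanced callable and the example numbers below are placeholders.

```python
# Hypothetical sketch of the lambda-sweep baseline from major comment 1.
# Each run would train with L = L_retain + lam * L_forget and report
# (forget_metric, mmlu); lower forget_metric and higher MMLU are better.
# run_rebalanced stands in for the actual training/evaluation pipeline.
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # (forget_metric, mmlu)

def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it forgets at least as well and retains at least as
    much MMLU, and is strictly better on at least one axis."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b

def lambda_sweep(lambdas: List[float],
                 run_rebalanced: Callable[[float], Point]) -> List[Point]:
    """Trace the loss-rebalancing Pareto curve over a grid of lambda values."""
    return [run_rebalanced(lam) for lam in lambdas]

def geometry_attribution_survives(sweep: List[Point], sago_point: Point) -> bool:
    """False if pure loss re-balancing already dominates the SAGO operating
    point, which would undercut the attribution to gradient geometry."""
    return not any(dominates(p, sago_point) for p in sweep)

# Placeholder usage with made-up numbers, shown only to illustrate the check:
fake_sweep = [(0.30, 0.60), (0.27, 0.52), (0.25, 0.45)]
print(geometry_attribution_survives(fake_sweep, sago_point=(0.26, 0.96)))
```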

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major comment below and will revise the manuscript to incorporate additional ablations and diagnostics that strengthen the empirical grounding of our claims.

read point-by-point responses
  1. Referee: [Abstract and §5] The claim that 're-shaping gradient geometry, rather than re-balancing losses, is the key' is load-bearing but unsupported by direct evidence. No ablation holds the gradient combination rule fixed while varying only the scalar loss weight λ in a baseline L = L_retain + λ L_forget (with grid search on λ) to produce a comparable Pareto curve. The reported gains (WMDP Bio row: naive 44.6% → +PCGrad 94.0% → +SAGO 96.0%) are measured against a naive joint-optimization baseline that may already embed implicit balancing, so the attribution to geometry remains unisolated.

    Authors: We agree that isolating the effect of gradient geometry from loss re-balancing requires a direct comparison against an optimized scalar weighting. Our current naive baseline uses the conventional fixed λ=1, which is standard in the literature but not necessarily optimal. To address this, we will add a new ablation in the revised §5 that performs a grid search over λ for the naive joint-optimization baseline and plots the resulting Pareto curves alongside PCGrad and SAGO on the WMDP Bio and Cyber benchmarks. This will allow readers to see whether geometry-based synthesis still yields gains beyond the best achievable loss balancing. revision: yes

  2. Referee: [§3 and §4] The framework assumes gradient conflicts (negative cosine similarity between retain and forget gradients) are the dominant source of the trade-off and that enforcing non-negative cosine similarity is sufficient to preserve capability. While Theorem 1 and the PCGrad/SAGO constructions guarantee non-negative alignment, the paper provides no diagnostic (e.g., correlation between achieved cosine similarity and downstream MMLU recovery across runs) to confirm this is the primary mechanism rather than magnitude or higher-order effects.

    Authors: We concur that a direct diagnostic linking achieved cosine similarity to capability recovery would strengthen the mechanistic interpretation. Although the theoretical guarantees and consistent empirical improvements support the role of alignment, we did not report such a correlation. In the revision we will add a diagnostic analysis (new figure or table in §5) that computes and correlates the cosine similarity between the synthesized gradient and the retain gradient with MMLU recovery across methods and random seeds. This will help verify whether alignment is the primary driver or whether magnitude and higher-order effects also contribute. revision: yes
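
The second response suggests a diagnostic that is easy to sketch under assumed placeholder data: for each (method, seed) run, log the mean cosine similarity between the synthesized update and the retain gradient, then correlate it with MMLU recovery. The arrays below are illustrative stand-ins, not measurements from the paper.

```python
# Sketch of the alignment-vs-recovery diagnostic from response 2. One entry per
# (method, seed) run; the numbers are illustrative placeholders only.
import numpy as np

# Mean cosine similarity between the synthesized update and the retain gradient,
# averaged over training steps, and the corresponding MMLU recovery per run.
mean_cosine_to_retain = np.array([0.05, 0.12, 0.31, 0.42, 0.55, 0.61])
mmlu_recovery = np.array([0.45, 0.58, 0.81, 0.88, 0.94, 0.96])

# A strong positive correlation would support alignment as the primary driver;
# a weak one would point to gradient magnitude or higher-order effects.
r = float(np.corrcoef(mean_cosine_to_retain, mmlu_recovery)[0, 1])
print(f"Pearson r between alignment and MMLU recovery: {r:.3f}")
```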

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper recasts unlearning as an asymmetric two-task problem and constructs a retention-prioritized gradient synthesis framework from this perspective, adapting the external PCGrad method and introducing the novel SAGO synthesis rule. Theoretical guarantees on non-negative cosine similarity are derived directly from the proposed sign-constrained combination rule rather than being presupposed, and empirical Pareto improvements on WMDP/RWKU benchmarks are reported as independent measurements. No equations reduce a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported via self-citation, and the central attribution to gradient geometry reshaping rests on the new synthesis construction rather than renaming or re-deriving prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that gradient conflicts are the main driver of the unlearning trade-off and that alignment with the retain gradient is a sufficient proxy for capability preservation; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Gradient conflicts between retention and forgetting objectives are the primary cause of performance degradation in LLM unlearning
    This premise motivates the entire gradient-synthesis framework and the focus on cosine similarity.

pith-pipeline@v0.9.0 · 5535 in / 1276 out tokens · 57826 ms · 2026-05-10T11:35:23.687423+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Representation-Guided Parameter-Efficient LLM Unlearning

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · cited by 1 Pith paper

  1. [1]

    Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. arXiv preprint arXiv:2410.07163.

  2. [2]

    Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities.

  3. [3]

    GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. Preprint, arXiv:2308.06463.