Continual Safety Alignment via Gradient-Based Sample Selection
Pith reviewed 2026-05-10 07:13 UTC · model grok-4.3
The pith
Filtering out high-gradient samples during fine-tuning preserves safety alignment in language models while they learn new tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training samples contribute unequally to alignment degradation. High-gradient samples during fine-tuning cause greater safety loss by pushing the model toward its pretrained distribution, while moderate-gradient samples enable effective task learning with minimal alignment impact. Selecting only moderate-gradient samples therefore maintains refusal behaviors, truthfulness, and commonsense reasoning at competitive levels across continual domain adaptation scenarios.
What carries the argument
Gradient-based sample selection that excludes high-gradient training samples to reduce safety degradation during fine-tuning.
Load-bearing premise
Gradient magnitude reliably flags the samples most responsible for safety drift, and removing those samples will not block the model from learning the new task on the remaining data.
What would settle it
An experiment in which safety still degrades when training only on moderate-gradient samples, or in which task performance collapses after high-gradient samples are removed.
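Read as an algorithm, the selection rule described above is a one-line filter over per-sample gradient norms. A minimal sketch, assuming the norms are precomputed and a fixed keep ratio (`keep_ratio` is a hypothetical parameter name; the paper's exact thresholding scheme is not reproduced in this review):

```python
# Sketch only: filter out the high-gradient tail before fine-tuning.
# `keep_ratio` and the lognormal stand-in data are illustrative,
# not taken from the paper.
import numpy as np

def select_moderate_gradient(grad_norms: np.ndarray, keep_ratio: float = 0.8) -> np.ndarray:
    """Indices of samples whose gradient norm falls at or below the
    keep_ratio-quantile, i.e. everything except the high-gradient tail."""
    threshold = np.quantile(grad_norms, keep_ratio)
    return np.where(grad_norms <= threshold)[0]

# Usage: norms[i] is the per-sample gradient norm of training sample i.
norms = np.random.lognormal(mean=0.0, sigma=1.0, size=10_000)  # stand-in values
kept = select_moderate_gradient(norms, keep_ratio=0.8)
print(f"kept {len(kept)} of {len(norms)} samples")
```

Note the asymmetry: only the high-gradient tail is dropped, so the retained set is the moderate (and low) gradient mass that the core claim says carries task learning.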
original abstract
Large language models require continuous adaptation to new tasks while preserving safety alignment. However, fine-tuning on even benign data often compromises safety behaviors, including refusal of harmful requests, truthfulness, and commonsense reasoning. We investigate which training samples cause alignment drift through a data-centric lens. Our empirical analysis shows samples contribute unequally: high-gradient samples cause greater safety degradation and drive models toward pretrained distributions, while moderate-gradient samples enable task learning with minimal alignment loss. We propose gradient-based sample selection that filters high-gradient samples during fine-tuning. Across multiple model families on continual domain tasks, our method substantially improves alignment preservation while maintaining competitive task performance, without requiring curated safe data or architectural modifications. Our method is robust across selection ratios, task orderings, and diverse attack benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the role of individual training samples in causing safety alignment drift during continual fine-tuning of LLMs on new tasks. Through empirical analysis, it claims that samples with high gradient magnitudes (computed w.r.t. the fine-tuning loss) disproportionately drive safety degradation and push models toward pretrained distributions, while moderate-gradient samples support task learning with less alignment loss. The authors propose a simple gradient-based sample selection method that filters high-gradient samples during training. They report that this approach substantially improves preservation of safety behaviors (e.g., refusal of harmful requests, truthfulness) while maintaining competitive task performance across multiple model families, continual domain tasks, selection ratios, task orderings, and attack benchmarks, without requiring curated safe data or architectural modifications.
Significance. If the central empirical claims hold after addressing the noted gaps, this offers a practical, data-centric solution to the safety-forgetting problem in continual LLM adaptation. The approach is lightweight, requires no extra data or model changes, and provides an interpretable view into sample contributions to alignment drift. This could be valuable for real-world deployment where models must adapt to new domains while retaining safety properties. The reported robustness across settings and model families strengthens potential impact if the method generalizes beyond the evaluated cases.
major comments (3)
- [Abstract, §4 (Experimental Setup)] The per-sample gradient norm computation is not specified: the exact loss function used (task loss only, or including safety terms?), the norm type, and whether gradients are taken w.r.t. all parameters or a subset. This detail is load-bearing for the selection rule and for reproducing the claim that high-gradient samples cause greater degradation.
- [§4.2 (Empirical Analysis)] The observation that high-gradient samples drive greater safety degradation lacks controls or ablations for obvious confounders such as token length, perplexity, or sample difficulty. Without these, it remains unclear whether gradient magnitude is causally linked to alignment drift or spuriously correlated, undermining the motivation for the filtering method.
- [§4.3 (Robustness and Generalization)] No experiment tests whether the same high-gradient filtering rule preserves (or harms) performance when the safety metric is replaced by an unrelated downstream task. This is needed to confirm that the selection selectively protects alignment rather than simply discarding informative or difficult samples in general.
minor comments (2)
- [§3] Clarify the precise definition of 'selection ratio' and how the threshold for 'high-gradient' is determined (fixed percentile, adaptive, etc.) in the method description.
- [§4] Add statistical significance tests or error bars to the reported gains across benchmarks to strengthen the 'consistent empirical gains' claim.
Simulated Author's Rebuttal
We are grateful to the referee for their thorough review and valuable suggestions. These comments have helped us improve the clarity and rigor of our work. Below, we respond to each major comment in detail. In cases where additional details or experiments were requested, we have updated the manuscript accordingly and indicate the changes.
point-by-point responses
Referee: [Abstract, §4 (Experimental Setup)] The per-sample gradient norm computation is not specified: the exact loss function used (task loss only, or including safety terms?), the norm type, and whether gradients are taken w.r.t. all parameters or a subset. This detail is load-bearing for the selection rule and for reproducing the claim that high-gradient samples cause greater degradation.
Authors: We thank the referee for highlighting this important omission. In the revised manuscript, we have added a precise specification in §4 (Experimental Setup) and a dedicated paragraph in §3.1. The per-sample gradient norm is computed exclusively with respect to the task fine-tuning loss (cross-entropy on the new domain data only), using the L2 norm over the full set of model parameters. No safety-related loss terms are included, as the method is intentionally data-centric and does not require curated safety data. Pseudocode for the full selection procedure has also been added to the appendix to ensure full reproducibility. revision: yes
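The specification above pins down the quantity being thresholded. A minimal sketch of one way to compute it, assuming a Hugging Face-style causal LM whose forward pass returns the cross-entropy loss when labels are supplied (an illustration of the stated definition, not the authors' code):

```python
# Sketch: per-sample L2 gradient norm of the task cross-entropy loss,
# taken over all model parameters, with no safety terms, matching the
# definition given in the response. One full backward pass per sample,
# so this is expensive for large models.
import torch

def per_sample_grad_norm(model: torch.nn.Module, input_ids: torch.Tensor) -> float:
    """L2 norm of d(cross-entropy)/d(theta) for a single sample.
    input_ids has shape (1, seq_len); labels=input_ids gives the
    standard causal-LM loss (the shift happens inside the model)."""
    model.zero_grad(set_to_none=True)
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    squared = sum(p.grad.pow(2).sum() for p in model.parameters() if p.grad is not None)
    return squared.sqrt().item()
```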
Referee: [§4.2 (Empirical Analysis)] The observation that high-gradient samples drive greater safety degradation lacks controls or ablations for obvious confounders such as token length, perplexity, or sample difficulty. Without these, it remains unclear whether gradient magnitude is causally linked to alignment drift or spuriously correlated, undermining the motivation for the filtering method.
Authors: This concern is well-taken. To isolate the role of gradient magnitude, we have added new controlled ablations to the revised §4.2. We construct matched subsets of high-gradient samples versus moderate-gradient samples that are balanced on token length, base-model perplexity, and per-sample loss (as a proxy for difficulty). Even after matching, high-gradient samples produce substantially larger safety degradation across refusal, truthfulness, and commonsense metrics. These results indicate that gradient magnitude captures effects beyond the listed confounders and reinforce the motivation for the proposed filtering rule. revision: yes
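The matching protocol itself is not spelled out in this review. A hedged sketch of one standard construction, which bins samples on the confounders and draws equal counts of high- and moderate-gradient samples per bin (column names, bin counts, and quantile cutoffs are illustrative assumptions, not the authors' settings):

```python
# Sketch: build confounder-matched high- vs moderate-gradient subsets.
# df is assumed to have one row per training sample with columns
# grad_norm, n_tokens (token length), and ppl (base-model perplexity);
# a per-sample-loss bin could be added the same way.
import pandas as pd

def matched_subsets(df: pd.DataFrame, n_bins: int = 10):
    """Return (high_idx, mod_idx): equal-sized index lists balanced on
    token-length and perplexity bins."""
    df = df.copy()
    df["len_bin"] = pd.qcut(df["n_tokens"], n_bins, labels=False, duplicates="drop")
    df["ppl_bin"] = pd.qcut(df["ppl"], n_bins, labels=False, duplicates="drop")
    hi_cut = df["grad_norm"].quantile(0.9)                  # "high-gradient" tail
    mid_lo, mid_hi = df["grad_norm"].quantile([0.4, 0.6])   # "moderate" band
    high_idx, mod_idx = [], []
    for _, cell in df.groupby(["len_bin", "ppl_bin"]):
        high = cell[cell["grad_norm"] >= hi_cut]
        mod = cell[cell["grad_norm"].between(mid_lo, mid_hi)]
        k = min(len(high), len(mod))  # same count per bin keeps subsets balanced
        high_idx += list(high.index[:k])
        mod_idx += list(mod.index[:k])
    return high_idx, mod_idx
```

Fine-tuning separately on the two matched subsets and comparing safety metrics is then the controlled test the referee asked for.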
Referee: [§4.3 (Robustness and Generalization)] No experiment tests whether the same high-gradient filtering rule preserves (or harms) performance when the safety metric is replaced by an unrelated downstream task. This is needed to confirm that the selection selectively protects alignment rather than simply discarding informative or difficult samples in general.
Authors: We agree that a direct test of selectivity is valuable. In the revised §4.3 we report a new experiment in which the safety evaluation is replaced by performance on a held-out unrelated downstream task (a 5k-example subset of MMLU). Applying the identical high-gradient filtering rule yields task performance that is statistically indistinguishable from or slightly better than standard fine-tuning, while the same rule continues to protect safety when safety metrics are restored. This supports that the method does not indiscriminately remove informative samples and is particularly effective for alignment preservation. revision: yes
Circularity Check
No circularity: empirical heuristic with independent experimental validation
full rationale
The paper proposes a gradient-magnitude filter as a practical heuristic for continual fine-tuning, justified by direct empirical observations on sample contributions rather than any closed-form derivation. No equations reduce a claimed prediction to fitted inputs by construction, no self-citation chain supplies the core premise, and the selection rule is not tautological with the reported performance metrics. The method is presented as data-centric and externally testable across models and benchmarks, satisfying the criteria for a self-contained, non-circular contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The gradient norm of a sample with respect to the model parameters indicates the degree to which that sample will shift the model away from its current safety-aligned state.