Continual Safety Alignment via Gradient-Based Sample Selection
Pith reviewed 2026-05-10 07:13 UTC · model grok-4.3
The pith
Filtering out high-gradient samples during fine-tuning preserves safety alignment in language models while they learn new tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training samples contribute unequally to alignment degradation. High-gradient samples during fine-tuning cause greater safety loss by pushing the model toward its pretrained distribution, while moderate-gradient samples enable effective task learning with minimal alignment impact. Selecting only moderate-gradient samples therefore maintains refusal behaviors, truthfulness, and commonsense reasoning at competitive levels across continual domain adaptation scenarios.
What carries the argument
Gradient-based sample selection that excludes high-gradient training samples to reduce safety degradation during fine-tuning.
Load-bearing premise
Gradient magnitude reliably flags the samples most responsible for safety drift, and removing those samples will not block the model from learning the new task on the remaining data.
What would settle it
An experiment in which safety still degrades when training only on moderate-gradient samples, or in which task performance collapses after high-gradient samples are removed.
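Read as an algorithm, the selection rule described above is a one-line filter over per-sample gradient norms. A minimal sketch, assuming the norms are precomputed and a fixed keep ratio (`keep_ratio` is a hypothetical parameter name; the paper's exact thresholding scheme is not reproduced in this review):

```python
# Sketch only: filter out the high-gradient tail before fine-tuning.
# `keep_ratio` and the lognormal stand-in data are illustrative,
# not taken from the paper.
import numpy as np

def select_moderate_gradient(grad_norms: np.ndarray, keep_ratio: float = 0.8) -> np.ndarray:
    """Indices of samples whose gradient norm falls at or below the
    keep_ratio-quantile, i.e. everything except the high-gradient tail."""
    threshold = np.quantile(grad_norms, keep_ratio)
    return np.where(grad_norms <= threshold)[0]

# Usage: norms[i] is the per-sample gradient norm of training sample i.
norms = np.random.lognormal(mean=0.0, sigma=1.0, size=10_000)  # stand-in values
kept = select_moderate_gradient(norms, keep_ratio=0.8)
print(f"kept {len(kept)} of {len(norms)} samples")
```

Note the asymmetry: only the high-gradient tail is dropped, so the retained set is the moderate (and low) gradient mass that the core claim says carries task learning.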
original abstract
Large language models require continuous adaptation to new tasks while preserving safety alignment. However, fine-tuning on even benign data often compromises safety behaviors, including refusal of harmful requests, truthfulness, and commonsense reasoning. We investigate which training samples cause alignment drift through a data-centric lens. Our empirical analysis shows samples contribute unequally: high-gradient samples cause greater safety degradation and drive models toward pretrained distributions, while moderate-gradient samples enable task learning with minimal alignment loss. We propose gradient-based sample selection that filters high-gradient samples during fine-tuning. Across multiple model families on continual domain tasks, our method substantially improves alignment preservation while maintaining competitive task performance, without requiring curated safe data or architectural modifications. Our method is robust across selection ratios, task orderings, and diverse attack benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the role of individual training samples in causing safety alignment drift during continual fine-tuning of LLMs on new tasks. Through empirical analysis, it claims that samples with high gradient magnitudes (computed w.r.t. the fine-tuning loss) disproportionately drive safety degradation and push models toward pretrained distributions, while moderate-gradient samples support task learning with less alignment loss. The authors propose a simple gradient-based sample selection method that filters high-gradient samples during training. They report that this approach substantially improves preservation of safety behaviors (e.g., refusal of harmful requests, truthfulness) while maintaining competitive task performance across multiple model families, continual domain tasks, selection ratios, task orderings, and attack benchmarks, without requiring curated safe data or architectural modifications.
Significance. If the central empirical claims hold after addressing the noted gaps, this offers a practical, data-centric solution to the safety-forgetting problem in continual LLM adaptation. The approach is lightweight, requires no extra data or model changes, and provides an interpretable view into sample contributions to alignment drift. This could be valuable for real-world deployment where models must adapt to new domains while retaining safety properties. The reported robustness across settings and model families strengthens potential impact if the method generalizes beyond the evaluated cases.
major comments (3)
- [Abstract, §4 (Experimental Setup)] The per-sample gradient norm computation is not specified: the exact loss function used (task loss only, or including safety terms?), the norm type, and whether gradients are taken w.r.t. all parameters or a subset. This detail is load-bearing for the selection rule and for reproducing the claim that high-gradient samples cause greater degradation.
- [§4.2 (Empirical Analysis)] The observation that high-gradient samples drive greater safety degradation lacks controls or ablations for obvious confounders such as token length, perplexity, or sample difficulty. Without these, it remains unclear whether gradient magnitude is causally linked to alignment drift or spuriously correlated, undermining the motivation for the filtering method.
- [§4.3 (Robustness and Generalization)] No experiment tests whether the same high-gradient filtering rule preserves (or harms) performance when the safety metric is replaced by an unrelated downstream task. This is needed to confirm that the selection selectively protects alignment rather than simply discarding informative or difficult samples in general.
minor comments (2)
- [§3] Clarify the precise definition of 'selection ratio' and how the threshold for 'high-gradient' is determined (fixed percentile, adaptive, etc.) in the method description.
- [§4] Add statistical significance tests or error bars to the reported gains across benchmarks to strengthen the 'consistent empirical gains' claim.
Simulated Author's Rebuttal
We are grateful to the referee for their thorough review and valuable suggestions. These comments have helped us improve the clarity and rigor of our work. Below, we respond to each major comment in detail. In cases where additional details or experiments were requested, we have updated the manuscript accordingly and indicate the changes.
point-by-point responses
Referee: [Abstract, §4 (Experimental Setup)] The per-sample gradient norm computation is not specified: the exact loss function used (task loss only, or including safety terms?), the norm type, and whether gradients are taken w.r.t. all parameters or a subset. This detail is load-bearing for the selection rule and for reproducing the claim that high-gradient samples cause greater degradation.
Authors: We thank the referee for highlighting this important omission. In the revised manuscript, we have added a precise specification in §4 (Experimental Setup) and a dedicated paragraph in §3.1. The per-sample gradient norm is computed exclusively with respect to the task fine-tuning loss (cross-entropy on the new domain data only), using the L2 norm over the full set of model parameters. No safety-related loss terms are included, as the method is intentionally data-centric and does not require curated safety data. Pseudocode for the full selection procedure has also been added to the appendix to ensure full reproducibility. revision: yes
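The specification above pins down the quantity being thresholded. A minimal sketch of one way to compute it, assuming a Hugging Face-style causal LM whose forward pass returns the cross-entropy loss when labels are supplied (an illustration of the stated definition, not the authors' code):

```python
# Sketch: per-sample L2 gradient norm of the task cross-entropy loss,
# taken over all model parameters, with no safety terms, matching the
# definition given in the response. One full backward pass per sample,
# so this is expensive for large models.
import torch

def per_sample_grad_norm(model: torch.nn.Module, input_ids: torch.Tensor) -> float:
    """L2 norm of d(cross-entropy)/d(theta) for a single sample.
    input_ids has shape (1, seq_len); labels=input_ids gives the
    standard causal-LM loss (the shift happens inside the model)."""
    model.zero_grad(set_to_none=True)
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    squared = sum(p.grad.pow(2).sum() for p in model.parameters() if p.grad is not None)
    return squared.sqrt().item()
```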
Referee: [§4.2 (Empirical Analysis)] The observation that high-gradient samples drive greater safety degradation lacks controls or ablations for obvious confounders such as token length, perplexity, or sample difficulty. Without these, it remains unclear whether gradient magnitude is causally linked to alignment drift or spuriously correlated, undermining the motivation for the filtering method.
Authors: This concern is well-taken. To isolate the role of gradient magnitude, we have added new controlled ablations to the revised §4.2. We construct matched subsets of high-gradient samples versus moderate-gradient samples that are balanced on token length, base-model perplexity, and per-sample loss (as a proxy for difficulty). Even after matching, high-gradient samples produce substantially larger safety degradation across refusal, truthfulness, and commonsense metrics. These results indicate that gradient magnitude captures effects beyond the listed confounders and reinforce the motivation for the proposed filtering rule. revision: yes
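The matching protocol itself is not spelled out in this review. A hedged sketch of one standard construction, which bins samples on the confounders and draws equal counts of high- and moderate-gradient samples per bin (column names, bin counts, and quantile cutoffs are illustrative assumptions, not the authors' settings):

```python
# Sketch: build confounder-matched high- vs moderate-gradient subsets.
# df is assumed to have one row per training sample with columns
# grad_norm, n_tokens (token length), and ppl (base-model perplexity);
# a per-sample-loss bin could be added the same way.
import pandas as pd

def matched_subsets(df: pd.DataFrame, n_bins: int = 10):
    """Return (high_idx, mod_idx): equal-sized index lists balanced on
    token-length and perplexity bins."""
    df = df.copy()
    df["len_bin"] = pd.qcut(df["n_tokens"], n_bins, labels=False, duplicates="drop")
    df["ppl_bin"] = pd.qcut(df["ppl"], n_bins, labels=False, duplicates="drop")
    hi_cut = df["grad_norm"].quantile(0.9)                  # "high-gradient" tail
    mid_lo, mid_hi = df["grad_norm"].quantile([0.4, 0.6])   # "moderate" band
    high_idx, mod_idx = [], []
    for _, cell in df.groupby(["len_bin", "ppl_bin"]):
        high = cell[cell["grad_norm"] >= hi_cut]
        mod = cell[cell["grad_norm"].between(mid_lo, mid_hi)]
        k = min(len(high), len(mod))  # same count per bin keeps subsets balanced
        high_idx += list(high.index[:k])
        mod_idx += list(mod.index[:k])
    return high_idx, mod_idx
```

Fine-tuning separately on the two matched subsets and comparing safety metrics is then the controlled test the referee asked for.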
Referee: [§4.3 (Robustness and Generalization)] No experiment tests whether the same high-gradient filtering rule preserves (or harms) performance when the safety metric is replaced by an unrelated downstream task. This is needed to confirm that the selection selectively protects alignment rather than simply discarding informative or difficult samples in general.
Authors: We agree that a direct test of selectivity is valuable. In the revised §4.3 we report a new experiment in which the safety evaluation is replaced by performance on a held-out unrelated downstream task (a 5k-example subset of MMLU). Applying the identical high-gradient filtering rule yields task performance that is statistically indistinguishable from or slightly better than standard fine-tuning, while the same rule continues to protect safety when safety metrics are restored. This supports that the method does not indiscriminately remove informative samples and is particularly effective for alignment preservation. revision: yes
Circularity Check
No circularity: empirical heuristic with independent experimental validation
full rationale
The paper proposes a gradient-magnitude filter as a practical heuristic for continual fine-tuning, justified by direct empirical observations on sample contributions rather than any closed-form derivation. No equations reduce a claimed prediction to fitted inputs by construction, no self-citation chain supplies the core premise, and the selection rule is not tautological with the reported performance metrics. The method is presented as data-centric and externally testable across models and benchmarks, satisfying the criteria for a self-contained, non-circular contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The gradient norm of a sample with respect to the model parameters indicates the degree to which that sample will shift the model away from its current safety-aligned state.