SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Pith reviewed 2026-05-22 22:28 UTC · model grok-4.3
The pith
Selective layer merging based on cosine similarity restores safety in fine-tuned LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SafeMERGE merges layers of the fine-tuned model with the corresponding layers of the safety-aligned model only when they deviate from safe behavior, as measured by a cosine similarity criterion. Across four LLMs and several tasks, it consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility.
What carries the argument
Cosine similarity criterion applied layer by layer to decide whether to merge parameters from the safety-aligned model into the fine-tuned model.
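As a rough illustration of this layer-by-layer decision, the sketch below compares each fine-tuned layer to its safety-aligned counterpart and interpolates only the layers whose similarity falls below a threshold. It is not the authors' implementation: representing each layer as a flattened weight vector, the threshold `tau`, the interpolation weight `alpha`, and the function names are all assumptions made for illustration, and the paper's exact criterion may compare different quantities.

```python
import math
from typing import Dict, List, Tuple

Layer = List[float]  # one layer's weights, flattened (illustrative assumption)


def cosine_similarity(a: Layer, b: Layer) -> float:
    """Cosine of the angle between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as maximally dissimilar
    return dot / (norm_a * norm_b)


def selective_merge(
    fine_tuned: Dict[str, Layer],
    safe: Dict[str, Layer],
    tau: float = 0.9,    # hypothetical similarity threshold
    alpha: float = 0.5,  # hypothetical weight kept from the fine-tuned layer
) -> Tuple[Dict[str, Layer], List[str]]:
    """Merge only the layers whose similarity to the safe model is below tau."""
    merged: Dict[str, Layer] = {}
    merged_names: List[str] = []
    for name, ft_w in fine_tuned.items():
        safe_w = safe[name]
        if cosine_similarity(ft_w, safe_w) < tau:
            # Layer has drifted: linearly interpolate toward the safe weights.
            merged[name] = [alpha * f + (1.0 - alpha) * s for f, s in zip(ft_w, safe_w)]
            merged_names.append(name)
        else:
            # Layer is still close to the safe model: keep it unchanged.
            merged[name] = list(ft_w)
    return merged, merged_names


if __name__ == "__main__":
    ft = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
    ref = {"layer0": [1.0, 0.0], "layer1": [1.0, 0.0]}
    weights, touched = selective_merge(ft, ref)
    print(touched, weights)
```

In this toy input only `layer1` is orthogonal to its safe counterpart, so it is the only layer that gets interpolated; `layer0` passes through untouched, which is the property that limits changes to the drifted parts of the model.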
If this is right
- Harmful outputs decrease across multiple LLMs and downstream tasks without new training.
- Task utility stays the same or rises compared with the fine-tuned model alone.
- The defense works after any fine-tuning step and requires no changes to the original training process.
- Layer-wise selection limits changes to only the parts of the model that have shifted away from safe behavior.
Where Pith is reading between the lines
- The same selective merge could be tried on other unintended side effects of fine-tuning, such as reduced factual consistency.
- Running the cosine check on models fine-tuned with larger or more adversarial datasets would test how far the trigger generalizes.
- Combining the merge step with existing safety filters might create a two-stage guard without much added cost.
Load-bearing premise
Cosine similarity between layers of the fine-tuned model and the safety-aligned model reliably flags the layers that have lost safe behavior.
What would settle it
If SafeMERGE, applied to a fine-tuned model, left harmful-response rates on safety benchmarks as high as those of the unmodified fine-tuned model, the central claim would be disproved.
Original abstract
Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but often introduce custom algorithms that are difficult to implement or compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that restores safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across four LLMs and several tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective, layer-wise merging offers a robust safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple yet effective post-fine-tuning defense.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SafeMERGE, a lightweight post-fine-tuning framework that selectively merges layers of a fine-tuned LLM with those of a safety-aligned model only when cosine similarity indicates deviation from safe behavior. It claims this restores refusal on harmful prompts more effectively than prior defenses across four LLMs and multiple tasks while preserving or improving downstream utility.
Significance. If the central mechanism is shown to be safety-specific rather than a generic averaging effect, the result would be significant: it supplies a simple, implementation-light safeguard against safety erosion that avoids custom algorithms and large utility costs. The selective layer-wise approach could be broadly applicable if supported by targeted validation.
major comments (2)
- [Abstract] The claim that cosine similarity between layers 'identifies deviation from safe behavior' and that selective merging 'restores safety' is load-bearing for the central result, yet the manuscript provides no layer-wise safety probes, refusal-circuit ablations, or causal interventions demonstrating that the chosen layers control harmful outputs rather than merely averaging any drifted weights.
- [Abstract] Reported reductions in harmful outputs are stated without accompanying details on baselines, exact metrics (e.g., harm score definitions), statistical tests, or data exclusion criteria, preventing verification that the superiority claim is supported by the experimental design.
minor comments (1)
- The cosine similarity threshold is described as a free parameter; the manuscript should report its selection procedure and sensitivity analysis.
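One inexpensive ingredient of such a sensitivity analysis is to record which layers each candidate threshold would select for merging, before running any safety or utility benchmark. The sketch below does only that bookkeeping step. It assumes, purely for illustration, that layers are flattened weight vectors and that a layer is selected when its similarity is strictly below the threshold; the function names and the toy weights are hypothetical and do not come from the manuscript.

```python
import math
from typing import Dict, List, Sequence

Layer = List[float]  # one layer's weights, flattened (illustrative assumption)


def cosine_similarity(a: Layer, b: Layer) -> float:
    """Cosine of the angle between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as maximally dissimilar
    return dot / (norm_a * norm_b)


def layers_selected_per_threshold(
    fine_tuned: Dict[str, Layer],
    safe: Dict[str, Layer],
    thresholds: Sequence[float],
) -> Dict[float, List[str]]:
    """For each threshold, list the layers whose similarity falls below it."""
    # Similarities are computed once and reused for every threshold.
    sims = {name: cosine_similarity(w, safe[name]) for name, w in fine_tuned.items()}
    return {
        tau: sorted(name for name, sim in sims.items() if sim < tau)
        for tau in thresholds
    }


if __name__ == "__main__":
    ft = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
    ref = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [1.0, 0.0]}
    for tau, names in layers_selected_per_threshold(ft, ref, [0.0, 0.5, 0.8]).items():
        print(tau, names)
```

A full sensitivity analysis would pair each row of this table with the harm and utility scores measured after merging at that threshold; those evaluations are outside the scope of this sketch.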
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.
point-by-point responses
- Referee: [Abstract] The claim that cosine similarity between layers 'identifies deviation from safe behavior' and that selective merging 'restores safety' is load-bearing for the central result, yet the manuscript provides no layer-wise safety probes, refusal-circuit ablations, or causal interventions demonstrating that the chosen layers control harmful outputs rather than merely averaging any drifted weights.
  Authors: We agree this is a substantive point. The current manuscript justifies the cosine similarity criterion through empirical results showing that selective merging outperforms both full-model merging and non-selective baselines across four LLMs, which would be unexpected under pure generic averaging. However, we lack explicit layer-wise safety probes or causal ablations. In revision we will add a dedicated analysis section that (i) reports per-layer refusal rates on held-out harmful prompts before and after merging and (ii) compares against random layer selection controls to test whether the similarity threshold isolates safety-relevant layers.
  Revision: yes
- Referee: [Abstract] Reported reductions in harmful outputs are stated without accompanying details on baselines, exact metrics (e.g., harm score definitions), statistical tests, or data exclusion criteria, preventing verification that the superiority claim is supported by the experimental design.
  Authors: The abstract is intentionally concise, but the full paper (Sections 4 and 5) specifies the baselines (SafeEdit, Safety-FT, full merging), harm metrics (HarmBench and AdvBench attack success rates with exact scoring rubrics), statistical tests (paired t-tests with p<0.05 reported), and data exclusion rules. We will revise the abstract to include a one-sentence definition of the primary harm metric and note that all comparisons use the same evaluation protocol with reported significance levels.
  Revision: partial
Circularity Check
No circularity; method uses external safety-aligned reference and standard metric with empirical validation.
full rationale
The paper defines SafeMERGE via direct comparison (cosine similarity) to an independently provided safety-aligned model, then reports empirical harm reduction on held-out tasks. No equations reduce a claimed prediction to a fitted parameter on the same data, no self-citation chain justifies the core criterion, and the safety claim rests on external benchmarks rather than self-referential definitions or renamings. The derivation chain is therefore self-contained against external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- cosine similarity threshold
axioms (1)
- domain assumption: Cosine similarity between corresponding layers of fine-tuned and safety-aligned models measures deviation from safe behavior.
Forward citations
Cited by 2 Pith papers
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
  Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
- Robust Policy Optimization to Prevent Catastrophic Forgetting
  FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.