SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

Aladin Djuhera; Farhan Ahmed; Holger Boche; Swanand Ravindra Kadhe; Syed Zawad

arxiv: 2503.17239 · v3 · submitted 2025-03-21 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-22 22:28 UTC · model grok-4.3


keywords: LLM safety, model merging, fine-tuning, safety alignment, cosine similarity, post-fine-tuning defense, harmful outputs


The pith

Selective layer merging based on cosine similarity restores safety in fine-tuned LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-tuning LLMs for specific tasks often reduces their resistance to harmful prompts. SafeMERGE addresses this by comparing each layer of the fine-tuned model with the corresponding layer of the original safety-aligned model and merging only those layers that the comparison flags as having deviated from safe behavior. The method uses a simple cosine similarity check to decide which layers to merge, avoiding the need for new training or complex realignment procedures. If the approach holds, it provides a lightweight post-fine-tuning step that limits harmful outputs while leaving task performance intact or even improved.
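The decision rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the abstract does not say which tensors are compared or how flagged layers are combined, so the flat weight vectors, the threshold `tau`, the interpolation weight `alpha`, and the function names are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flat vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar
    return dot / (norm_a * norm_b)


def selective_merge(finetuned, aligned, tau=0.9, alpha=0.5):
    """Merge only the layers whose similarity falls below tau.

    finetuned, aligned: dicts mapping layer name -> flat list of floats.
    tau:   hypothetical similarity threshold (not taken from the paper).
    alpha: hypothetical weight on the fine-tuned layer when interpolating.
    Returns (merged_layers, names_of_merged_layers).
    """
    merged, flagged = {}, []
    for name, ft in finetuned.items():
        safe = aligned[name]
        if cosine_similarity(ft, safe) < tau:
            # Flagged layer: linear interpolation (an assumed merge rule).
            merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(ft, safe)]
            flagged.append(name)
        else:
            # Unflagged layer: keep the fine-tuned weights untouched.
            merged[name] = list(ft)
    return merged, flagged


# Toy two-layer "models" with made-up weights, for illustration only.
finetuned = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
aligned = {"layer0": [1.0, 0.1], "layer1": [1.0, 0.0]}
merged, flagged = selective_merge(finetuned, aligned, tau=0.9, alpha=0.5)
print(flagged)
```

In this toy case only `layer1` falls below the threshold, so `layer0` keeps its fine-tuned weights. Leaving unflagged layers untouched is what separates a selective merge from blanket averaging of the two models.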

Core claim

SafeMERGE selectively merges fine-tuned with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across four LLMs and several tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility.

What carries the argument

Cosine similarity criterion applied layer by layer to decide whether to merge parameters from the safety-aligned model into the fine-tuned model.

If this is right

Harmful outputs decrease across multiple LLMs and downstream tasks without new training.
Task utility stays the same or rises compared with the fine-tuned model alone.
The defense works after any fine-tuning step and requires no changes to the original training process.
Layer-wise selection limits changes to only the parts of the model that have shifted away from safe behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective merge could be tried on other unintended side effects of fine-tuning, such as reduced factual consistency.
Running the cosine check on models fine-tuned with larger or more adversarial datasets would test how far the trigger generalizes.
Combining the merge step with existing safety filters might create a two-stage guard without much added cost.

Load-bearing premise

Cosine similarity between layers of the fine-tuned model and the safety-aligned model reliably flags the layers that have lost safe behavior.

What would settle it

A direct test showing that SafeMERGE applied to a fine-tuned model produces harmful response rates on safety benchmarks that remain as high as those of the unmodified fine-tuned model would disprove the central claim.

Original abstract

Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but often introduce custom algorithms that are difficult to implement or compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that restores safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across four LLMs and several tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective, layer-wise merging offers a robust safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple yet effective post-fine-tuning defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SafeMERGE gives a lightweight selective merge fix for safety loss after fine-tuning, but cosine similarity does not clearly isolate safety-specific layers.


SafeMERGE is a post-fine-tuning method that restores safety alignment by selectively merging layers of the fine-tuned model with those of the safety-aligned model, using a cosine similarity threshold to decide which layers to merge. The approach is lightweight and claims to preserve or improve utility.

Its main strength is that it avoids the custom retraining algorithms that other safety realignment methods often require. The abstract reports results across four LLMs in which harmful outputs drop while task performance holds up, which addresses a common practical problem in deploying fine-tuned models. The selective merging based on layer deviation is the novel part, distinguishing it from blanket merging techniques.

The soft spot is the justification for the merging criterion. Cosine similarity detects parameter changes but does not specifically isolate safety-related deviations from other task adaptations. The paper would be stronger with evidence such as layer ablations or safety probes confirming that the selected layers drive the refusal behavior. Judging from the abstract, this concern stands, since no such validation is described there. Results are presented as consistent, but without full details of the experimental setup it is difficult to gauge the strength of the evidence.

This paper is for applied researchers and engineers dealing with LLM fine-tuning and safety; readers looking for easy-to-implement defenses will get the most value. It deserves serious referee attention because the problem is relevant and the method is accessible, though the mechanism claims require closer examination.

Recommendation: send it for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SafeMERGE, a lightweight post-fine-tuning framework that selectively merges layers of a fine-tuned LLM with those of a safety-aligned model only when cosine similarity indicates deviation from safe behavior. It claims this restores refusal on harmful prompts more effectively than prior defenses across four LLMs and multiple tasks while preserving or improving downstream utility.

Significance. If the central mechanism is shown to be safety-specific rather than a generic averaging effect, the result would be significant: it supplies a simple, implementation-light safeguard against safety erosion that avoids custom algorithms and large utility costs. The selective layer-wise approach could be broadly applicable if supported by targeted validation.

major comments (2)

[Abstract] The claim that cosine similarity between layers 'identifies deviation from safe behavior' and that selective merging 'restores safety' is load-bearing for the central result, yet the manuscript provides no layer-wise safety probes, refusal-circuit ablations, or causal interventions demonstrating that the chosen layers control harmful outputs rather than merely averaging any drifted weights.
[Abstract] Reported reductions in harmful outputs are stated without accompanying details on baselines, exact metrics (e.g., harm score definitions), statistical tests, or data exclusion criteria, preventing verification that the superiority claim is supported by the experimental design.

minor comments (1)

The cosine similarity threshold is described as a free parameter; the manuscript should report its selection procedure and sensitivity analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.


Referee: [Abstract] The claim that cosine similarity between layers 'identifies deviation from safe behavior' and that selective merging 'restores safety' is load-bearing for the central result, yet the manuscript provides no layer-wise safety probes, refusal-circuit ablations, or causal interventions demonstrating that the chosen layers control harmful outputs rather than merely averaging any drifted weights.

Authors: We agree this is a substantive point. The current manuscript justifies the cosine similarity criterion through empirical results showing that selective merging outperforms both full-model merging and non-selective baselines across four LLMs, which would be unexpected under pure generic averaging. However, we lack explicit layer-wise safety probes or causal ablations. In revision we will add a dedicated analysis section that (i) reports per-layer refusal rates on held-out harmful prompts before and after merging and (ii) compares against random layer selection controls to test whether the similarity threshold isolates safety-relevant layers. (Revision: yes)
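A random layer selection control of the kind mentioned in (ii) could be set up as sketched below. This is an illustration under stated assumptions: the helper name, the seeded sampling, and the layer names are hypothetical and do not come from the manuscript. The idea is to merge the same number of layers as the criterion flagged, but chosen uniformly at random, so that any safety gain beyond this control can be attributed to the selection rule rather than to merging as such.

```python
import random


def random_layer_control(layer_names, num_flagged, seed=0):
    """Pick num_flagged layers uniformly at random, without replacement.

    layer_names: iterable of all layer names in the model.
    num_flagged: how many layers the similarity criterion selected.
    seed:        fixed seed so the control set is reproducible.
    """
    names = list(layer_names)
    if num_flagged > len(names):
        raise ValueError("cannot sample more layers than the model has")
    rng = random.Random(seed)
    return sorted(rng.sample(names, num_flagged))


# Hypothetical layer names; a real model would supply its own.
all_layers = [f"layer{i}" for i in range(8)]
control = random_layer_control(all_layers, num_flagged=3, seed=0)
print(control)
```

Repeating the draw with several seeds and averaging the resulting harm scores would give a baseline distribution rather than a single arbitrary comparison point.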
Referee: [Abstract] Reported reductions in harmful outputs are stated without accompanying details on baselines, exact metrics (e.g., harm score definitions), statistical tests, or data exclusion criteria, preventing verification that the superiority claim is supported by the experimental design.

Authors: The abstract is intentionally concise, but the full paper (Sections 4 and 5) specifies the baselines (SafeEdit, Safety-FT, full merging), harm metrics (HarmBench and AdvBench attack success rates with exact scoring rubrics), statistical tests (paired t-tests with p<0.05 reported), and data exclusion rules. We will revise the abstract to include one-sentence definitions of the primary harm metric and note that all comparisons use the same evaluation protocol with reported significance levels. (Revision: partial)
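For readers unfamiliar with the paired t-test mentioned in the response, the statistic can be computed as sketched below. The numbers are placeholders invented purely to exercise the function; they are not results from the paper, and the function name is hypothetical. The sketch stops at the t statistic and degrees of freedom, because converting these to a p-value needs the t distribution, which the Python standard library does not provide (SciPy's `scipy.stats.ttest_rel` does the full test).

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired samples.

    before, after: equal-length lists of scores measured on the same
    items, e.g. attack success rates per benchmark subset.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# PLACEHOLDER values, not data from the paper.
before = [0.30, 0.25, 0.40, 0.35]
after = [0.10, 0.05, 0.15, 0.20]
t, dof = paired_t_statistic(before, after)
print(round(t, 2), dof)
```

A paired design fits this setting because each defense is evaluated on the same prompts, so per-item differences remove variation that comes from prompt difficulty.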

Circularity Check

0 steps flagged

No circularity; method uses external safety-aligned reference and standard metric with empirical validation.


The paper defines SafeMERGE via direct comparison (cosine similarity) to an independently provided safety-aligned model, then reports empirical harm reduction on held-out tasks. No equations reduce a claimed prediction to a fitted parameter on the same data, no self-citation chain justifies the core criterion, and the safety claim rests on external benchmarks rather than self-referential definitions or renamings. The derivation chain is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that cosine similarity serves as a valid proxy for safety deviation. The one free parameter, the cosine similarity threshold, is not given a value in the abstract, and no invented entities are introduced.

free parameters (1)

cosine similarity threshold
Used to decide which layers deviate enough to warrant merging; value and selection method not specified in abstract.
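Because the threshold is the method's one tunable knob, a simple sensitivity check is to sweep it and count how many layers it would flag at each value. The sketch below assumes per-layer similarities have already been computed; the similarity values, the threshold grid, and the function name are hypothetical.

```python
def flagged_counts(similarities, thresholds):
    """Map each threshold to the number of layers it would flag.

    similarities: dict mapping layer name -> cosine similarity in [-1, 1].
    thresholds:   iterable of candidate threshold values.
    A layer counts as flagged when its similarity is strictly below
    the threshold (an assumed convention).
    """
    return {
        tau: sum(1 for s in similarities.values() if s < tau)
        for tau in thresholds
    }


# Made-up similarities for three layers, for illustration only.
similarities = {"layer0": 0.99, "layer1": 0.80, "layer2": 0.60}
counts = flagged_counts(similarities, [0.5, 0.7, 0.9, 1.0])
print(counts)
```

With these toy values the count rises from zero layers at 0.5 to all three at 1.0. A curve like this, reported next to harm and utility scores, would show how strongly the outcome depends on the chosen value.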

axioms (1)

domain assumption: Cosine similarity between corresponding layers of fine-tuned and safety-aligned models measures deviation from safe behavior
Invoked to trigger selective merging; central to the method's decision rule.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
cs.AI · 2026-04 · unverdicted · novelty 6.0

Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

Robust Policy Optimization to Prevent Catastrophic Forgetting
cs.LG · 2026-02 · unverdicted · novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.