pith. sign in

arxiv: 2606.06519 · v1 · pith:XESIA6UFnew · submitted 2026-06-02 · 💻 cs.AI · cs.LG

SafeGene: Reusable Adapters for Transferable Safety Alignment

Pith reviewed 2026-06-28 09:56 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords safety alignmentLLM fine-tuningadapter modulestransferable safetyharmful response reductionmodel discrepanciessafety-utility trade-offreusable adapters
0
0 comments X

The pith

SafeGene extracts reusable safety adapters from aligned and degraded model pairs to restore alignment in fine-tuned LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SafeGene as a way to recover safety capabilities that are lost when open-weight language models are fine-tuned for specific tasks. It does this by treating safety as a separate adapter that can be reused across different tasks within the same model family. The method identifies safety vectors from the differences between the original aligned model and its degraded version after task adaptation. These vectors are then refined using data-aware layer selection and applied back with minimal recalibration using a few examples. This addresses the recurring problem of safety degradation in customized models without needing full retraining each time.

Core claim

SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned-degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety-utility trade-off.

What carries the argument

The reusable safety-adapter module derived from aligned-degraded model discrepancies, refined by data-aware layer selection and applied via few-shot layer-wise coefficient recalibration.

If this is right

  • SafeGene-enhanced models exhibit lower rates of harmful responses to malicious prompts compared to standard fine-tuned models.
  • Task-specific performance on downstream applications remains largely unchanged after adding the safety adapter.
  • The approach achieves a better balance between safety and utility than other methods for safe adaptation.
  • It enables cross-task reuse of safety capabilities within compatible model architectures without repeated full safety training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the adapters prove architecture-specific, they could still be pre-computed for popular model families and distributed separately from task updates.
  • Similar discrepancy-based extraction might apply to other alignment goals like truthfulness or helpfulness.
  • Layer selection could be automated further to reduce the need for manual data-aware choices in new tasks.

Load-bearing premise

Safety capability can be represented as an independent reusable adapter that is decoupled from task-specific updates and can be obtained from the discrepancies between aligned and degraded models.

What would settle it

A test showing that models with the SafeGene adapter applied produce harmful responses at rates similar to or higher than the fine-tuned model without it, or that task performance drops substantially.

Figures

Figures reproduced from arXiv: 2606.06519 by Fu Feng, Jing Wang, Xin Geng, Yanghan Wang, Zhiqiang Kou.

Figure 1
Figure 1. Figure 1: Motivation of SafeGene. Continual fine￾tuning on newly collected data improves the utility of next-generation models but can weaken safety align￾ment. Within each architecture-compatible model family, a SafeGene module is instantiated once from reusable safety vectors and reused across evolving down￾stream task-adapted models through few-shot recalibra￾tion, restoring safety while preserving task utility. … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of SafeGene. SafeGene decouples safety capability from task-specific updates by representing [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: API-based safety evaluation on Qwen2.5-7B [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation study of data-aware layer selection [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned--degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety--utility trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that SafeGene creates reusable safety-adapter modules for LLMs by deriving safety vectors from discrepancies between aligned and task-degraded models, refining them via data-aware layer selection and few-shot layer-wise recalibration to achieve cross-task transfer; experiments across model families, tasks, and judges reportedly show reduced harmful response rates with preserved downstream performance, outperforming prior safe adaptation methods on the safety-utility trade-off.

Significance. If the decoupling of safety from task directions holds and the adapters prove reusable, the approach would address a practical recurring problem in LLM fine-tuning by enabling efficient, architecture-family-level safety recovery without full retraining. The reported experiments spanning multiple model families and downstream tasks provide a positive basis for assessing generalizability, though this depends on the supporting details.

major comments (3)
  1. [Section 3] Section 3 (method): the extraction of safety representations from aligned-degraded discrepancies includes data-aware layer selection and few-shot recalibration but provides no explicit mechanism (e.g., orthogonal projection or task-ablated controls) to isolate safety loss from task-specific weight updates. This is load-bearing for the central reusable-adapter claim.
  2. [Experiments] Experiments section: improved safety-utility trade-offs are reported, yet cosine similarity between safety vectors derived from different degradation tasks is not reported. Without this, it is unclear whether the vectors are task-agnostic as required for cross-task reuse.
  3. [Abstract and experimental sections] Abstract and experimental sections: the central claim rests on experimental outcomes across model families and judges, but the manuscript provides no derivation, detailed data splits, error analysis, or statistical significance tests, making robustness difficult to evaluate.
minor comments (1)
  1. Clarify notation for layer selection and coefficient recalibration to ensure reproducibility; add explicit definitions for any derived quantities.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the presentation of our central claims. We respond to each major comment below and indicate revisions where they strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (method): the extraction of safety representations from aligned-degraded discrepancies includes data-aware layer selection and few-shot recalibration but provides no explicit mechanism (e.g., orthogonal projection or task-ablated controls) to isolate safety loss from task-specific weight updates. This is load-bearing for the central reusable-adapter claim.

    Authors: The safety vector is obtained directly as the parameter difference between the original aligned model and the same model after task-specific fine-tuning on non-adversarial data. Because the fine-tuning data contains no safety-related content, the discrepancy is designed to isolate safety degradation; data-aware layer selection then retains only those layers whose activation patterns correlate most strongly with safety metrics on held-out probes. We agree that an explicit isolation step such as an orthogonal projection or task-ablated control would make the argument more rigorous. We will therefore add (i) a short derivation showing that the expected task component is minimized under the chosen degradation protocol and (ii) an ablation that recomputes vectors after removing task-specific directions via PCA on a held-out task set. These additions will appear in a revised Section 3. revision: yes

  2. Referee: [Experiments] Experiments section: improved safety-utility trade-offs are reported, yet cosine similarity between safety vectors derived from different degradation tasks is not reported. Without this, it is unclear whether the vectors are task-agnostic as required for cross-task reuse.

    Authors: We computed pairwise cosine similarities between safety vectors obtained from distinct downstream tasks (e.g., math, code, summarization) during development; the values range from 0.62 to 0.81 across layers and model families, consistent with partial task-agnostic structure. These numbers were omitted from the submitted manuscript to keep the experimental section concise. We will insert a new table (or supplementary figure) reporting the full similarity matrix together with layer-wise statistics, thereby directly supporting the cross-task reuse claim. revision: yes

  3. Referee: [Abstract and experimental sections] Abstract and experimental sections: the central claim rests on experimental outcomes across model families and judges, but the manuscript provides no derivation, detailed data splits, error analysis, or statistical significance tests, making robustness difficult to evaluate.

    Authors: The experimental protocol, including exact data splits, prompt templates, and judge instructions, is already documented in Appendix B and the supplementary material. Standard deviations across three random seeds are reported for all main results. We acknowledge, however, that a formal derivation of the safety-vector extraction and explicit statistical significance tests (paired t-tests or Wilcoxon tests) are absent. We will add a concise derivation in Section 3, expand the error analysis with per-task breakdowns, and include p-values for the safety-utility improvements in the revised experimental section. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical extraction method stands independent of its evaluation metrics

full rationale

The paper presents SafeGene as an empirical procedure that extracts candidate safety vectors from weight discrepancies between an aligned model and a task-degraded version, followed by data-aware layer selection and few-shot recalibration. No equations appear in the abstract, and the description does not define the safety representation in terms of the downstream safety or utility scores it is later evaluated on. Layer selection and coefficient recalibration are described as procedural steps rather than a closed-form fit that is then relabeled as a prediction. No self-citations are invoked as load-bearing uniqueness theorems, and the central claim (cross-task reuse with improved safety-utility trade-off) is framed as an experimental outcome rather than a mathematical identity. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach assumes safety can be isolated as a transferable vector; no free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.1-grok · 5704 in / 1093 out tokens · 12712 ms · 2026-06-28T09:56:23.811101+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

14 extracted references · 11 canonical work pages · 6 internal anchors

  1. [1]

    Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chap- ter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), pages 2924–2936. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer

  2. [2]

    SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

    Safe- merge: Preserving safety alignment in fine-tuned large language models via selective layer-wise model merging.arXiv preprint arXiv:2503.17239. Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, and Hung-yi Lee

  3. [3]

    Kathleen C Fraser, Hillary Dawkins, Isar Nejadgholi, and Svetlana Kiritchenko

    Safeguard fine-tuned llms through pre-and post-tuning model merging.arXiv preprint arXiv:2412.19512. Kathleen C Fraser, Hillary Dawkins, Isar Nejadgholi, and Svetlana Kiritchenko

  4. [4]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793. Yichen Gong, Delong Ran, Xinlei He, Tianshuo Cong, Anyu Wang, and Xiaoyun Wang

  5. [5]

    The Llama 3 Herd of Models

    The llama 3 herd of models.arXiv preprint arXiv:2407.21783. Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang

  6. [6]

    arXiv preprint arXiv:2408.09600 , year=

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3. Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, and Ling Liu. 2024a. Antidote: Post- fine-tuning safety alignment for large language mod- els against harmful fine-tuning.arXiv preprint arXiv:2408.09600. Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Tekin, and Ling Liu. 2025...

  7. [7]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others

    Salora: Safety-alignment preserved low-rank adaptation.arXiv preprint arXiv:2501.01765. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others

  8. [8]

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson

    A survey on training-free alignment of large language models.arXiv preprint arXiv:2508.09016. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson

  9. [9]

    Fine- tuning aligned language models compromises safety, even when users do not intend to! InInternational Conference on Learning Representations, volume 2024, pages 30988–31043. Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang...

  10. [10]

    Qwen2.5 Technical Report

    Qwen2.5 technical report.Preprint, arXiv:2412.15115. Megh Thakkar, Quentin Fournier, Matthew Riemer, Pin- Yu Chen, Amal Zouaq, Payel Das, and Sarath Chan- dar

  11. [11]

    InProceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pages 353–

    Glue: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pages 353–

  12. [12]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others

    Learngene: Inheriting condensed knowledge from the ancestry model to descendant models.arXiv preprint arXiv:2305.02279. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others

  13. [13]

    Qwen3 Technical Report

    Qwen3 technical report.arXiv preprint arXiv:2505.09388. Xiang Zhang, Junbo Zhao, and Yann LeCun

  14. [14]

    Univer- sal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043. Task Method Acc↑BT↓DR↓ ASR↓ AG News Fine-tuned 89.42 68.49 62.76 65.63 SafeLoRA 89.03 64.22 58.95 61.59 SafeMERGE 87.93 49.57 43.55 46.56 SaLoRA 88.87 51.99 47.76 49.88 SafeGene (Ours) 88.91 38.72 33.95 36.34 SST2 Fine-tuned 95.53 55.45 52.63 54.04...