DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Jie Pan; Jinbiao Zhu; Junbo Zhang; Qianli Zhou; Wen Jiang; Xinyang Deng

arxiv: 2606.00160 · v1 · pith:YQTEBTQAnew · submitted 2026-05-29 · 💻 cs.CR · cs.AI· cs.CL

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Junbo Zhang , Qianli Zhou , Xinyang Deng , Wen Jiang , Jie Pan , Jinbiao Zhu This is my paper

Pith reviewed 2026-06-28 22:07 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL

keywords LLM safetydata filteringbenign fine-tuningcompliance vectorsafety degradationinstruction tuningprojection shiftdata-centric defense

0 comments

The pith

Benign fine-tuning degrades LLM safety by increasing response compliance, which DataShield measures per sample to filter the riskiest data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models lose safety alignment even when fine-tuned only on helpful, non-harmful data. DataShield addresses this by scoring every training example according to how much it increases the model's tendency to comply with instructions. The scoring relies on a compliance vector extracted from model activations, an automatic choice of the most safety-critical layer, and a measure of how far each sample shifts the model along that vector. Experiments on multiple models and datasets show the method can separate high-risk from low-risk subsets with less computation and noise than earlier approaches. This data-centric filtering could let developers keep the usefulness of instruction tuning while limiting the safety cost.

Core claim

DataShield identifies potential safety-degrading samples in benign datasets by quantifying each sample's contribution to the model's overall compliance behavior. It does so through three components: Compliance Vector Extraction to capture compliance tendency, Compliance-Aware Score to find the optimal safety-critical layer, and Safety-degrading Sample Filtering based on projection shifts along the compliance direction. This allows efficient identification of high-risk samples without the high costs or noise of previous methods.

What carries the argument

The compliance vector, which represents the direction of increased response compliance in the model's activations, combined with Compliance-Aware Score selection of the optimal layer and projection-shift scoring of each sample.

Load-bearing premise

The rise in response compliance during benign fine-tuning is the main measurable driver of safety degradation and can be isolated to a single vector in one optimal layer.

What would settle it

An experiment in which models fine-tuned on the high-score subset show no greater safety degradation than models fine-tuned on a random subset of equal size.

Figures

Figures reproduced from arXiv: 2606.00160 by Jie Pan, Jinbiao Zhu, Junbo Zhang, Qianli Zhou, Wen Jiang, Xinyang Deng.

**Figure 2.** Figure 2: Two-dimensional projection of harmfulness perception and compliance [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Layer-wise cosine similarity analysis using response-token-averaged [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Layer-level projection scores on DirectHarm4 based on the final input [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 6.** Figure 6: The overall architecture of the DataShield framework. It consists of three main phases: (a) extracting the compliance direction vector via paired [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 9.** Figure 9: Attack success rate (ASR) of Qwen2.5-7B-Instruct fine-tuned on data [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 7.** Figure 7: Compliance-aware scores across all layers for three aligned LLMs. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 10.** Figure 10: Attack success rate (ASR) of Llama-3.1-8B-Instruct fine-tuned [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 8.** Figure 8: Attack Success Rate (ASR) of Llama3-8B fine-tuned on data selected [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 11.** Figure 11: Correlation analysis between the compliance projection difference and the dataset compliance score evaluated by GPT-4 across three LLMs. The [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 12.** Figure 12: Response token length distribution of high-risk (top) and low [PITH_FULL_IMAGE:figures/full_fig_p010_12.png] view at source ↗

**Figure 15.** Figure 15: Token-wise Kullback-Leibler (KL) divergence between benign fine [PITH_FULL_IMAGE:figures/full_fig_p011_15.png] view at source ↗

**Figure 14.** Figure 14: Harmful topic distribution across three models on DirectHarm4, [PITH_FULL_IMAGE:figures/full_fig_p011_14.png] view at source ↗

read the original abstract

Large language models (LLMs) suffer from degraded safety capabilities even when fine-tuned with benign datasets. However, existing methods for identifying safety-degrading samples in benign datasets suffer from high computational costs and significant noise issues. In this paper, we propose DataShield to efficiently and effectively identify potential safety-degrading samples. Our key intuition is based on the observation that benign fine-tuning increases the overall response compliance of LLMs. DataShield's key technical insight is to quantify each sample's contribution to the model's compliance behavior as its safety degradation score. DataShield consists of three core components: (1) Compliance Vector Extraction, which captures the LLM's compliance behavior tendency; (2) a novel Compliance-Aware Score (CAS), which automatically identifies the optimal safety-critical layer; and (3) Safety-degrading Sample Filtering, which quantifies the projection shift of training data along the compliance direction. Extensive experimental evaluation on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B using the Alpaca and Dolly benign datasets validates our method's effectiveness in identifying high-risk and low-risk data subsets. We also observe that open-ended question answering is more likely to trigger safety degradation, and corresponding responses tend to be longer. We hope this work can provide new insights into data-centric defense methods. The source code is available at: https://github.com/ZJunBo/DataShield.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DataShield gives a workable filter for spotting benign samples that erode LLM safety via compliance shifts, but the single-vector assumption looks fragile without stronger controls.

read the letter

The paper's core offer is a scoring method that extracts a compliance vector from the model, picks a safety-critical layer with their CAS metric, then ranks samples by how far they shift projections along that direction. High-shift samples get filtered out before fine-tuning on things like Alpaca or Dolly. They run this on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B and report that the filtered subsets preserve safety better.

They release the code, which is useful, and they surface a couple of practical observations: open-ended questions and longer responses seem to drive more of the degradation. Those details line up with what people see in deployment.

The combination of vector extraction, automatic layer selection, and projection filtering is the new piece. Prior work on safety scoring exists, but this ties the score directly to measured compliance change during benign tuning.

The main weakness is the bet that safety loss is mostly a monotonic push along one compliance direction in one layer. If refusal or harm detection happens through separate circuits or across layers, the score will mis-rank samples that leave those intact. The abstract defines the score from the same compliance shifts it aims to prevent, which invites circularity unless the experiments cleanly separate the measurement from the outcome. No ablations shown here test whether removing the compliance component kills the predictive power.

This is for people who curate fine-tuning data or run alignment pipelines on open models. A reader who needs a concrete, runnable filter for this exact problem will find something to try.

It should go to peer review. The method is concrete, the code is out, and the problem is real even if the central mechanism needs more stress-testing.

Referee Report

2 major / 2 minor

Summary. The paper proposes DataShield for identifying safety-degrading samples in benign instruction-tuning datasets for LLMs. It observes that benign fine-tuning increases overall response compliance and quantifies each sample's contribution via a safety degradation score computed from Compliance Vector Extraction, a Compliance-Aware Score (CAS) that selects the optimal safety-critical layer, and projection-shift filtering along the compliance direction. Experiments on Llama-3-8B, Llama-3.1-8B and Qwen2.5-7B using Alpaca and Dolly datasets are said to show effective separation into high-risk and low-risk subsets; additional observations note that open-ended questions are more likely to trigger degradation and produce longer responses.

Significance. If the central claim holds and the compliance-vector approach reliably isolates safety degradation, the method would offer a computationally lighter data-centric defense compared with existing approaches, with potential to improve safety preservation during benign fine-tuning. The public release of source code strengthens reproducibility.

major comments (2)

[Abstract / Method description] The core assumption that safety degradation is primarily captured by monotonic projection shift along a single compliance vector extracted from one CAS-selected layer is load-bearing but untested against alternative mechanisms. If refusal circuitry, harm-detection heads, or multi-layer interactions contribute orthogonally, samples that shift the compliance projection while leaving those directions intact would receive incorrect scores, undermining the filtering claim. No ablation is described that removes the compliance component and verifies loss of predictive power for post-fine-tuning safety metrics.
[Abstract / Compliance Vector Extraction and Safety-degrading Sample Filtering] The safety degradation score is defined directly from observed compliance shifts during fine-tuning, creating a potential circularity: the metric used to filter data is itself derived from the compliance behavior the method seeks to mitigate. Without an independent safety evaluation (e.g., harmfulness benchmarks on held-out prompts before and after filtering) that is not reducible to the same compliance vector, it is unclear whether the identified subsets actually predict safety outcomes rather than merely re-expressing the compliance signal.

minor comments (2)

[Abstract] The abstract states that open-ended question answering is more likely to trigger degradation and yields longer responses; this observation should be supported by quantitative statistics (e.g., length distributions or trigger rates per category) in the results section.
[Method] Notation for the compliance vector, CAS computation, and projection shift should be introduced with explicit equations or pseudocode to allow readers to verify the claimed parameter-free or low-cost properties.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. Below we respond point-by-point to the major comments, indicating where revisions will be made.

read point-by-point responses

Referee: [Abstract / Method description] The core assumption that safety degradation is primarily captured by monotonic projection shift along a single compliance vector extracted from one CAS-selected layer is load-bearing but untested against alternative mechanisms. If refusal circuitry, harm-detection heads, or multi-layer interactions contribute orthogonally, samples that shift the compliance projection while leaving those directions intact would receive incorrect scores, undermining the filtering claim. No ablation is described that removes the compliance component and verifies loss of predictive power for post-fine-tuning safety metrics.

Authors: The method is motivated by the consistent empirical observation that benign fine-tuning increases overall response compliance, which in turn correlates with measured safety degradation across the evaluated models. The CAS procedure selects the layer whose compliance direction best predicts these shifts, providing a data-driven rather than arbitrary choice of representation. We agree that an explicit ablation comparing the full pipeline against a version that omits the compliance-projection component would strengthen the claim; we will add this experiment (and the corresponding analysis of predictive power on post-fine-tuning safety metrics) in the revision. revision: yes
Referee: [Abstract / Compliance Vector Extraction and Safety-degrading Sample Filtering] The safety degradation score is defined directly from observed compliance shifts during fine-tuning, creating a potential circularity: the metric used to filter data is itself derived from the compliance behavior the method seeks to mitigate. Without an independent safety evaluation (e.g., harmfulness benchmarks on held-out prompts before and after filtering) that is not reducible to the same compliance vector, it is unclear whether the identified subsets actually predict safety outcomes rather than merely re-expressing the compliance signal.

Authors: The degradation score is intentionally derived from compliance shifts because that is the observable mechanism the method targets. Nevertheless, the filtering decision is validated by independent downstream safety evaluations: after fine-tuning on the filtered versus unfiltered subsets, we measure harmfulness on held-out prompt sets using standard benchmarks that are not identical to the compliance-vector construction. These results show that high-risk subsets produce measurably larger safety degradation. We will expand the manuscript to explicitly separate the scoring procedure from the final safety metrics and to discuss the extent to which the two remain distinguishable. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core method defines a safety degradation score via compliance vector extraction and projection shift, motivated by the stated observation that benign fine-tuning increases compliance. No equations, self-citations, or steps are quoted that reduce the claimed identification of safety-degrading samples to a tautology or fitted input by construction. Experimental validation across models and datasets supplies independent content, and the derivation remains self-contained against external benchmarks rather than reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5804 in / 1012 out tokens · 22821 ms · 2026-06-28T22:07:11.481721+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 17 canonical work pages · 8 internal anchors

[1]

Removing rlhf protections in gpt-4 via fine-tuning,

Q. Zhan, R. Fang, R. Bindu, A. Gupta, T. B. Hashimoto, and D. Kang, “Removing rlhf protections in gpt-4 via fine-tuning,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 2024, pp. 681–687

2024
[2]

Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b,

P. Gade, S. Lermen, C. Rogers-Smith, and J. Ladish, “Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b,”arXiv preprint arXiv:2311.00117, 2023

work page arXiv 2023
[3]

I’m sorry

X. Yang, X. Wang, Q. Zhang, L. Petzold, W. Y . Wang, X. Zhao, and D. Lin, “Shadow alignment: The ease of subverting safely-aligned language models,”arXiv preprint arXiv:2310.02949, 2023

work page arXiv 2023
[4]

LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.arXiv preprint arXiv:2310.20624, 2023

S. Lermen, C. Rogers-Smith, and J. Ladish, “Lora fine-tuning effi- ciently undoes safety training in llama 2-chat 70b,”arXiv preprint arXiv:2310.20624, 2023

work page arXiv 2023
[5]

Fine-tuning aligned language models compromises safety, even when users do not intend to!

X. Qi, Y . Zeng, T. Xie, P.-Y . Chen, R. Jia, P. Mittal, and P. Henderson, “Fine-tuning aligned language models compromises safety, even when users do not intend to!” inInternational Conference on Learning Representations, vol. 2024, 2024, pp. 30 988–31 043

2024
[6]

arXiv preprint arXiv:2305.06972 , year=

J. Hazell, “Large language models can be used to effectively scale spear phishing campaigns,”arXiv preprint arXiv:2305.06972, p. 50, 2023

work page arXiv 2023
[7]

Ai model gpt-3 (dis) informs us better than humans,

G. Spitale, N. Biller-Andorno, and F. Germani, “Ai model gpt-3 (dis) informs us better than humans,”Science Advances, vol. 9, no. 26, p. eadh1850, 2023

2023
[8]

Can JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13 large language models democratize access to dual-use biotechnology?

E. H. Soice, R. Rocha, K. Cordova, M. Specter, and K. M. Esvelt, “Can JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13 large language models democratize access to dual-use biotechnology?” arXiv preprint arXiv:2306.03809, 2023

work page arXiv 2021
[9]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughanet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

arXiv preprint arXiv:2402.05044 , year=

L. Li, B. Dong, R. Wang, X. Hu, W. Zuo, D. Lin, Y . Qiao, and J. Shao, “Salad-bench: A hierarchical and comprehensive safety benchmark for large language models,”arXiv preprint arXiv:2402.05044, 2024

work page arXiv 2024
[11]

A holistic approach to undesired content detection in the real world,

T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng, “A holistic approach to undesired content detection in the real world,” inProceedings of the AAAI conference on artificial intelligence, vol. 37, no. 12, 2023, pp. 15 009–15 018

2023
[12]

What is in your safe data? identifying benign data that breaks safety,

L. He, M. Xia, and P. Henderson, “What is in your safe data? identifying benign data that breaks safety,”arXiv preprint arXiv:2404.01099, 2024

work page arXiv 2024
[13]

Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection,

H. Shen, P.-Y . Chen, P. Das, and T. Chen, “Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 31 243–31 264

2025
[14]

Layer- aware representation filtering: Purifying finetuning data to preserve llm safety alignment,

H. Li, L. Li, Z. Lu, X. Wei, R. Li, J. Shao, and L. Sha, “Layer- aware representation filtering: Purifying finetuning data to preserve llm safety alignment,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 8041–8061

2025
[15]

Gradsafe: Detecting jailbreak prompts for llms via safety-critical gradient analysis,

Y . Xie, M. Fang, R. Pi, and N. Gong, “Gradsafe: Detecting jailbreak prompts for llms via safety-critical gradient analysis,” inProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), 2024, pp. 507–518

2024
[16]

Assessing the brittleness of safety alignment via pruning and low-rank modifications,

B. Wei, K. Huang, Y . Huang, T. Xie, X. Qi, M. Xia, P. Mittal, M. Wang, and P. Henderson, “Assessing the brittleness of safety alignment via pruning and low-rank modifications,” inInternational Conference on Machine Learning. PMLR, 2024, pp. 52 588–52 610

2024
[17]

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

T. Huang, S. Hu, F. Ilhan, S. F. Tekin, and L. Liu, “Harmful fine- tuning attacks and defenses for large language models: A survey,”arXiv preprint arXiv:2409.18169, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation,

G. Liu, W. Lin, Q. Mu, T. Huang, R. Mo, Y . Tao, and L. Shen, “Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation,”IEEE Transactions on Information Forensics and Security, 2025

2025
[19]

Safety layers in aligned large language models: The key to llm security,

S. Li, L. Yao, L. Zhang, and Y . Li, “Safety layers in aligned large language models: The key to llm security,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 98 163–98 189

2025
[20]

Safe lora: The silver lining of reducing safety risks when finetuning large language models,

C.-Y . Hsu, Y .-L. Tsai, C.-H. Lin, P.-Y . Chen, C.-M. Yu, and C.-Y . Huang, “Safe lora: The silver lining of reducing safety risks when finetuning large language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 65 072–65 094, 2024

2024
[21]

Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack,

T. Huang, S. Hu, and L. Liu, “Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack,”Advances in Neural Information Processing Systems, vol. 37, pp. 74 058–74 088, 2024

2024
[22]

Sa- lora: Safety-alignment preserved low-rank adaptation,

M. Li, W. M. Si, M. Backes, Y . Zhang, and Y . Wang, “Sa- lora: Safety-alignment preserved low-rank adaptation,”arXiv preprint arXiv:2501.01765, 2025

work page arXiv 2025
[23]

Refining positive and toxic samples for dual safety self-alignment of llms with minimal human interventions,

J. Xu, G. Nan, S. Guan, S. Leng, Y . Liu, Z. Wang, Y . Ma, Z. Zhou, Y . Hou, and X. Tao, “Refining positive and toxic samples for dual safety self-alignment of llms with minimal human interventions,”IEEE Transactions on Information Forensics and Security, 2026

2026
[24]

Dia- logue injection attack: Jailbreaking llms through context manipulation,

W. Meng, F. Zhang, W. Yao, Z. Guo, Y . Li, C. Wei, and W. Chen, “Dia- logue injection attack: Jailbreaking llms through context manipulation,” IEEE Transactions on Information Forensics and Security, 2026

2026
[25]

Jailbroken: How does llm safety training fail?

A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?”Advances in neural information processing systems, vol. 36, pp. 80 079–80 110, 2023

2023
[26]

Representation Engineering: A Top-Down Approach to AI Transparency

A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowskiet al., “Representation en- gineering: A top-down approach to ai transparency,”arXiv preprint arXiv:2310.01405, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Programming refusal with con- ditional activation steering,

B. W. Lee, I. Padhi, K. Natesan Ramamurthy, E. Miehling, P. Dognin, M. Nagireddy, and A. Dhurandhar, “Programming refusal with con- ditional activation steering,” inInternational conference on learning representations, vol. 2025, 2025, pp. 90 960–90 985

2025
[28]

Refusal in language models is mediated by a single direction,

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda, “Refusal in language models is mediated by a single direction,”Advances in Neural Information Processing Systems, vol. 37, pp. 136 037–136 083, 2024

2024
[29]

Understanding and mitigating over-refusal for large language models via safety repre- sentation,

J. Zhang, R. Chen, Q. Zhou, X. Deng, and W. Jiang, “Understanding and mitigating over-refusal for large language models via safety repre- sentation,”arXiv preprint arXiv:2511.19009, 2025

work page arXiv 2025
[30]

Llms encode harmfulness and refusal separately,

J. Zhao, J. Huang, Z. Wu, D. Bau, and W. Shi, “Llms encode harmfulness and refusal separately,”Advances in Neural Information Processing Systems, vol. 38, pp. 140 283–140 318, 2026

2026
[31]

Improving alignment and robustness with circuit breakers,

A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, R. Wang, Z. Kolter, M. Fredrikson, and D. Hendrycks, “Improving alignment and robustness with circuit breakers,”Advances in Neural Information Processing Systems, vol. 37, pp. 83 345–83 373, 2024

2024
[32]

Representation bending for large language model safety,

A. Yousefpour, T. Kim, R. S. Kwon, S. Lee, W. Jeung, S. Han, A. Wan, H. Ngan, Y . Yu, and J. Choi, “Representation bending for large language model safety,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 24 073–24 098

2025
[33]

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

H. Zhang, Z. Zhang, M. Wang, Z. Su, Y . Wang, Q. Wang, S. Yuan, E. Nie, X. Duan, F. Hanet al., “Locate, steer, and improve: A practi- cal survey of actionable mechanistic interpretability in large language models,”arXiv preprint arXiv:2601.14004, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

R. Chen, A. Arditi, H. Sleight, O. Evans, and J. Lindsey, “Persona vectors: Monitoring and controlling character traits in language models,” arXiv preprint arXiv:2507.21509, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Stanford alpaca: An instruction-following llama model,

R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” 2023

2023
[36]

Metadefense: Defending fine-tuning based jailbreak attack before and during generation,

W. Jiang and S. Pan, “Metadefense: Defending fine-tuning based jailbreak attack before and during generation,”Advances in neural information processing systems, vol. 38, pp. 99 167–99 197, 2026

2026
[37]

Robust estimation of a location parameter,

P. J. Huber, “Robust estimation of a location parameter,” inBreak- throughs in statistics: Methodology and distribution. Springer, 1992, pp. 492–518

1992
[38]

Steering Llama 2 via Contrastive Activation Addition

N. Panickssery, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner, “Steering llama 2 via contrastive activation addition,”arXiv preprint arXiv:2312.06681, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Qwen2.5 technical report,

Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...
[40]

Qwen2.5 Technical Report

[Online]. Available: https://arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Keeping llms aligned after fine-tuning: The crucial role of prompt templates,

K. Lyu, H. Zhao, X. Gu, D. Yu, A. Goyal, and S. Arora, “Keeping llms aligned after fine-tuning: The crucial role of prompt templates,”Ad- vances in Neural Information Processing Systems, vol. 37, pp. 118 603– 118 631, 2024

2024
[42]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Liet al., “Harmbench: A standardized evaluation framework for automated red teaming and robust refusal,”arXiv preprint arXiv:2402.04249, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Free dolly: Introducing the world’s first truly open instructiontuned llm,

M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin, “Free dolly: Introducing the world’s first truly open instructiontuned llm,” 2023

2023
[44]

Safety alignment should be made more than just a few tokens deep,

X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson, “Safety alignment should be made more than just a few tokens deep,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 54 911–54 941

2025
[45]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models,

P. R ¨ottger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “Xstest: A test suite for identifying exaggerated safety behaviours in large language models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 5377–5400

2024
[46]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models,

L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y . Choiet al., “Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 47 094–47 165, 2024

2024

[1] [1]

Removing rlhf protections in gpt-4 via fine-tuning,

Q. Zhan, R. Fang, R. Bindu, A. Gupta, T. B. Hashimoto, and D. Kang, “Removing rlhf protections in gpt-4 via fine-tuning,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 2024, pp. 681–687

2024

[2] [2]

Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b,

P. Gade, S. Lermen, C. Rogers-Smith, and J. Ladish, “Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b,”arXiv preprint arXiv:2311.00117, 2023

work page arXiv 2023

[3] [3]

I’m sorry

X. Yang, X. Wang, Q. Zhang, L. Petzold, W. Y . Wang, X. Zhao, and D. Lin, “Shadow alignment: The ease of subverting safely-aligned language models,”arXiv preprint arXiv:2310.02949, 2023

work page arXiv 2023

[4] [4]

LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.arXiv preprint arXiv:2310.20624, 2023

S. Lermen, C. Rogers-Smith, and J. Ladish, “Lora fine-tuning effi- ciently undoes safety training in llama 2-chat 70b,”arXiv preprint arXiv:2310.20624, 2023

work page arXiv 2023

[5] [5]

Fine-tuning aligned language models compromises safety, even when users do not intend to!

X. Qi, Y . Zeng, T. Xie, P.-Y . Chen, R. Jia, P. Mittal, and P. Henderson, “Fine-tuning aligned language models compromises safety, even when users do not intend to!” inInternational Conference on Learning Representations, vol. 2024, 2024, pp. 30 988–31 043

2024

[6] [6]

arXiv preprint arXiv:2305.06972 , year=

J. Hazell, “Large language models can be used to effectively scale spear phishing campaigns,”arXiv preprint arXiv:2305.06972, p. 50, 2023

work page arXiv 2023

[7] [7]

Ai model gpt-3 (dis) informs us better than humans,

G. Spitale, N. Biller-Andorno, and F. Germani, “Ai model gpt-3 (dis) informs us better than humans,”Science Advances, vol. 9, no. 26, p. eadh1850, 2023

2023

[8] [8]

Can JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13 large language models democratize access to dual-use biotechnology?

E. H. Soice, R. Rocha, K. Cordova, M. Specter, and K. M. Esvelt, “Can JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13 large language models democratize access to dual-use biotechnology?” arXiv preprint arXiv:2306.03809, 2023

work page arXiv 2021

[9] [9]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughanet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

arXiv preprint arXiv:2402.05044 , year=

L. Li, B. Dong, R. Wang, X. Hu, W. Zuo, D. Lin, Y . Qiao, and J. Shao, “Salad-bench: A hierarchical and comprehensive safety benchmark for large language models,”arXiv preprint arXiv:2402.05044, 2024

work page arXiv 2024

[11] [11]

A holistic approach to undesired content detection in the real world,

T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng, “A holistic approach to undesired content detection in the real world,” inProceedings of the AAAI conference on artificial intelligence, vol. 37, no. 12, 2023, pp. 15 009–15 018

2023

[12] [12]

What is in your safe data? identifying benign data that breaks safety,

L. He, M. Xia, and P. Henderson, “What is in your safe data? identifying benign data that breaks safety,”arXiv preprint arXiv:2404.01099, 2024

work page arXiv 2024

[13] [13]

Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection,

H. Shen, P.-Y . Chen, P. Das, and T. Chen, “Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 31 243–31 264

2025

[14] [14]

Layer- aware representation filtering: Purifying finetuning data to preserve llm safety alignment,

H. Li, L. Li, Z. Lu, X. Wei, R. Li, J. Shao, and L. Sha, “Layer- aware representation filtering: Purifying finetuning data to preserve llm safety alignment,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 8041–8061

2025

[15] [15]

Gradsafe: Detecting jailbreak prompts for llms via safety-critical gradient analysis,

Y . Xie, M. Fang, R. Pi, and N. Gong, “Gradsafe: Detecting jailbreak prompts for llms via safety-critical gradient analysis,” inProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), 2024, pp. 507–518

2024

[16] [16]

Assessing the brittleness of safety alignment via pruning and low-rank modifications,

B. Wei, K. Huang, Y . Huang, T. Xie, X. Qi, M. Xia, P. Mittal, M. Wang, and P. Henderson, “Assessing the brittleness of safety alignment via pruning and low-rank modifications,” inInternational Conference on Machine Learning. PMLR, 2024, pp. 52 588–52 610

2024

[17] [17]

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

T. Huang, S. Hu, F. Ilhan, S. F. Tekin, and L. Liu, “Harmful fine- tuning attacks and defenses for large language models: A survey,”arXiv preprint arXiv:2409.18169, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation,

G. Liu, W. Lin, Q. Mu, T. Huang, R. Mo, Y . Tao, and L. Shen, “Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation,”IEEE Transactions on Information Forensics and Security, 2025

2025

[19] [19]

Safety layers in aligned large language models: The key to llm security,

S. Li, L. Yao, L. Zhang, and Y . Li, “Safety layers in aligned large language models: The key to llm security,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 98 163–98 189

2025

[20] [20]

Safe lora: The silver lining of reducing safety risks when finetuning large language models,

C.-Y . Hsu, Y .-L. Tsai, C.-H. Lin, P.-Y . Chen, C.-M. Yu, and C.-Y . Huang, “Safe lora: The silver lining of reducing safety risks when finetuning large language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 65 072–65 094, 2024

2024

[21] [21]

Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack,

T. Huang, S. Hu, and L. Liu, “Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack,”Advances in Neural Information Processing Systems, vol. 37, pp. 74 058–74 088, 2024

2024

[22] [22]

Sa- lora: Safety-alignment preserved low-rank adaptation,

M. Li, W. M. Si, M. Backes, Y . Zhang, and Y . Wang, “Sa- lora: Safety-alignment preserved low-rank adaptation,”arXiv preprint arXiv:2501.01765, 2025

work page arXiv 2025

[23] [23]

Refining positive and toxic samples for dual safety self-alignment of llms with minimal human interventions,

J. Xu, G. Nan, S. Guan, S. Leng, Y . Liu, Z. Wang, Y . Ma, Z. Zhou, Y . Hou, and X. Tao, “Refining positive and toxic samples for dual safety self-alignment of llms with minimal human interventions,”IEEE Transactions on Information Forensics and Security, 2026

2026

[24] [24]

Dia- logue injection attack: Jailbreaking llms through context manipulation,

W. Meng, F. Zhang, W. Yao, Z. Guo, Y . Li, C. Wei, and W. Chen, “Dia- logue injection attack: Jailbreaking llms through context manipulation,” IEEE Transactions on Information Forensics and Security, 2026

2026

[25] [25]

Jailbroken: How does llm safety training fail?

A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm safety training fail?”Advances in neural information processing systems, vol. 36, pp. 80 079–80 110, 2023

2023

[26] [26]

Representation Engineering: A Top-Down Approach to AI Transparency

A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowskiet al., “Representation en- gineering: A top-down approach to ai transparency,”arXiv preprint arXiv:2310.01405, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

Programming refusal with con- ditional activation steering,

B. W. Lee, I. Padhi, K. Natesan Ramamurthy, E. Miehling, P. Dognin, M. Nagireddy, and A. Dhurandhar, “Programming refusal with con- ditional activation steering,” inInternational conference on learning representations, vol. 2025, 2025, pp. 90 960–90 985

2025

[28] [28]

Refusal in language models is mediated by a single direction,

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda, “Refusal in language models is mediated by a single direction,”Advances in Neural Information Processing Systems, vol. 37, pp. 136 037–136 083, 2024

2024

[29] [29]

Understanding and mitigating over-refusal for large language models via safety repre- sentation,

J. Zhang, R. Chen, Q. Zhou, X. Deng, and W. Jiang, “Understanding and mitigating over-refusal for large language models via safety repre- sentation,”arXiv preprint arXiv:2511.19009, 2025

work page arXiv 2025

[30] [30]

Llms encode harmfulness and refusal separately,

J. Zhao, J. Huang, Z. Wu, D. Bau, and W. Shi, “Llms encode harmfulness and refusal separately,”Advances in Neural Information Processing Systems, vol. 38, pp. 140 283–140 318, 2026

2026

[31] [31]

Improving alignment and robustness with circuit breakers,

A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, R. Wang, Z. Kolter, M. Fredrikson, and D. Hendrycks, “Improving alignment and robustness with circuit breakers,”Advances in Neural Information Processing Systems, vol. 37, pp. 83 345–83 373, 2024

2024

[32] [32]

Representation bending for large language model safety,

A. Yousefpour, T. Kim, R. S. Kwon, S. Lee, W. Jeung, S. Han, A. Wan, H. Ngan, Y . Yu, and J. Choi, “Representation bending for large language model safety,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 24 073–24 098

2025

[33] [33]

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

H. Zhang, Z. Zhang, M. Wang, Z. Su, Y . Wang, Q. Wang, S. Yuan, E. Nie, X. Duan, F. Hanet al., “Locate, steer, and improve: A practi- cal survey of actionable mechanistic interpretability in large language models,”arXiv preprint arXiv:2601.14004, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

R. Chen, A. Arditi, H. Sleight, O. Evans, and J. Lindsey, “Persona vectors: Monitoring and controlling character traits in language models,” arXiv preprint arXiv:2507.21509, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Stanford alpaca: An instruction-following llama model,

R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” 2023

2023

[36] [36]

Metadefense: Defending fine-tuning based jailbreak attack before and during generation,

W. Jiang and S. Pan, “Metadefense: Defending fine-tuning based jailbreak attack before and during generation,”Advances in neural information processing systems, vol. 38, pp. 99 167–99 197, 2026

2026

[37] [37]

Robust estimation of a location parameter,

P. J. Huber, “Robust estimation of a location parameter,” inBreak- throughs in statistics: Methodology and distribution. Springer, 1992, pp. 492–518

1992

[38] [38]

Steering Llama 2 via Contrastive Activation Addition

N. Panickssery, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner, “Steering llama 2 via contrastive activation addition,”arXiv preprint arXiv:2312.06681, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Qwen2.5 technical report,

Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...

[40] [40]

Qwen2.5 Technical Report

[Online]. Available: https://arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Keeping llms aligned after fine-tuning: The crucial role of prompt templates,

K. Lyu, H. Zhao, X. Gu, D. Yu, A. Goyal, and S. Arora, “Keeping llms aligned after fine-tuning: The crucial role of prompt templates,”Ad- vances in Neural Information Processing Systems, vol. 37, pp. 118 603– 118 631, 2024

2024

[42] [42]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Liet al., “Harmbench: A standardized evaluation framework for automated red teaming and robust refusal,”arXiv preprint arXiv:2402.04249, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

Free dolly: Introducing the world’s first truly open instructiontuned llm,

M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin, “Free dolly: Introducing the world’s first truly open instructiontuned llm,” 2023

2023

[44] [44]

Safety alignment should be made more than just a few tokens deep,

X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson, “Safety alignment should be made more than just a few tokens deep,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 54 911–54 941

2025

[45] [45]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models,

P. R ¨ottger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy, “Xstest: A test suite for identifying exaggerated safety behaviours in large language models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 5377–5400

2024

[46] [46]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models,

L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y . Choiet al., “Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 47 094–47 165, 2024

2024