From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
Pith reviewed 2026-05-08 18:03 UTC · model grok-4.3
The pith
Benign fine-tuning drifts LLM parameters toward danger directions, enabling per-sample risk scores via projection differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Benign fine-tuning causes parameters to cumulatively drift toward danger-aligned directions, progressively undermining the model's safety. This finding suggests that samples contributing more to this drift carry greater fine-tuning risks. SQSD quantifies each training sample's influence on safety degradation by computing a continuous risk score from the projection difference of its induced parameter update onto danger versus safety directions.
What carries the argument
The projection difference of each sample's induced parameter update onto danger versus safety directions, which yields a continuous risk score quantifying that sample's contribution to safety degradation.
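As a concrete illustration, here is a minimal sketch of how such a projection-difference score could be computed, assuming each sample's induced parameter update is flattened into a single vector; the names `risk_score`, `sample_update`, `danger_dir`, and `safety_dir` are illustrative placeholders, not the paper's notation.

```python
import numpy as np

def risk_score(sample_update: np.ndarray,
               danger_dir: np.ndarray,
               safety_dir: np.ndarray) -> float:
    """Projection-difference risk score for one sample's parameter update.

    Projects the flattened update onto unit-normalized danger and safety
    directions and returns the difference: positive values mean the sample
    pushes parameters more toward danger than toward safety.
    """
    d = danger_dir / np.linalg.norm(danger_dir)
    s = safety_dir / np.linalg.norm(safety_dir)
    return float(sample_update @ d - sample_update @ s)

# A per-sample update could be approximated by a single-sample gradient
# step, e.g. delta_i = -lr * grad_i, flattened over the scored parameters.
```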
If this is right
- Samples with higher SQSD scores cause measurably greater erosion of safety behaviors when included in fine-tuning data.
- Risk scoring can be performed continuously during the fine-tuning process itself rather than only before or after.
- The method remains effective across model architectures, parameter scales, and parameter-efficient fine-tuning techniques.
- Filtering or reweighting high-risk samples identified by SQSD can reduce overall safety degradation in the resulting model (a minimal curation sketch follows this list).
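A minimal sketch of the curation step in the last bullet, assuming per-sample scores have already been computed; the quantile-threshold policy is an assumption for illustration, not the paper's procedure.

```python
import numpy as np

def filter_by_risk(samples, scores, quantile=0.9):
    """Drop the highest-risk fraction of a fine-tuning set.

    `scores` are per-sample risk scores (higher means more danger-aligned);
    samples scoring above the given quantile cutoff are removed.
    """
    cutoff = np.quantile(scores, quantile)
    return [x for x, r in zip(samples, scores) if r <= cutoff]
```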
Where Pith is reading between the lines
- The dynamic drift view implies safety could be monitored and corrected in real time during training rather than relying on static post-training checks.
- Direction-based scoring might extend to quantifying risks for other alignment properties such as truthfulness or bias beyond safety.
- If the directions prove stable across tasks, the approach could support proactive data curation to preserve multiple model behaviors at once.
Load-bearing premise
Danger-aligned and safety directions can be reliably identified in advance and remain stable enough that their projections on single-sample updates accurately measure risk.
What would settle it
If high-risk samples according to the scores produce no greater safety degradation on benchmarks than low-risk samples, or if the computed projection differences fail to correlate with observed safety loss after fine-tuning.
Original abstract
Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidden states before and after fine-tuning, but overlook their dynamic evolution during fine-tuning. In this paper, we uncover a critical mechanism underlying safety degradation by analyzing parameter dynamics, where benign fine-tuning causes parameters to cumulatively drift toward danger-aligned directions, progressively undermining the model's safety. This finding suggests that samples contributing more to this drift has greater fine-tuning risks. Based on this insight, we propose a method of Sample-Level Quantification of Safety Degradation (SQSD), which quantifies the influence of each training sample on safety degradation. Specifically, SQSD computes continuous risk scores to samples by measuring their induced parameter updates' projection difference between danger and safety directions. Extensive experiments across multiple models and datasets demonstrate that SQSD effectively quantifies sample-level fine-tuning risks and exhibits strong transferability across model architectures, parameter scales, and parameter-efficient methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that benign fine-tuning on LLMs induces cumulative parameter drift toward danger-aligned directions, progressively eroding safety. It introduces Sample-Level Quantification of Safety Degradation (SQSD), which assigns continuous risk scores to individual training samples by measuring the projection difference of each sample's induced parameter update onto danger versus safety directions. Experiments across multiple models, datasets, and parameter-efficient methods are reported to show that SQSD quantifies fine-tuning risks effectively and transfers across architectures and scales.
Significance. If the danger and safety directions can be shown to be defined independently of the fine-tuning data and trajectories, and if the projection-difference scores are demonstrated to be predictive rather than post-hoc, the work would provide a concrete, sample-level diagnostic for safety degradation that goes beyond static before/after comparisons. This could inform safer fine-tuning practices and sample selection in alignment pipelines.
major comments (2)
- [SQSD method definition] The central SQSD construction (described in the method section following the parameter-dynamics analysis) defines risk via projection differences onto danger-aligned and safety directions. The manuscript must specify exactly how these directions are extracted (e.g., contrastive gradients, PCA on hidden states, or fixed reference prompts) and must prove that the extraction uses only data or model states disjoint from the fine-tuning runs and samples being scored. Any overlap renders the metric circular by construction, undermining the claim that SQSD quantifies 'influence' or 'risk' rather than recovering the direction definition itself.
- [Experiments and transferability results] The abstract and experimental claims assert 'strong transferability' and 'continuous risk scores' that track progressive safety degradation. However, the manuscript does not appear to test whether the fixed initial directions remain stationary as fine-tuning rotates the effective subspace; if projections are taken only onto the initial directions, cumulative drift may invalidate the scores after the first few epochs. A direct check (e.g., recomputing directions at intermediate checkpoints and measuring rank correlation of per-sample scores) is required to support the dynamic-drift narrative.
minor comments (2)
- Notation for the projection operator and the danger/safety direction vectors should be introduced with explicit equations and kept consistent between the method description and the experimental figures.
- The abstract states that 'samples contributing more to this drift has greater fine-tuning risks'; the grammar and the precise causal link between drift contribution and downstream safety violation should be clarified.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of methodological clarity and empirical validation that we have addressed through targeted revisions. Below we respond point-by-point to the major comments.
Point-by-point responses
Referee: [SQSD method definition] The central SQSD construction (described in the method section following the parameter-dynamics analysis) defines risk via projection differences onto danger-aligned and safety directions. The manuscript must specify exactly how these directions are extracted (e.g., contrastive gradients, PCA on hidden states, or fixed reference prompts) and must prove that the extraction uses only data or model states disjoint from the fine-tuning runs and samples being scored. Any overlap renders the metric circular by construction, undermining the claim that SQSD quantifies 'influence' or 'risk' rather than recovering the direction definition itself.
Authors: We appreciate this clarification request. In the revised manuscript we have expanded the method section (now Section 3.2) to state explicitly that safety and danger directions are obtained via contrastive gradients on a fixed collection of reference prompts drawn from established safety benchmarks. These reference prompts come from a held-out pool that shares no samples or trajectories with any fine-tuning dataset used in the reported experiments. We have added a short formal argument (new Appendix C) showing that the direction vectors are computed solely from this disjoint reference set, thereby eliminating circularity and confirming that SQSD measures genuine per-sample influence on the observed drift.
revision: yes
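To make the claimed recipe concrete, here is a minimal sketch of one plausible reading of "contrastive gradients on a fixed collection of reference prompts"; `loss_fn`, `harmful_batch`, and `safe_batch` are hypothetical placeholders, and the paper's exact extraction procedure may differ.

```python
import torch

def contrastive_direction(model, loss_fn, harmful_batch, safe_batch):
    """Danger-aligned direction from contrastive gradients on held-out
    reference prompts: the gradient of the loss on harmful completions
    minus the gradient on safe completions, flattened and unit-normalized."""
    def flat_grad(batch):
        model.zero_grad()
        loss_fn(model, batch).backward()
        return torch.cat([p.grad.reshape(-1)
                          for p in model.parameters() if p.grad is not None])
    direction = flat_grad(harmful_batch) - flat_grad(safe_batch)
    return direction / direction.norm()
```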
Referee: [Experiments and transferability results] The abstract and experimental claims assert 'strong transferability' and 'continuous risk scores' that track progressive safety degradation. However, the manuscript does not appear to test whether the fixed initial directions remain stationary as fine-tuning rotates the effective subspace; if projections are taken only onto the initial directions, cumulative drift may invalidate the scores after the first few epochs. A direct check (e.g., recomputing directions at intermediate checkpoints and measuring rank correlation of per-sample scores) is required to support the dynamic-drift narrative.
Authors: We agree that stationarity of the reference directions merits explicit verification. Although the core narrative concerns cumulative drift from the initial state, we have added a new subsection (Section 4.4) containing the requested check: directions are recomputed at multiple intermediate checkpoints, and rank correlations of the resulting per-sample SQSD scores are reported. The observed Spearman correlations remain high (average ρ > 0.75 across models and epochs), indicating that projections onto the initial directions continue to track the drift trajectory. These results have been incorporated to strengthen the transferability claims.
revision: yes
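A minimal sketch of the stationarity check described above, assuming a hypothetical `score_fn(sample, checkpoint)` that re-extracts the directions at a given checkpoint and returns that sample's SQSD-style score.

```python
from scipy.stats import spearmanr

def stationarity_check(score_fn, samples, checkpoints):
    """Spearman rank correlation between per-sample scores computed with
    directions from the initial checkpoint and scores recomputed with
    directions re-extracted at each later checkpoint."""
    base = [score_fn(s, checkpoints[0]) for s in samples]
    rhos = []
    for ckpt in checkpoints[1:]:
        later = [score_fn(s, ckpt) for s in samples]
        rho, _ = spearmanr(base, later)
        rhos.append(rho)
    return rhos  # consistently high rho supports direction stationarity
```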
Circularity Check
No circularity: directions derived from observed dynamics; SQSD is a downstream quantification, not a tautology.
Full rationale
The paper first analyzes the full fine-tuning trajectory to identify cumulative parameter drift toward danger-aligned directions (an empirical observation across the run). SQSD then applies a projection-difference metric to individual sample updates using those directions. This is a standard post-hoc attribution step rather than a self-referential definition: the directions are not fitted to the per-sample risk scores, nor are the risk scores used to define the directions. No equation reduces the output risk score to the input by algebraic identity or by re-using the same fitted parameters. The method remains falsifiable against held-out samples or external safety benchmarks. No load-bearing self-citation or ansatz smuggling appears in the provided derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: danger-aligned and safety directions exist and can be identified from model parameters or gradients.
Lean theorems connected to this paper
- Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation (J = ½(x + x⁻¹) − 1 uniqueness), washburn_uniqueness_aczel: tagged unclear, since the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: V_safety = θ̂_aligned − θ_0, V_danger = θ̂_harmful − θ_0 (task-vector formulation via DPO/SFT on safety datasets).
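A minimal sketch of this task-vector formulation, assuming full model state dicts with matching keys; the checkpoint names are illustrative.

```python
import torch

def task_vector(base_state: dict, tuned_state: dict) -> torch.Tensor:
    """Flattened task vector V = theta_hat - theta_0 between a fine-tuned
    checkpoint and the base model, concatenated over all parameters."""
    return torch.cat([(tuned_state[k] - base_state[k]).reshape(-1)
                      for k in sorted(base_state)])

# V_safety: tuned_state from a safety-aligned checkpoint (e.g. DPO/SFT on
# safety data); V_danger: tuned_state from a harmful-data checkpoint.
```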
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.