pith. machine review for the scientific record.

arxiv: 2605.04572 · v1 · submitted 2026-05-06 · 💻 cs.AI · cs.LG

Recognition: 3 Lean theorem links

From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:03 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM safety · fine-tuning · parameter dynamics · safety degradation · risk scoring · alignment · parameter updates

The pith

Benign fine-tuning drifts LLM parameters toward danger directions, enabling per-sample risk scores via projection differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fine-tuning on benign samples makes model parameters drift cumulatively toward directions aligned with unsafe outputs, gradually eroding safety alignments learned from large preference datasets. Tracking these parameter changes during training reveals that individual samples differ in how much they push the model toward danger versus safety. This insight produces SQSD, a method that assigns each sample a continuous risk score by measuring the projection difference of its induced parameter update onto pre-identified danger and safety directions. The resulting scores identify which samples contribute most to safety degradation without requiring full model retraining or separate evaluation runs. Experiments across models, datasets, and fine-tuning methods show the scores effectively predict and quantify sample-level risks.

Core claim

Benign fine-tuning causes parameters to cumulatively drift toward danger-aligned directions, progressively undermining the model's safety. This finding suggests that samples contributing more to this drift carry greater fine-tuning risk. SQSD quantifies the influence of each training sample on safety degradation by computing a continuous risk score from the projection difference of its induced parameter update onto danger versus safety directions.

What carries the argument

The projection difference of each sample's induced parameter update onto danger versus safety directions, which yields a continuous risk score quantifying that sample's contribution to safety degradation.
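Read literally, the scoring rule is a difference of two inner products. A minimal sketch, assuming flattened per-sample updates and pre-extracted unit direction vectors (the names `delta_theta`, `v_danger`, `v_safe` and the norm normalization, which guards against the response-length bias examined in Figure 7, are our assumptions rather than the paper's exact procedure):

```python
import numpy as np

def sqsd_score(delta_theta, v_danger, v_safe, normalize=True):
    """Projection-difference risk score for one sample's parameter update.

    delta_theta: flattened parameter update induced by the sample (e.g. -lr * grad)
    v_danger, v_safe: unit vectors for the danger / safety directions
    normalize: divide out the update's magnitude, since unnormalized updates
        can track response length rather than risk (cf. Figure 7)
    """
    if normalize:
        delta_theta = delta_theta / (np.linalg.norm(delta_theta) + 1e-12)
    return float(delta_theta @ v_danger - delta_theta @ v_safe)

# toy check: an update aligned with the danger direction scores positive,
# one aligned with the safety direction scores negative
rng = np.random.default_rng(0)
d = 1024
v_danger = rng.normal(size=d); v_danger /= np.linalg.norm(v_danger)
v_safe = rng.normal(size=d); v_safe /= np.linalg.norm(v_safe)
assert sqsd_score(3.0 * v_danger, v_danger, v_safe) > 0
assert sqsd_score(3.0 * v_safe, v_danger, v_safe) < 0
```

In high dimensions two random unit vectors are nearly orthogonal, so the toy check isolates each projection term cleanly.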

If this is right

  • Samples with higher SQSD scores cause measurably greater erosion of safety behaviors when included in fine-tuning data.
  • Risk scoring can be performed continuously during the fine-tuning process itself rather than only before or after.
  • The method remains effective across model architectures, parameter scales, and parameter-efficient fine-tuning techniques.
  • Filtering or reweighting high-risk samples identified by SQSD can reduce overall safety degradation in the resulting model.
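The last bullet suggests a straightforward curation loop once scores exist: rank samples by risk and keep the safest fraction. A sketch under that assumption (the function and its threshold are illustrative, not part of SQSD):

```python
def filter_by_risk(samples, scores, keep_fraction=0.9):
    """Drop the highest-risk tail of a fine-tuning dataset.

    samples: list of training examples
    scores: per-sample SQSD risk scores (higher = riskier)
    keep_fraction: fraction of lowest-risk samples to retain
    """
    assert len(samples) == len(scores)
    k = int(len(samples) * keep_fraction)
    # sort indices by ascending risk and keep the safest k
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    return [samples[i] for i in order[:k]]

data = ["a", "b", "c", "d", "e"]
risk = [0.1, 0.9, 0.2, 0.8, 0.0]
print(filter_by_risk(data, risk, keep_fraction=0.6))  # → ['e', 'a', 'c']
```

Reweighting instead of filtering would keep all samples but scale each loss term by a decreasing function of its risk score.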

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dynamic drift view implies safety could be monitored and corrected in real time during training rather than relying on static post-training checks.
  • Direction-based scoring might extend to quantifying risks for other alignment properties such as truthfulness or bias beyond safety.
  • If the directions prove stable across tasks, the approach could support proactive data curation to preserve multiple model behaviors at once.

Load-bearing premise

Danger-aligned and safety directions can be reliably identified in advance and remain stable enough that their projections on single-sample updates accurately measure risk.
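One hedged reading of how such directions might be extracted, consistent with the contrastive-gradient recipe mentioned in the simulated rebuttal: average gradients over references exhibiting the target behavior, subtract the average over contrasting references, and normalize. The toy gradient arrays and the function name below are illustrative, not the paper's construction:

```python
import numpy as np

def contrastive_direction(grads_pos, grads_neg):
    """Unit direction separating two reference behaviors in parameter space.

    grads_pos: gradients from reference prompts exhibiting the target behavior
               (e.g. harmful completions, for the danger direction)
    grads_neg: gradients from contrasting references (e.g. refusals)
    Returns the normalized mean-difference vector.
    """
    v = np.mean(grads_pos, axis=0) - np.mean(grads_neg, axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

# toy gradients in a 4-d parameter space
harmful = np.array([[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]])
refusal = np.array([[0.0, 1.0, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0]])
v_danger = contrastive_direction(harmful, refusal)
print(np.round(v_danger, 3))
```

The premise above then amounts to: this vector, computed once from a held-out reference set, still separates risky from safe updates later in training.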

What would settle it

If high-risk samples according to the scores produce no greater safety degradation on benchmarks than low-risk samples, or if the computed projection differences fail to correlate with observed safety loss after fine-tuning.

Figures

Figures reproduced from arXiv: 2605.04572 by Daling Wang, Shi Feng, Xiaocui Yang, Xiao Wang, Yifei Zhang, Yongkang Liu, Zihan Wang.

Figure 1
Figure 1: Overview of the safety degradation mechanism and SQSD. (a) Fine-tuning trajectory showing cumulative parameter drift toward the danger-aligned direction in parameter space. (b) SQSD computes risk scores by measuring the projection gap between sample-induced parameter updates and safety-relevant directions; a larger danger projection minus safety projection indicates higher risk. view at source ↗
Figure 2
Figure 2: Parameter drift trajectories along safety and danger directions during fine-tuning (Qwen3-8B fine-tuned on 5k Dolly samples). Safe Score is a safety metric (higher is safer); ⟨Δθ, V⟩ is the projection of parameter drift onto each direction. Details of the safety-related directions are provided in §3.1. view at source ↗
Figure 4
Figure 4: Impact of dataset scale on parameter drift. Trajectories for Qwen3-8B on 3k–50k Alpaca samples. view at source ↗
Figure 3
Figure 3: Consistency of the parameter-space mechanism across models and datasets. Parameter trajectories along safety and danger directions for three models (Llama-3.1-8B-Instruct, Qwen3-8B, Llama-2-7B-Chat) fine-tuned on 5k-Dolly and 5k-Alpaca. view at source ↗
Figure 5
Figure 5: Parameter steering validation. Safety Score as a function of steering magnitude α for different directions. view at source ↗
Figure 6
Figure 6: Impact of learning rate on SQSD performance. ASR on CategoricalHarmfulQA for Qwen3-8B fine-tuned on Dolly subsets (S1–S5) ranked by SQSD computed at different learning rates. view at source ↗
Figure 7
Figure 7: Response-length bias in unnormalized risk scoring. Average response length and ASR for Qwen3-8B fine-tuned on Dolly subsets ranked by (a) response length and (b) unnormalized SQSD. view at source ↗
Figure 8
Figure 8: Loss distribution across response-length groups. Cross-entropy loss distributions for Dolly samples grouped by response length: Top 1000 (173–321 tokens), Middle 1000 (40–49 tokens), and Bottom 1000 (4–9 tokens). view at source ↗
Figure 9
Figure 9: Per-token cross-entropy loss for short-response samples. Loss distribution across tokens for 9 representative short-response samples; the final token in each sample is <|im_end|>. view at source ↗
Figure 10
Figure 10: Per-token cross-entropy loss for middle-response samples. Loss distribution across tokens for 9 representative middle-response samples; the final token in each sample is <|im_end|>. view at source ↗
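The steering validation in Figure 5 has a simple mechanical core: shift the parameters by α along a candidate direction and re-measure safety. A minimal sketch, with a toy metric standing in for the paper's Safety Score (the function names and the metric are illustrative):

```python
import numpy as np

def steer(theta, v, alpha):
    """Shift parameters along a (unit) direction v: theta' = theta + alpha * v."""
    return theta + alpha * v

def steering_sweep(theta, v, alphas, safety_score):
    """Evaluate a safety metric at each steering magnitude (cf. Figure 5)."""
    return [safety_score(steer(theta, v, a)) for a in alphas]

# toy setup: safety falls monotonically as we move along the danger direction
theta0 = np.zeros(3)
v_danger = np.array([1.0, 0.0, 0.0])
score = lambda th: float(10.0 - th[0])  # placeholder safety metric
curve = steering_sweep(theta0, v_danger, [-1.0, 0.0, 1.0], score)
print(curve)  # → [11.0, 10.0, 9.0]
```

A direction passes this validation when the measured safety curve responds monotonically to α, as the toy metric does by construction.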
read the original abstract

Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidden states before and after fine-tuning, but overlook their dynamic evolution during fine-tuning. In this paper, we uncover a critical mechanism underlying safety degradation by analyzing parameter dynamics, where benign fine-tuning causes parameters to cumulatively drift toward danger-aligned directions, progressively undermining the model's safety. This finding suggests that samples contributing more to this drift has greater fine-tuning risks. Based on this insight, we propose a method of Sample-Level Quantification of Safety Degradation (SQSD), which quantifies the influence of each training sample on safety degradation. Specifically, SQSD computes continuous risk scores to samples by measuring their induced parameter updates' projection difference between danger and safety directions. Extensive experiments across multiple models and datasets demonstrate that SQSD effectively quantifies sample-level fine-tuning risks and exhibits strong transferability across model architectures, parameter scales, and parameter-efficient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that benign fine-tuning on LLMs induces cumulative parameter drift toward danger-aligned directions, progressively eroding safety. It introduces Sample-Level Quantification of Safety Degradation (SQSD), which assigns continuous risk scores to individual training samples by measuring the projection difference of each sample's induced parameter update onto danger versus safety directions. Experiments across multiple models, datasets, and parameter-efficient methods are reported to show that SQSD quantifies fine-tuning risks effectively and transfers across architectures and scales.

Significance. If the danger and safety directions can be shown to be defined independently of the fine-tuning data and trajectories, and if the projection-difference scores are demonstrated to be predictive rather than post-hoc, the work would provide a concrete, sample-level diagnostic for safety degradation that goes beyond static before/after comparisons. This could inform safer fine-tuning practices and sample selection in alignment pipelines.

major comments (2)
  1. [SQSD method definition] The central SQSD construction (described in the method section following the parameter-dynamics analysis) defines risk via projection differences onto danger-aligned and safety directions. The manuscript must specify exactly how these directions are extracted (e.g., contrastive gradients, PCA on hidden states, or fixed reference prompts) and must prove that the extraction uses only data or model states disjoint from the fine-tuning runs and samples being scored. Any overlap renders the metric circular by construction, undermining the claim that SQSD quantifies 'influence' or 'risk' rather than recovering the direction definition itself.
  2. [Experiments and transferability results] The abstract and experimental claims assert 'strong transferability' and 'continuous risk scores' that track progressive safety degradation. However, the manuscript does not appear to test whether the fixed initial directions remain stationary as fine-tuning rotates the effective subspace; if projections are taken only onto the initial directions, cumulative drift may invalidate the scores after the first few epochs. A direct check (e.g., recomputing directions at intermediate checkpoints and measuring rank correlation of per-sample scores) is required to support the dynamic-drift narrative.
minor comments (2)
  1. Notation for the projection operator and the danger/safety direction vectors should be introduced with explicit equations and kept consistent between the method description and the experimental figures.
  2. The abstract states that 'samples contributing more to this drift has greater fine-tuning risks'; the grammar and the precise causal link between drift contribution and downstream safety violation should be clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of methodological clarity and empirical validation that we have addressed through targeted revisions. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [SQSD method definition] The central SQSD construction (described in the method section following the parameter-dynamics analysis) defines risk via projection differences onto danger-aligned and safety directions. The manuscript must specify exactly how these directions are extracted (e.g., contrastive gradients, PCA on hidden states, or fixed reference prompts) and must prove that the extraction uses only data or model states disjoint from the fine-tuning runs and samples being scored. Any overlap renders the metric circular by construction, undermining the claim that SQSD quantifies 'influence' or 'risk' rather than recovering the direction definition itself.

    Authors: We appreciate this clarification request. In the revised manuscript we have expanded the method section (now Section 3.2) to state explicitly that safety and danger directions are obtained via contrastive gradients on a fixed collection of reference prompts drawn from established safety benchmarks. These reference prompts are drawn from a held-out pool that shares no samples or trajectories with any fine-tuning dataset used in the reported experiments. We have added a short formal argument (new Appendix C) showing that the direction vectors are computed solely from this disjoint reference set, thereby eliminating circularity and confirming that SQSD measures genuine per-sample influence on the observed drift. revision: yes

  2. Referee: [Experiments and transferability results] The abstract and experimental claims assert 'strong transferability' and 'continuous risk scores' that track progressive safety degradation. However, the manuscript does not appear to test whether the fixed initial directions remain stationary as fine-tuning rotates the effective subspace; if projections are taken only onto the initial directions, cumulative drift may invalidate the scores after the first few epochs. A direct check (e.g., recomputing directions at intermediate checkpoints and measuring rank correlation of per-sample scores) is required to support the dynamic-drift narrative.

    Authors: We agree that stationarity of the reference directions merits explicit verification. Although the core narrative concerns cumulative drift from the initial state, we have added a new subsection (Section 4.4) containing the requested check: directions are recomputed at multiple intermediate checkpoints, and rank correlations of the resulting per-sample SQSD scores are reported. The observed Spearman correlations remain high (average ρ > 0.75 across models and epochs), indicating that projections onto the initial directions continue to track the drift trajectory. These results have been incorporated to strengthen the transferability claims. revision: yes
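The stationarity check described in this response reduces to: score every sample with directions extracted at different checkpoints, then compare the rankings. A dependency-free sketch of Spearman's ρ for the tie-free case, with toy score arrays standing in for real per-sample SQSD scores:

```python
def rank(xs):
    """0-based ascending ranks of values; assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(a, b):
    """Spearman rho as the Pearson correlation of ranks (tie-free case)."""
    n = len(a)
    ra, rb = rank(a), rank(b)
    mean = (n - 1) / 2
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)
    return cov / var

# per-sample scores computed with directions from two checkpoints (toy values)
scores_ckpt0 = [0.9, 0.1, 0.5, 0.7, 0.3]
scores_ckpt5 = [0.8, 0.2, 0.4, 0.9, 0.1]
rho = spearman(scores_ckpt0, scores_ckpt5)
print(round(rho, 2))  # → 0.8
```

A rank correlation near 1 across checkpoints, as the rebuttal reports (average ρ > 0.75), would indicate that the initial directions keep ordering samples consistently even as parameters drift.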

Circularity Check

0 steps flagged

No circularity: directions derived from observed dynamics; SQSD is a downstream quantification, not a tautology.

full rationale

The paper first analyzes the full fine-tuning trajectory to identify cumulative parameter drift toward danger-aligned directions (an empirical observation across the run). SQSD then applies a projection-difference metric to individual sample updates using those directions. This is a standard post-hoc attribution step rather than a self-referential definition: the directions are not fitted to the per-sample risk scores, nor are the risk scores used to define the directions. No equation reduces the output risk score to the input by algebraic identity or by re-using the same fitted parameters. The method remains falsifiable against held-out samples or external safety benchmarks. No self-citation load-bearing step or ansatz smuggling appears in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the existence of stable danger and safety directions in parameter space that can be extracted once and then used to score arbitrary samples. No explicit free parameters or invented entities are named in the abstract, but the direction vectors function as implicit fitted constructs.

axioms (1)
  • domain assumption: Danger-aligned and safety directions exist and can be identified from model parameters or gradients.
    Invoked when defining the projection difference for risk scoring.

pith-pipeline@v0.9.0 · 5505 in / 1159 out tokens · 51747 ms · 2026-05-08T18:03:37.424757+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

  3. [3]

    Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models

    Chen, P.-Y., Shen, H., Das, P., and Chen, T. Fundamental safety-capability trade-offs in fine-tuning large language models. arXiv preprint arXiv:2503.20807, 2025.

  4. [4]

    AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

    Ghosh, S., Varshney, P., Galinkin, E., and Parisien, C. AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993, 2024.

  5. [5]

    Benign Samples Matter! Fine-tuning on Outlier Benign Samples Severely Breaks Safety

    Guan, Z., Hu, M., Zhu, R., Li, S., and Vullikanti, A. Benign samples matter! Fine-tuning on outlier benign samples severely breaks safety. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025.

  6. [6]

    What Is in Your Safe Data? Identifying Benign Data That Breaks Safety

    He, L., Xia, M., and Henderson, P. What is in your safe data? Identifying benign data that breaks safety. arXiv preprint arXiv:2404.01099, 2024.

  7. [7]

    Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

    Hsiung, L., Pang, T., Tang, Y.-C., Song, L., Ho, T.-Y., Chen, P.-Y., and Yang, Y. Why LLM safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets. arXiv preprint arXiv:2506.05346, 2025.

  8. [8]

    Editing Models with Task Arithmetic

    Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.

  9. [9]

    Fine-tuning and Utilization Methods of Domain-Specific LLMs

    Jeong, C. Fine-tuning and utilization methods of domain-specific LLMs. arXiv preprint arXiv:2401.02981, 2024.

  10. [10]

    Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment After Instruction Tuning

    Li, G., Chen, K., Guo, S., Zhang, J., Qiu, H., Zhang, C., Wang, G., Zhang, T., and Li, J. Picky LLMs and unreliable RMs: An empirical study on safety alignment after instruction tuning. arXiv preprint arXiv:2502.01116, 2025.
    Li, H., Li, L., Lu, Z., Wei, X., Li, R., Shao, J., and Sha, L. Layer-aware representation filtering: Purifying finetuning data to ...

  11. [11]

    The Llama 3 Herd of Models

    Dubey, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv.org/abs/2407.21783.
    Lu, W., Luu, R. K., and Buehler, M. J. Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities. npj Computational Materials, 11(1):84, 2025.

  12. [12]

    Accidental Misalignment: Fine-tuning Language Models Induces Unexpected Vulnerability

    Pandey, P. S., Simko, S., Pelrine, K., and Jin, Z. Accidental misalignment: Fine-tuning language models induces unexpected vulnerability. arXiv preprint arXiv:2505.16789, 2025.

  13. [13]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., and Henderson, P. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.

  14. [14]

    Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

    Qian, C., Zhang, J., Yao, W., Liu, D., Yin, Z., Qiao, Y., Liu, Y., and Shao, J. Towards tracing trustworthiness dynamics: Revisiting pre-training period of large language models. arXiv preprint arXiv:2402.19465, 2024.

  15. [15]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  16. [16]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  17. [17]

    Removing RLHF Protections in GPT-4 via Fine-tuning

    Zhan, Q., Fang, R., Bindu, R., Gupta, A., Hashimoto, T. B., and Kang, D. Removing RLHF protections in GPT-4 via fine-tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 681–687, 2024.