Pith · machine review for the scientific record

arXiv: 2605.02914 · v1 · submitted 2026-04-08 · cs.LG · cs.AI · cs.CR

Recognition: no theorem link

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models


Pith reviewed 2026-05-10 18:38 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CR
keywords: guard models · safety alignment · fine-tuning vulnerabilities · representational geometry · subspace regularization · agentic systems · refusal behavior

The pith

Fine-tuning guard models on benign data destroys their safety alignment by collapsing internal representational boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that guard models specialized for safety in agentic systems can lose all refusal capability when fine-tuned on entirely benign data: the structured boundary in their activations between harmful and safe inputs is erased. This collapse occurs through ordinary domain adaptation rather than attack; in the worst case, Granite Guardian, refusal rates reach zero and every output turns ambiguous. A sympathetic reader should care because these models serve as protective layers in AI pipelines, yet their safety can vanish without a single adversarial input. The authors measure the process using per-layer subspaces and introduce a regularization approach that actively preserves the boundary during training.

Core claim

Guard models can lose all safety alignment when fine-tuned on benign data because the latent safety geometry, defined as the harmful-benign representational boundary, is destroyed by standard domain specialization. This is evidenced by Granite Guardian's refusal rate dropping from 85% to 0%, CKA similarity falling to zero, and all outputs becoming ambiguous. The specialization hypothesis explains this as a trade-off: concentrated safety features enable efficiency but invite catastrophic brittleness. A mitigation called Fisher-Weighted Safety Subspace Regularization recovers most of the refusal behavior by penalizing changes to the safety directions.

What carries the argument

The safety subspace obtained from singular value decomposition on class-conditional activation differences, which encodes the boundary used for safety classification and whose integrity determines refusal performance.
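The extraction step can be sketched concretely. The paper describes SVD on class-conditional activation differences without an explicit equation, so the prompt pairing and the rank cutoff `k` below are our assumptions, not the authors' recipe:

```python
import numpy as np

def safety_subspace(h_harmful, h_benign, k=4):
    """Rank-k safety subspace for one layer.

    h_harmful, h_benign: (n, d_model) activations for paired harmful
    and benign prompts (the pairing is our assumption; the paper says
    only 'class-conditional activation differences').
    Returns U: (d_model, k) orthonormal basis of the top-k right
    singular directions of the difference matrix.
    """
    D = h_harmful - h_benign                    # class-conditional differences
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:k].T                             # (d_model, k)

# synthetic check: plant a single harmful-vs-benign direction
rng = np.random.default_rng(0)
d = 64
direction = np.zeros(d)
direction[0] = 1.0
benign = rng.normal(size=(32, d))
harmful = benign + 5.0 * direction              # shift along the planted axis
U = safety_subspace(harmful, benign, k=1)
print(abs(U[:, 0] @ direction))                 # ~1.0: recovered the axis
```

On real activations the difference matrix is noisy and the useful rank is an empirical choice; the point of the sketch is only that the subspace is a data-derived object, re-extractable after any fine-tuning run.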

If this is right

  • Metrics based on representational geometry, such as centered kernel alignment and Fisher information scores, forecast safety retention more accurately than simple measures of parameter change.
  • Applying curvature-aware regularization during fine-tuning can restore high refusal rates and near-original geometry in affected models.
  • Agentic AI pipelines require ongoing geometric monitoring of their guard layers to ensure continued protection.
  • Purpose-built safety classifiers exhibit greater vulnerability to this collapse compared to general-purpose models due to their focused representations.
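Of the geometry metrics above, centered kernel alignment is the easiest to reproduce. A minimal linear-CKA sketch (the paper does not say whether it uses the linear or RBF variant, so linear is our assumption):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between representation matrices
    X, Y of shape (n_examples, d). 1.0 means identical geometry up to
    rotation and scale; values near 0 mean no shared structure."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 32))
R, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # random rotation
Z = rng.normal(size=(100, 32))                   # unrelated representations
print(round(linear_cka(X, X), 3))                # 1.0: identical geometry
print(round(linear_cka(X, X @ R), 3))            # 1.0: invariant to rotation
print(linear_cka(X, Z) < 0.6)                    # True: unrelated reps score low
```

The rotation invariance is what distinguishes CKA from raw parameter-displacement metrics: a fine-tuned model can move far in weight space while keeping CKA near 1, or move little while collapsing CKA to 0, which is exactly the dissociation the paper leans on.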

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This vulnerability implies that safety training for specialized models may need to incorporate explicit preservation of key subspaces from the start.
  • Similar geometric collapse could affect other safety mechanisms in AI, suggesting broader use of subspace monitoring techniques.
  • Testing the approach on additional models and fine-tuning scenarios would help determine how general the collapse phenomenon is.

Load-bearing premise

The assumption that the safety subspaces identified by SVD on activation differences are causally responsible for the model's safety classifications.

What would settle it

Performing benign fine-tuning on one of the tested guard models and observing that refusal rates stay near the original level with no drop in geometric similarity metrics.

Figures

Figures reproduced from arXiv: 2605.02914 by Ismail Hossain, Jannatul Ferdaus, Md Jahangir Alam, Sai Puppala, Sajedul Talukder, Syed Bahauddin Alam, Yoonpyo Lee.

Figure 1: Benign Fine-Tuning Destroys Safety Alignment. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2: Safety alignment collapse and recovery in guard models under benign fine-tuning. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]
Figure 3: Granite Guardian 3.0-2B, latent space geometry at layer 39: harmful vs. benign prompt separation. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]
Figure 4: Per-Layer Safety Drift Heatmap.

The improvement beyond the original baseline is explained geometrically: the Fisher-weighted regularization actively sharpens the harmful-benign classification boundary (Fisher score 0.118 → 0.153; inter-class distance 37.1 → 62.7), rather than merely anchoring it to its pre-fine-tuning state. Finally, the highest residual ASR under FW-SSR occurs on AdvBench (5.0%), whose direct…
Original abstract

A guard model fine-tuned on entirely benign data can lose all safety alignment, not through adversarial manipulation, but through standard domain specialization. We demonstrate this failure across three purpose-built safety classifiers (LlamaGuard, WildGuard, and Granite Guardian) deployed as protection layers in agentic AI pipelines, and show that it originates in the destruction of latent safety geometry: the structured harmful-benign representational boundary that guides classification. We extract per-layer safety subspaces via SVD on class-conditional activation differences and track how this boundary evolves under benign fine-tuning. Granite Guardian undergoes complete collapse: refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous, a severity exceeding prior findings on general-purpose LLMs, explained by the specialization hypothesis: concentrated safety representations are efficient but catastrophically brittle. To mitigate this, we propose Fisher-Weighted Safety Subspace Regularization (FW-SSR), a training-time penalty combining (i) curvature-aware direction weights derived from diagonal Fisher information and (ii) an adaptive λ_t that scales with task-safety gradient conflict. FW-SSR recovers 75% refusal on Granite Guardian (CKA = 0.983) and reduces WildGuard's Attack Success Rate to 3.6%, below the unmodified baseline, by actively sharpening the safety subspace rather than merely anchoring it. Across all three models, structural representational geometry (CKA, Fisher score) predicts safety behavior more reliably than absolute displacement metrics, establishing geometry-based monitoring as a necessary component of guard model evaluation in agentic deployments.
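The abstract names FW-SSR's two ingredients without giving a closed form, so the following is a hedged sketch: a quadratic drift penalty along the extracted safety directions, Fisher-weighted per direction, scaled by a conflict-driven λ_t. Both the max(0, -cosine) schedule and the exact penalty shape are our assumptions:

```python
import numpy as np

def adaptive_lambda(g_task, g_safety, lam_max=1.0):
    """Conflict-driven regularization strength. The paper says only
    that lambda_t 'scales with task-safety gradient conflict'; this
    max(0, -cosine) schedule is our assumption: 0 when the gradients
    agree, lam_max when they are directly opposed."""
    cos = g_task @ g_safety / (
        np.linalg.norm(g_task) * np.linalg.norm(g_safety) + 1e-12)
    return lam_max * max(0.0, -cos)

def fw_ssr_penalty(theta, theta_ref, U, fisher_diag, lam_t):
    """One plausible reading of the FW-SSR penalty: quadratic drift of
    theta away from theta_ref along safety directions U (d, k), each
    direction weighted by the diagonal-Fisher curvature it carries."""
    drift = theta - theta_ref
    w = np.array([u @ (fisher_diag * u) for u in U.T])  # curvature per direction
    proj = U.T @ drift                                  # drift inside the subspace
    return lam_t * float(w @ proj ** 2)

d = 8
theta_ref = np.zeros(d)
U = np.eye(d)[:, :1]            # toy safety subspace: the first axis
fisher = np.ones(d)
g = np.ones(d)
print(adaptive_lambda(g, g))    # 0.0: aligned gradients, no extra penalty
drift_out = 3.0 * np.eye(d)[1]  # drift orthogonal to the safety axis
print(fw_ssr_penalty(theta_ref + drift_out, theta_ref, U, fisher, 1.0))  # 0.0
drift_in = 3.0 * np.eye(d)[0]   # drift along the safety axis is penalized
print(fw_ssr_penalty(theta_ref + drift_in, theta_ref, U, fisher, 1.0))   # 9.0
```

Note what the structure buys: drift confined to directions outside the safety subspace is never penalized, which is what lets the model specialize on the benign task while the boundary is held in place.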

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that fine-tuning purpose-built guard models (LlamaGuard, WildGuard, Granite Guardian) on entirely benign data destroys latent safety geometry—defined as SVD-derived subspaces from class-conditional activation differences—causing catastrophic loss of safety alignment without adversarial inputs. Granite Guardian exhibits complete collapse (refusal rate 85% to 0%, CKA to 0, 100% ambiguous outputs). The authors track this via CKA and Fisher scores, argue geometry metrics outperform absolute displacement, and propose Fisher-Weighted Safety Subspace Regularization (FW-SSR) with curvature-aware weights and adaptive λ_t scaled to task-safety gradient conflict; FW-SSR recovers 75% refusal on Granite (CKA=0.983) and lowers WildGuard ASR to 3.6%.

Significance. If the central claims hold, the work is significant for agentic AI safety: it identifies a previously under-appreciated brittleness in specialized guard models under standard domain adaptation and supplies both a diagnostic (geometry monitoring) and a mitigation (FW-SSR). Concrete multi-model results and a training-time intervention are strengths. The finding that structural metrics predict behavior better than displacement, if robust, would justify geometry-based evaluation protocols. The result is defensible but hinges on establishing that the extracted subspaces are causally operative rather than merely correlated.

major comments (3)
  1. §4 (Experimental Results) and associated tables: the reported collapse metrics (Granite refusal 85% → 0%, CKA → 0) and FW-SSR recovery (75% refusal, CKA = 0.983) are load-bearing for the specialization hypothesis, yet the manuscript provides no details on fine-tuning dataset composition/size, epoch count, learning-rate schedule, data splits, or statistical tests (e.g., confidence intervals or significance of CKA differences). Without these, reproducibility and the claim that benign fine-tuning alone suffices cannot be verified.
  2. §3.2 (Safety Subspace Extraction) and §5 (FW-SSR): the assertion that SVD-extracted safety subspaces causally guide classification and that their destruction produces the observed behavioral collapse rests on correlation (CKA/Fisher with refusal) plus the FW-SSR intervention. No ablation is reported that nulls, rotates, or perturbs only the top SVD directions while holding other parameters fixed, nor a controlled comparison showing displacement metrics fail to predict refusal after matching on task loss. This leaves open the possibility that geometry is a correlated symptom rather than the operative mechanism.
  3. §5 (FW-SSR Method), definition of λ_t: the adaptive λ_t is computed from the observed task-safety gradient conflict at each step. This introduces a potential circularity in the claim that FW-SSR "actively sharpens" the subspace; the regularization strength is defined in terms of the very conflict it is meant to resolve, complicating interpretation of why structural metrics outperform displacement.
minor comments (3)
  1. Abstract and §2: the statement that "structural representational geometry predicts safety behavior more reliably than absolute displacement metrics" is central yet lacks an explicit quantitative comparison (e.g., which displacement metrics were tested and what controls were applied).
  2. Notation throughout: the precise construction of class-conditional activation differences before SVD is described only at a high level; an explicit equation would improve clarity and allow readers to replicate the subspace extraction.
  3. Figure captions (e.g., those showing CKA trajectories): axis labels, legend entries, and error bars (if any) should be enlarged for readability; the current size makes it hard to assess the claimed separation between geometry and displacement curves.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas for improving reproducibility and strengthening causal evidence, which we address point-by-point below. We will revise the manuscript to incorporate the suggested additions while preserving the core claims supported by our multi-model experiments.

Point-by-point responses
  1. Referee: §4 (Experimental Results) and associated tables: the reported collapse metrics (Granite refusal 85%→0%, CKA→0) and FW-SSR recovery (75% refusal, CKA=0.983) are load-bearing for the specialization hypothesis, yet the manuscript provides no details on fine-tuning dataset composition/size, epoch count, learning-rate schedule, data splits, or statistical tests (e.g., confidence intervals or significance of CKA differences). Without these, reproducibility and the claim that benign fine-tuning alone suffices cannot be verified.

    Authors: We agree that the original manuscript omitted these procedural details, which are essential for reproducibility. In the revised version, we will add a new subsection in §4 (and update the associated tables and appendix) that fully specifies: the benign fine-tuning dataset composition and size (including sources and exact sample counts per model), epoch counts, learning-rate schedules with any warm-up or decay, train/validation/test splits with ratios, and statistical tests including 95% confidence intervals and significance testing for CKA and refusal rate differences across runs. This will directly support verification that benign fine-tuning alone induces the observed collapse. revision: yes

  2. Referee: §3.2 (Safety Subspace Extraction) and §5 (FW-SSR): the assertion that SVD-extracted safety subspaces causally guide classification and that their destruction produces the observed behavioral collapse rests on correlation (CKA/Fisher with refusal) plus the FW-SSR intervention. No ablation is reported that nulls, rotates, or perturbs only the top SVD directions while holding other parameters fixed, nor a controlled comparison showing displacement metrics fail to predict refusal after matching on task loss. This leaves open the possibility that geometry is a correlated symptom rather than the operative mechanism.

    Authors: We acknowledge that the current evidence is correlational plus interventional via FW-SSR, without explicit post-hoc ablations on the SVD directions. The FW-SSR results provide supporting evidence because the method explicitly regularizes the extracted safety subspace (via Fisher-weighted penalties) and produces both geometric recovery (CKA=0.983) and behavioral restoration (75% refusal), which displacement-based methods do not achieve. To strengthen the causal claim, the revision will include a new ablation experiment in §3.2/§5: we will null, rotate, or scale only the top-k SVD safety directions while freezing all other parameters and measure the resulting change in refusal rates and output ambiguity. We will also add a controlled analysis matching models on final task loss to compare predictive power of geometry metrics versus absolute displacement. revision: yes

  3. Referee: §5 (FW-SSR Method), definition of λ_t: the adaptive λ_t is computed from the observed task-safety gradient conflict at each step. This introduces a potential circularity for the claim that FW-SSR “actively sharpens” the subspace; the regularization strength is defined in terms of the very conflict it is meant to resolve, complicating interpretation of why structural metrics outperform displacement.

    Authors: The adaptive λ_t is intentionally defined from instantaneous gradient conflict to prevent over-regularization when task and safety objectives are already aligned, allowing the method to focus sharpening where needed. This is not circular because the sharpening effect is independently verified by the post-training improvements in CKA and Fisher scores (which exceed what static anchoring achieves) and by the reduction in WildGuard ASR below baseline. In the revision we will expand §5 with a mechanistic explanation, add training-curve plots of λ_t evolution alongside geometry metrics, and include a comparison to fixed-λ variants to clarify why the adaptive formulation yields superior structural preservation over displacement-only regularization. revision: partial
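The ablation the authors promise in response 2 has a simple mechanical core: project each activation off the top-k extracted safety directions and re-score refusal with everything else frozen. A sketch under our own naming (the paper specifies no such function; the causal test is whether classification degrades on the ablated activations):

```python
import numpy as np

def null_safety_directions(H, U):
    """Remove the component of each activation lying in the span of the
    safety directions U (d, k, orthonormal columns), leaving all other
    structure intact. H: (n, d) activations at one layer."""
    P = U @ U.T                 # projector onto the safety subspace
    return H - H @ P            # activations with the subspace nulled

rng = np.random.default_rng(2)
d, k = 16, 2
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal toy subspace
H = rng.normal(size=(10, d))
H_abl = null_safety_directions(H, U)
print(np.allclose(H_abl @ U, 0))               # True: no residual safety component
```

Rotation and rescaling variants replace the projector with an orthogonal map or a scalar on the in-subspace component; all three hold parameters fixed and intervene only on the hypothesized mechanism, which is the control the referee asked for.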

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent measurements and interventions

full rationale

The paper's derivation chain consists of (1) SVD extraction of safety subspaces from class-conditional activation differences, (2) empirical tracking of CKA/Fisher/refusal metrics under benign fine-tuning, and (3) introduction of FW-SSR as a regularization penalty whose adaptive lambda is a standard gradient-conflict heuristic. None of these steps reduce by construction to their inputs: the subspace is extracted from data, the collapse is observed in held-out refusal rates, and FW-SSR is shown to restore both geometry and behavior via explicit training runs. No self-citation chain, fitted parameter renamed as prediction, or ansatz smuggled via prior work appears in the provided text. The superiority claim for structural metrics is supported by cross-model comparisons rather than post-hoc redefinition. The analysis is therefore self-contained against external benchmarks (pre/post fine-tuning refusal, ASR) and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that safety subspaces (measured via SVD) are the load-bearing structure for refusal behavior, plus one adaptive free parameter in the mitigation and the postulated entity of safety geometry.

free parameters (1)
  • λ_t
    Adaptive scaling factor that grows with task-safety gradient conflict, introduced to balance the regularization against the task objective during fine-tuning.
axioms (1)
  • Domain assumption: safety subspaces extracted via SVD on class-conditional activation differences represent the structured harmful-benign representational boundary that guides classification.
    Invoked to explain the mechanism of collapse and to justify tracking CKA and Fisher metrics.
invented entities (1)
  • safety geometry / safety subspaces (no independent evidence)
    purpose: To explain why benign fine-tuning destroys safety alignment and to motivate the regularization approach.
    Measured via SVD but treated as the causal latent structure; no independent falsifiable handle provided outside the paper's metrics.

pith-pipeline@v0.9.0 · 5622 in / 1551 out tokens · 51071 ms · 2026-05-10T18:38:12.853887+00:00 · methodology

