When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
Pith reviewed 2026-05-10 18:38 UTC · model grok-4.3
The pith
Fine-tuning guard models on benign data destroys their safety alignment by collapsing internal representational boundaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guard models can lose all safety alignment when fine-tuned on entirely benign data: the latent safety geometry, defined as the harmful-benign representational boundary, is destroyed by standard domain specialization. This is evidenced by Granite Guardian's refusal rate dropping from 85% to 0%, CKA similarity falling to zero, and all outputs becoming ambiguous. The specialization hypothesis explains this as a trade-off: concentrated safety features enable efficiency but lead to catastrophic brittleness. A mitigation called Fisher-Weighted Safety Subspace Regularization recovers most of the refusal behavior by penalizing changes to the safety directions.
What carries the argument
The safety subspace obtained from singular value decomposition on class-conditional activation differences, which encodes the boundary used for safety classification and whose integrity determines refusal performance.
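For concreteness, a minimal sketch of this extraction, assuming per-layer activations have already been collected into (n, d) matrices; the differencing and centering conventions, and all names, are illustrative rather than the paper's:

```python
import numpy as np

def safety_subspace(h_harm: np.ndarray, h_benign: np.ndarray, k: int = 8):
    """Top-k right singular vectors of class-conditional activation
    differences at one layer; h_harm, h_benign are (n, d) matrices."""
    # One simple reading of "class-conditional differences": each harmful
    # activation minus the benign class mean, centered before the SVD.
    diff = h_harm - h_benign.mean(axis=0, keepdims=True)
    diff -= diff.mean(axis=0, keepdims=True)
    _, sigma, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k], sigma[:k]   # (k, d) orthonormal basis and its spectrum
```

Tracking how this basis rotates under benign fine-tuning (via principal angles or CKA on the same prompts) is then the geometric monitoring the paper argues for.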
If this is right
- Metrics based on representational geometry, such as centered kernel alignment (CKA) and Fisher information scores, forecast safety retention more accurately than simple measures of parameter change (a minimal CKA sketch follows this list).
- Applying curvature-aware regularization during fine-tuning can restore high refusal rates and near-original geometry in affected models.
- Agentic AI pipelines require ongoing geometric monitoring of their guard layers to ensure continued protection.
- Purpose-built safety classifiers exhibit greater vulnerability to this collapse compared to general-purpose models due to their focused representations.
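Since CKA carries much of the predictive weight here, a self-contained sketch of the standard linear variant (the Kornblith et al. formulation; the paper may use a kernelized version):

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n, d) activation matrices collected on the
    same inputs, e.g., before vs. after benign fine-tuning."""
    x = x - x.mean(axis=0)   # center each feature
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2   # shared covariance mass
    return cross / (np.linalg.norm(x.T @ x, "fro") *
                    np.linalg.norm(y.T @ y, "fro"))
```

A value near 1 means the representation has kept its shape on those inputs; the Granite Guardian collapse reported above corresponds to this statistic falling to zero.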
Where Pith is reading between the lines
- This vulnerability implies that safety training for specialized models may need to incorporate explicit preservation of key subspaces from the start.
- Similar geometric collapse could affect other safety mechanisms in AI, suggesting broader use of subspace monitoring techniques.
- Testing the approach on additional models and fine-tuning scenarios would help determine how general the collapse phenomenon is.
Load-bearing premise
The assumption that the safety subspaces identified by SVD on activation differences are causally responsible for the model's safety classifications.
What would settle it
A replication in which benign fine-tuning is applied to one of the tested guard models and refusal rates stay near the original level, with no drop in geometric similarity metrics, would refute the collapse claim.
original abstract
A guard model fine-tuned on entirely benign data can lose all safety alignment -- not through adversarial manipulation, but through standard domain specialization. We demonstrate this failure across three purpose-built safety classifiers -- LlamaGuard, WildGuard, and Granite Guardian -- deployed as protection layers in agentic AI pipelines, and show that it originates in the destruction of latent safety geometry: the structured harmful-benign representational boundary that guides classification. We extract per-layer safety subspaces via SVD on class-conditional activation differences and track how this boundary evolves under benign fine-tuning. Granite Guardian undergoes complete collapse -- refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous -- a severity exceeding prior findings on general-purpose LLMs, explained by the specialization hypothesis: concentrated safety representations are efficient but catastrophically brittle. To mitigate this, we propose Fisher-Weighted Safety Subspace Regularization (FW-SSR), a training-time penalty combining (i) curvature-aware direction weights derived from diagonal Fisher information and (ii) an adaptive $\lambda_t$ that scales with task-safety gradient conflict. FW-SSR recovers 75% refusal on Granite Guardian (CKA = 0.983) and reduces WildGuard's Attack Success Rate to 3.6% -- below the unmodified baseline -- by actively sharpening the safety subspace rather than merely anchoring it. Across all three models, structural representational geometry (CKA, Fisher score) predicts safety behavior more reliably than absolute displacement metrics, establishing geometry-based monitoring as a necessary component of guard model evaluation in agentic deployments.
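The abstract names FW-SSR's two ingredients but not its exact form. One plausible reading, sketched below with all names hypothetical, penalizes activation drift along the extracted safety directions in proportion to their diagonal Fisher weights:

```python
import torch

def fwssr_penalty(h_now: torch.Tensor, h_ref: torch.Tensor,
                  v: torch.Tensor, fisher: torch.Tensor) -> torch.Tensor:
    """Curvature-weighted drift penalty (a sketch, not the paper's exact
    loss). h_now/h_ref: (batch, d) activations now vs. at the frozen
    pre-fine-tuning reference; v: (k, d) safety basis; fisher: (k,)
    diagonal Fisher weights for the k directions."""
    drift = (h_now - h_ref) @ v.T           # (batch, k) drift per direction
    return (fisher * drift.pow(2)).mean()   # heavier cost on sharp directions

# total_loss = task_loss + lambda_t * fwssr_penalty(h_now, h_ref, v, fisher)
```

Under this reading, directions with high Fisher information (sharp curvature of the safety loss) are the most expensive to move, which is what "curvature-aware" suggests; the adaptive $\lambda_t$ then scales the whole term with task-safety gradient conflict.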
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that fine-tuning purpose-built guard models (LlamaGuard, WildGuard, Granite Guardian) on entirely benign data destroys latent safety geometry—defined as SVD-derived subspaces from class-conditional activation differences—causing catastrophic loss of safety alignment without adversarial inputs. Granite Guardian exhibits complete collapse (refusal rate 85% to 0%, CKA to 0, 100% ambiguous outputs). The authors track this via CKA and Fisher scores, argue geometry metrics outperform absolute displacement, and propose Fisher-Weighted Safety Subspace Regularization (FW-SSR) with curvature-aware weights and adaptive λ_t scaled to task-safety gradient conflict; FW-SSR recovers 75% refusal on Granite (CKA=0.983) and lowers WildGuard ASR to 3.6%.
Significance. If the central claims hold, the work is significant for agentic AI safety: it identifies a previously under-appreciated brittleness in specialized guard models under standard domain adaptation and supplies both a diagnostic (geometry monitoring) and a mitigation (FW-SSR). Concrete multi-model results and a training-time intervention are strengths. The finding that structural metrics predict behavior better than displacement, if robust, would justify geometry-based evaluation protocols. The result is defensible but hinges on establishing that the extracted subspaces are causally operative rather than merely correlated.
major comments (3)
- §4 (Experimental Results) and associated tables: the reported collapse metrics (Granite refusal 85%→0%, CKA→0) and FW-SSR recovery (75% refusal, CKA=0.983) are load-bearing for the specialization hypothesis, yet the manuscript provides no details on fine-tuning dataset composition/size, epoch count, learning-rate schedule, data splits, or statistical tests (e.g., confidence intervals or significance of CKA differences). Without these, reproducibility and the claim that benign fine-tuning alone suffices cannot be verified.
- §3.2 (Safety Subspace Extraction) and §5 (FW-SSR): the assertion that SVD-extracted safety subspaces causally guide classification and that their destruction produces the observed behavioral collapse rests on correlation (CKA/Fisher with refusal) plus the FW-SSR intervention. No ablation is reported that nulls, rotates, or perturbs only the top SVD directions while holding other parameters fixed, nor a controlled comparison showing displacement metrics fail to predict refusal after matching on task loss. This leaves open the possibility that geometry is a correlated symptom rather than the operative mechanism.
- §5 (FW-SSR Method), definition of λ_t: the adaptive λ_t is computed from the observed task-safety gradient conflict at each step. This introduces a potential circularity for the claim that FW-SSR “actively sharpens” the subspace; the regularization strength is defined in terms of the very conflict it is meant to resolve, complicating interpretation of why structural metrics outperform displacement.
minor comments (3)
- Abstract and §2: the statement that “structural representational geometry predicts safety behavior more reliably than absolute displacement metrics” is central yet lacks an explicit quantitative comparison (e.g., which displacement metrics were tested and what controls were applied).
- Notation throughout: the precise construction of class-conditional activation differences before SVD is described only at a high level; an explicit equation would improve clarity and allow readers to replicate the subspace extraction (one candidate construction is sketched after this list).
- Figure captions (e.g., those showing CKA trajectories): axis labels, legend entries, and error bars (if any) should be enlarged for readability; the current size makes it hard to assess the claimed separation between the geometry and displacement curves.
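To illustrate what such an equation might look like, here is one explicit construction consistent with the abstract's description; the centering and differencing conventions are assumptions, not the paper's verbatim definition:

```latex
% One candidate construction; an assumption, not the manuscript's definition.
\[
  \bar h_\ell^{\mathrm{ben}} = \frac{1}{m}\sum_{j=1}^{m} h_\ell\big(x_j^{\mathrm{ben}}\big),
  \qquad
  D_\ell =
  \begin{bmatrix}
    \big(h_\ell(x_1^{\mathrm{harm}}) - \bar h_\ell^{\mathrm{ben}}\big)^{\top} \\
    \vdots \\
    \big(h_\ell(x_n^{\mathrm{harm}}) - \bar h_\ell^{\mathrm{ben}}\big)^{\top}
  \end{bmatrix}
  \in \mathbb{R}^{n \times d},
\]
\[
  D_\ell = U_\ell \Sigma_\ell V_\ell^{\top},
  \qquad
  S_\ell = \operatorname{span}\{v_{\ell,1}, \dots, v_{\ell,k}\},
\]
```

where $h_\ell(x) \in \mathbb{R}^d$ is the layer-$\ell$ activation and the top-$k$ right singular vectors $v_{\ell,i}$ span the safety subspace $S_\ell$.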
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important areas for improving reproducibility and strengthening causal evidence, which we address point-by-point below. We will revise the manuscript to incorporate the suggested additions while preserving the core claims supported by our multi-model experiments.
point-by-point responses
Referee: §4 (Experimental Results) and associated tables: the reported collapse metrics (Granite refusal 85%→0%, CKA→0) and FW-SSR recovery (75% refusal, CKA=0.983) are load-bearing for the specialization hypothesis, yet the manuscript provides no details on fine-tuning dataset composition/size, epoch count, learning-rate schedule, data splits, or statistical tests (e.g., confidence intervals or significance of CKA differences). Without these, reproducibility and the claim that benign fine-tuning alone suffices cannot be verified.
Authors: We agree that the original manuscript omitted these procedural details, which are essential for reproducibility. In the revised version, we will add a new subsection in §4 (and update the associated tables and appendix) that fully specifies: the benign fine-tuning dataset composition and size (including sources and exact sample counts per model), epoch counts, learning-rate schedules with any warm-up or decay, train/validation/test splits with ratios, and statistical tests including 95% confidence intervals and significance testing for CKA and refusal rate differences across runs. This will directly support verification that benign fine-tuning alone induces the observed collapse. revision: yes
Referee: §3.2 (Safety Subspace Extraction) and §5 (FW-SSR): the assertion that SVD-extracted safety subspaces causally guide classification and that their destruction produces the observed behavioral collapse rests on correlation (CKA/Fisher with refusal) plus the FW-SSR intervention. No ablation is reported that nulls, rotates, or perturbs only the top SVD directions while holding other parameters fixed, nor a controlled comparison showing displacement metrics fail to predict refusal after matching on task loss. This leaves open the possibility that geometry is a correlated symptom rather than the operative mechanism.
Authors: We acknowledge that the current evidence is correlational plus interventional via FW-SSR, without explicit post-hoc ablations on the SVD directions. The FW-SSR results provide supporting evidence because the method explicitly regularizes the extracted safety subspace (via Fisher-weighted penalties) and produces both geometric recovery (CKA=0.983) and behavioral restoration (75% refusal), which displacement-based methods do not achieve. To strengthen the causal claim, the revision will include a new ablation experiment in §3.2/§5: we will null, rotate, or scale only the top-k SVD safety directions while freezing all other parameters and measure the resulting change in refusal rates and output ambiguity. We will also add a controlled analysis matching models on final task loss to compare predictive power of geometry metrics versus absolute displacement. revision: yes
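A sketch of the kind of direction-nulling ablation the authors describe, with hypothetical names; only the activations are edited, every parameter stays frozen:

```python
import torch

def null_safety_directions(h: torch.Tensor, v_topk: torch.Tensor) -> torch.Tensor:
    """Project activations off the top-k safety directions.
    h: (batch, d) layer output; v_topk: (k, d) orthonormal safety basis."""
    coords = h @ v_topk.T        # coordinates inside the safety subspace
    return h - coords @ v_topk   # activation with that component removed

# Attached as a forward hook on the probed layer, so refusal rates can be
# re-measured with everything else held fixed:
# layer.register_forward_hook(lambda mod, inp, out: null_safety_directions(out, v_topk))
```

If refusal collapses only when these k directions are removed, the subspace is operative rather than a correlated symptom, which is exactly the distinction the referee presses on.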
Referee: §5 (FW-SSR Method), definition of λ_t: the adaptive λ_t is computed from the observed task-safety gradient conflict at each step. This introduces a potential circularity for the claim that FW-SSR “actively sharpens” the subspace; the regularization strength is defined in terms of the very conflict it is meant to resolve, complicating interpretation of why structural metrics outperform displacement.
Authors: The adaptive λ_t is intentionally defined from instantaneous gradient conflict to prevent over-regularization when task and safety objectives are already aligned, allowing the method to focus sharpening where needed. This is not circular because the sharpening effect is independently verified by the post-training improvements in CKA and Fisher scores (which exceed what static anchoring achieves) and by the reduction in WildGuard ASR below baseline. In the revision we will expand §5 with a mechanistic explanation, add training-curve plots of λ_t evolution alongside geometry metrics, and include a comparison to fixed-λ variants to clarify why the adaptive formulation yields superior structural preservation over displacement-only regularization. revision: partial
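For readers weighing the circularity point, one plausible form of the adaptive weight (the manuscript's exact formula is not reproduced in this review): λ_t grows with the cosine conflict between task and safety-penalty gradients, so the regularizer is strong only when the two objectives pull apart.

```python
import torch

def adaptive_lambda(task_grads, safety_grads, lam_min=0.0, lam_max=1.0):
    """Hypothetical gradient-conflict schedule for lambda_t (a sketch, not
    the paper's definition). Inputs are lists of per-parameter gradients."""
    g_t = torch.cat([g.flatten() for g in task_grads])
    g_s = torch.cat([g.flatten() for g in safety_grads])
    cos = torch.nn.functional.cosine_similarity(g_t, g_s, dim=0)
    conflict = torch.clamp(-cos, min=0.0)   # positive only when gradients oppose
    return lam_min + (lam_max - lam_min) * conflict.item()
```

Under this reading, λ_t is a feedback controller on the penalty rather than part of the evidence, which is consistent with the authors' reply that the sharpening effect is verified by independent post-training measurements.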
Circularity Check
No significant circularity; empirical claims rest on independent measurements and interventions
full rationale
The paper's derivation chain consists of (1) SVD extraction of safety subspaces from class-conditional activation differences, (2) empirical tracking of CKA/Fisher/refusal metrics under benign fine-tuning, and (3) introduction of FW-SSR as a regularization penalty whose adaptive λ_t is a standard gradient-conflict heuristic. None of these steps reduces by construction to its inputs: the subspace is extracted from data, the collapse is observed in held-out refusal rates, and FW-SSR is shown to restore both geometry and behavior via explicit training runs. No self-citation chain, fitted parameter renamed as prediction, or ansatz smuggled via prior work appears in the provided text. The superiority claim for structural metrics is supported by cross-model comparisons rather than post-hoc redefinition. The analysis is therefore self-contained against external benchmarks (pre/post fine-tuning refusal, ASR) and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- λ_t (adaptive regularization strength, scaled by task-safety gradient conflict)
axioms (1)
- domain assumption: safety subspaces extracted via SVD on class-conditional activation differences represent the structured harmful-benign representational boundary that guides classification.
invented entities (1)
- safety geometry / safety subspaces (no independent evidence)