pith. machine review for the scientific record.

arxiv: 2604.24902 · v1 · submitted 2026-04-27 · 💻 cs.CY · cs.SE

Recognition: unknown

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 17:48 UTC · model grok-4.3

classification 💻 cs.CY cs.SE
keywords safety drift · fine-tuning · foundation models · safety benchmarks · high-stakes domains · medical AI · legal AI · model evaluation

The pith

Benign fine-tuning produces large, inconsistent shifts in measured safety across benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the safety properties of foundation models persist after they are adapted for specific uses such as medicine or law. It compares safety scores of base models against their fine-tuned counterparts across 100 models, spanning widely deployed medical and legal fine-tunes as well as controlled adaptations of open foundation models, using both general and domain-specific benchmarks. The analysis reveals that fine-tuning triggers substantial changes that vary widely by test: models often score better on one measure while worsening on another, and evaluations frequently disagree with one another. A reader would care because many current safety assurances and deployment decisions rest on checks performed only on the unmodified base models. This instability suggests that risks in real-world high-stakes applications may not be captured by pre-adaptation evaluations alone.

Core claim

Across general-purpose and domain-specific safety benchmarks, benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk, overlooking practical sources of harm.

What carries the argument

Comparative safety-benchmark scoring between base foundation models and their fine-tuned counterparts across 100 models in medical and legal domains.
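
To make that mechanism concrete, the comparison reduces to per-benchmark score deltas between each base model and its fine-tune. A minimal sketch follows; the benchmark names, the percentage-point unsafe-rate scale, and the example numbers are illustrative assumptions, not the authors' harness or data.

```python
# Minimal sketch of the base-vs-fine-tune comparison described above.
# Benchmark names, the percentage-point "unsafe rate" scale, and the
# example scores are illustrative assumptions, not the paper's data.

def safety_drift(base_scores: dict[str, float],
                 ft_scores: dict[str, float]) -> dict[str, float]:
    """Per-benchmark change in unsafe-response rate, in percentage
    points; positive means the fine-tune scored less safe."""
    return {b: ft_scores[b] - base_scores[b] for b in base_scores}

def is_heterogeneous(drift: dict[str, float]) -> bool:
    """The paper's headline pattern: the same fine-tune improves on
    some instruments while degrading on others."""
    return min(drift.values()) < 0 < max(drift.values())

# Hypothetical scores for one base/fine-tune pair:
base = {"general": 4.0, "medical": 6.5, "legal": 3.2}
tuned = {"general": 9.1, "medical": 2.8, "legal": 7.7}
drift = safety_drift(base, tuned)
print(drift, is_heterogeneous(drift))  # contradictory shifts -> True
```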

If this is right

  • Governance and deployment decisions that rely on base-model safety checks will miss risks introduced by ordinary fine-tuning.
  • Fine-tuned models in medical and legal domains require fresh safety evaluation in contexts that match their intended use.
  • Accountability mechanisms built around pre-adaptation assessments are inadequate for managing downstream harm.
  • Practical sources of failure in high-stakes applications can be overlooked when safety is treated as fixed after the base model stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety may need to be monitored continuously after any adaptation rather than treated as a one-time property.
  • Regulators could require post-fine-tuning audits specifically for high-stakes domains to close the gap between base-model assurances and deployed behavior.
  • Certain fine-tuning techniques might reduce the scale of these shifts, suggesting an avenue for targeted mitigation research.

Load-bearing premise

The chosen safety benchmarks accurately capture the risks that actually arise when the fine-tuned models are used in high-stakes deployments.

What would settle it

Re-testing the identical set of models with the same benchmarks and obtaining either uniformly preserved safety scores or changes that move in the same direction on every instrument would contradict the reported pattern of large heterogeneous shifts.
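
Operationally, that falsification test amounts to checking that every base/fine-tune pair shows deltas that are either all negligible or all of one sign. A sketch, with the negligibility tolerance as an assumed choice rather than a value from the paper:

```python
# Sketch of the falsification check described above. `drifts` holds one
# {benchmark: delta_pp} dict per base/fine-tune pair; the tolerance for
# "uniformly preserved" is an assumed choice, not from the paper.

def contradicts_reported_pattern(drifts: list[dict[str, float]],
                                 tol: float = 1.0) -> bool:
    for drift in drifts:
        # Signs of all non-negligible per-instrument deltas for one pair.
        signs = {d > 0 for d in drift.values() if abs(d) > tol}
        if len(signs) > 1:
            # Mixed directions on one model: the reported heterogeneous
            # pattern reproduces, so it is not contradicted.
            return False
    # Every pair was preserved or moved uniformly in one direction.
    return True
```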

Figures

Figures reproduced from arXiv: 2604.24902 by Amy Winecoff, Dylan Hadfield-Menell, Emaan Bilal Khan, Miranda Bogen.

Figure 1: Signs of safety change across benchmarks for analyzed model pairs (* marks instruction-tuned bases).
Figure 2: Safety distribution across fine-tuning lineages. Each box represents the interquartile range of safety drift at a given …
Figure 3: Directional & magnitude safety drift across the legal model analysis.
Figure 4: Safety change vs. parameter shift under fine-tuning. Across both domains and benchmarks, safety outcomes vary …
Figure 5: Safety drift (Δ unsafe rate, pp) after legal fine-tuning, segmented by model, fine-tuning method, and benchmark. Safety effects vary by base model and method, with no tuning approach consistently preserving alignment. The persistence of this trend under controlled conditions across both domain settings suggests that non-monotonic alignment changes are a general feature of domain adaptation rather …
Figure 6: Sensitivity of safety measurements to evaluation setup. The plots show how modifying the judging template alters …
read the original abstract

Foundation models are routinely fine-tuned for use in particular domains, yet safety assessments are typically conducted only on base models, implicitly assuming that safety properties persist through downstream adaptation. We test this assumption by analyzing the safety behavior of 100 models, including widely deployed fine-tunes in the medical and legal domains as well as controlled adaptations of open foundation models alongside their bases. Across general-purpose and domain-specific safety benchmarks, we find that benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk, overlooking practical sources of harm -- failures that are especially consequential in high-stakes settings and challenge current accountability paradigms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an empirical analysis of safety behavior in 100 foundation models, including deployed fine-tunes in medical and legal domains and controlled adaptations of open models. It finds that benign fine-tuning produces large, heterogeneous, and often contradictory shifts across general-purpose and domain-specific safety benchmarks, with models improving on some instruments while degrading on others. The authors conclude that safety properties are not stable under ordinary downstream adaptation and that governance practices relying on base-model evaluations are therefore inadequate for managing downstream risk in high-stakes settings.

Significance. If the observed benchmark changes prove robust and interpretable as genuine safety shifts, the work would be significant for AI governance and deployment standards. It supplies concrete evidence against the common assumption that safety characteristics persist through fine-tuning, directly challenging accountability frameworks that certify models only at the base stage. The scale of the study (100 models across domains) and the documentation of contradictory movements across instruments are strengths that could inform more rigorous post-adaptation evaluation requirements.

major comments (2)
  1. [Methods] Methods section (benchmark description): The central claim that fine-tuning induces 'safety drift' rests on the assumption that score changes on the selected general-purpose and domain-specific instruments reflect deployment-relevant safety properties in medicine and law. The manuscript provides no validation of these benchmarks against real-world failure modes, expert judgments, or incident data; without such grounding, the heterogeneous and contradictory movements could arise from prompt sensitivity or dataset artifacts rather than substantive capability changes, weakening the inference that base-model evaluations are inadequate.
  2. [Model selection and results] Model selection and results sections: The analysis of 100 models is presented as representative of typical fine-tuning practices in high-stakes domains, yet the criteria for choosing the specific fine-tunes (e.g., training data composition, scale, or selection process) are not detailed. This leaves open the possibility that the observed instability is driven by non-representative cases rather than ordinary adaptation, which is load-bearing for the generalizability of the governance implications.
minor comments (1)
  1. [Abstract] The abstract would benefit from brief quantitative summaries (e.g., average magnitude of score changes or fraction of models showing degradation on at least one benchmark) to convey the scale of the phenomenon more precisely.
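
For concreteness, both suggested summaries fall out of the same per-pair deltas. A sketch under an assumed data layout (one {benchmark: delta} dict per base/fine-tune pair), not the authors' aggregation code:

```python
# Sketch of the quantitative summaries suggested above, assuming one
# {benchmark: delta_pp} dict per base/fine-tune pair and the convention
# that positive deltas mean degradation. Not the authors' code.
from statistics import mean

def headline_summaries(drifts: list[dict[str, float]]) -> tuple[float, float]:
    all_deltas = [d for drift in drifts for d in drift.values()]
    # Average magnitude of score change across all model-benchmark cells.
    avg_magnitude = mean(abs(d) for d in all_deltas)
    # Fraction of models degrading on at least one benchmark.
    frac_degrading = mean(
        any(d > 0 for d in drift.values()) for drift in drifts
    )
    return avg_magnitude, frac_degrading
```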

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful review of our manuscript on safety drift in fine-tuned foundation models. Below, we provide point-by-point responses to the major comments and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section (benchmark description): The central claim that fine-tuning induces 'safety drift' rests on the assumption that score changes on the selected general-purpose and domain-specific instruments reflect deployment-relevant safety properties in medicine and law. The manuscript provides no validation of these benchmarks against real-world failure modes, expert judgments, or incident data; without such grounding, the heterogeneous and contradictory movements could arise from prompt sensitivity or dataset artifacts rather than substantive capability changes, weakening the inference that base-model evaluations are inadequate.

    Authors: We concur that explicit validation of the benchmarks against real-world outcomes would provide stronger grounding. The benchmarks employed are standard instruments in the AI safety community for evaluating safety in general and specialized domains. Our primary observation is the lack of stability and the contradictory shifts across these instruments following fine-tuning, which highlights the inadequacy of relying solely on base-model assessments regardless of individual benchmark validity. In the revised manuscript, we have added a dedicated paragraph in the Methods section discussing benchmark limitations, including potential sensitivities to prompting and data artifacts, and we reference studies examining their alignment with practical safety metrics. This addition clarifies the scope of our inferences. revision: partial

  2. Referee: [Model selection and results] Model selection and results sections: The analysis of 100 models is presented as representative of typical fine-tuning practices in high-stakes domains, yet the criteria for choosing the specific fine-tunes (e.g., training data composition, scale, or selection process) are not detailed. This leaves open the possibility that the observed instability is driven by non-representative cases rather than ordinary adaptation, which is load-bearing for the generalizability of the governance implications.

    Authors: The selection of the 100 models was guided by the goal of including both widely used deployed fine-tunes and controlled experiments. We have expanded the Model Selection subsection to explicitly list the criteria: (1) for deployed models, inclusion of commercially available fine-tunes in medicine and law with public documentation of their adaptation process; (2) for controlled adaptations, fine-tuning open models on standard domain datasets without safety-specific interventions; and (3) ensuring diversity in model scale and provider. This information was partially present but has been elaborated with additional details on data composition and selection process to demonstrate representativeness of ordinary fine-tuning practices. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of benchmark scores

full rationale

The paper reports an observational study measuring safety benchmark deltas across 100 models (base vs. fine-tuned) in medical and legal domains. No equations, derivations, fitted parameters, or predictions appear in the provided text. Central claims rest on raw observed score changes and disagreement across instruments, with no reduction of any result to its own inputs by construction. No self-citation chains or ansatzes are invoked to justify the methodology or conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the safety benchmarks used are valid proxies for real deployment risks and that the selected models represent ordinary fine-tuning practice; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Safety benchmarks accurately reflect deployment-relevant safety properties.
    The paper measures safety drift exclusively through changes on these benchmarks.

pith-pipeline@v0.9.0 · 5471 in / 1224 out tokens · 61415 ms · 2026-05-07T17:48:25.330688+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 14 canonical work pages

  1. [1]

    Unforgotten safety: Preserving safety alignment of large language models with continual learning, 2025

    Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud, Philip Torr, Adel Bibi, and Bernard Ghanem. Unforgotten safety: Preserving safety alignment of large language models with continual learning, 2025

  2. [2]

    Unpacking trust dynamics in the llm supply chain: An empirical exploration to foster trustworthy llm production & use

Agathe Balayn, Mireia Yurrita, Fanny Rancourt, Fabio Casati, and Ujwal Gadiraju. Unpacking trust dynamics in the llm supply chain: An empirical exploration to foster trustworthy llm production & use. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

  3. [3]

    Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May...

  4. [4]

Emergent misalignment: Narrow fine-tuning can produce broadly misaligned llms, 2025

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow fine-tuning can produce broadly misaligned llms, 2025

  5. [5]

Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv preprint arXiv:2310.14303, 2023

Rishabh Bhardwaj and Soujanya Poria. Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv preprint arXiv:2310.14303, 2023

  6. [6]

    Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets

Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International J...

  7. [7]

Rishi Bommasani, Samuel R. Singer, Ryan E. Appel, Shumin Cen, Andrew F. Cooper, Emma Cryst, Lauren A. Gailmard, Ian Klaus, Michael M. Lee, Inioluwa Deborah Raji, Adam Reuel, Daniel Spence, Angela Wan, Alex Wang, David Zhang, Daniel E. Ho, Percy Liang, Dawn Song, Joseph E. Gonzalez, Jonathan Zittrain, Jennifer T. Chayes, Mariano-Florentino Cuéllar, and L...

  8. [8]

    SafeLawBench: Towards safe alignment of large language models

Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Zhenghao Zhu, Wu Yinyu, Josef Dai, Yaodong Yang, Sirui Han, and Yike Guo. SafeLawBench: Towards safe alignment of large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 14015–14048,...

  9. [9]

    Association for Computational Linguistics

  10. [10]

    Cares: Comprehensive evaluation of safety and adversarial robustness in medical llms, 2025

    Sijia Chen, Xiaomin Li, Mengxue Zhang, Eric Hanchen Jiang, Qingcheng Zeng, and Chen-Hsiang Yu. Cares: Comprehensive evaluation of safety and adversarial robustness in medical llms, 2025

  11. [11]

Robert J. Couture. The impact of artificial intelligence on law firms’ business models. Harvard Law School Center on the Legal Profession, Insights, 2025. Qualitative study of AI adoption and business models in AmLaw 100 firms

  12. [12]

    Qlora: Efficient finetuning of quantized llms, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023

  13. [13]

    The potential for jurisdictional challenges to ai or llm training datasets

Chris Draper and Nicky Gillibrand. The potential for jurisdictional challenges to ai or llm training datasets. In AI4AJ@ICAIL, 2023

  14. [14]

    Modifying ai under the eu ai act: Lessons from practice on classification and compliance, November 2025

    Øystein Endal, Andrea Vcric, Sidsel Nag, Nick Malter, and Daylan Araz. Modifying ai under the eu ai act: Lessons from practice on classification and compliance, November 2025

  15. [15]

    Guidelines on the scope of obligations for providers of general-purpose ai models under the ai act

European Commission. Guidelines on the scope of obligations for providers of general-purpose ai models under the ai act. https://digital-strategy.ec.europa.eu/en/library/guidelines-scope-obligations-providers-general-purpose-ai-models-under-ai-act, July 2025. Accessed: April 29, 2026

  16. [16]

    Article 11: Technical documentation

European Parliament and Council of the European Union. Article 11: Technical documentation. In Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), number EU 2024/1689. Official Journal of the European Union, 2024. Accessed: 2025-10-27

  17. [17]

Medsafetybench: Evaluating and improving the medical safety of large language models

Tessa Han, Aounon Kumar, Chirag Agarwal, and Himabindu Lakkaraju. Medsafetybench: Evaluating and improving the medical safety of large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 33423–33454. Curran Associates, Inc., 2024

  18. [18]

Medical foundation models are susceptible to targeted misinformation attacks, 2023

    Tianyu Han, Sven Nebelung, Firas Khader, Tianci Wang, Gustav Mueller-Franzes, Christiane Kuhl, Sebastian Försch, Jens Kleesiek, Christoph Haarburger, Keno K. Bressem, Jakob Nikolas Kather, and Daniel Truhn. Medical foundation models are susceptible to targeted misinformation attacks, 2023

  19. [19]

The effect of fine-tuning on language model toxicity

Will Hawkins, Brent Mittelstadt, and Chris Russell. The effect of fine-tuning on language model toxicity. arXiv preprint arXiv:2410.15821, 2024

  20. [20]

    What is in your safe data? identifying benign data that breaks safety, 2024

    Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? identifying benign data that breaks safety, 2024

  21. [21]

What’s in your “safe” data?: Identifying benign data that breaks safety. arXiv preprint arXiv:2404.01099, 2024

Luxi He, Mengzhou Xia, and Peter Henderson. What’s in your “safe” data?: Identifying benign data that breaks safety. arXiv preprint, arXiv:2404.01099, 2024

  22. [22]

Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models

Peter Henderson et al. Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models. arXiv preprint arXiv:2308.11462, 2023

  23. [23]

    Self-destructing models: Increasing the costs of harmful dual uses of foundation models

Peter Henderson, Eric Mitchell, Christopher Manning, Dan Jurafsky, and Chelsea Finn. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 287–296, 2023

  24. [24]

Lora: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021

  25. [25]

    Harmful fine-tuning attacks and defenses for large language models: A survey, 2024

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Harmful fine-tuning attacks and defenses for large language models: A survey, 2024

  26. [26]

    Parameter-efficient fine-tuning (peft) for large language models

    Hugging Face. Parameter-efficient fine-tuning (peft) for large language models. https://huggingface.co/blog/peft, 2023. Blog post introducing the PEFT library and its support for methods like LoRA and QLoRA

  27. [27]

    Peft: Parameter-efficient fine-tuning of transformers

Hugging Face. Peft: Parameter-efficient fine-tuning of transformers. https://github.com/huggingface/peft, 2023. Python library supporting LoRA, QLoRA, and related PEFT methods

  28. [28]

Increased llm vulnerabilities from fine-tuning and quantization. arXiv preprint, arXiv:2404.04392, 2024

Divyanshu Kumar, Anurakt Kumar, Sahil Agarwal, and Prashanth Harshangi. Increased llm vulnerabilities from fine-tuning and quantization. arXiv preprint, arXiv:2404.04392, 2024

  29. [29]

    When do pre-training biases propagate to downstream tasks? a case study in text summarization

Faisal Ladhak, Esin Durmus, Mirac Suzgun, Tianyi Zhang, Dan Jurafsky, Kathleen McKeown, and Tatsunori B Hashimoto. When do pre-training biases propagate to downstream tasks? a case study in text summarization. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3206–3219, 2023

  30. [30]

Anatomy of a machine learning ecosystem: 2 million models on hugging face. arXiv preprint arXiv:2508.06811, 2025

Benjamin Laufer, Hamidah Oderinwale, and Jon Kleinberg. Anatomy of a machine learning ecosystem: 2 million models on hugging face. arXiv preprint arXiv:2508.06811, 2025

  31. [31]

J. S. Lehmann et al. Implementing large language models in healthcare while balancing innovation, privacy, and safety. PMC / NPJ Digital Medicine, PMC11885444, March 2025

  32. [32]

    Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b

Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. arXiv preprint, arXiv:2310.20624, 2023

  33. [33]

    Survey reveals how gen ai is reshaping law, 2024

    LexisNexis. Survey reveals how gen ai is reshaping law, 2024. Global legal AI adoption and investment survey

  34. [34]

    D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, others, and H. Liu. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, November 2025

  35. [35]

Layer-aware representation filtering: Purifying finetuning data to preserve LLM safety alignment

Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, and Lei Sha. Layer-aware representation filtering: Purifying finetuning data to preserve LLM safety alignment. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, page...

  36. [36]

    Rethinking open source generative ai: open-washing and the eu ai act

Andreas Liesenfeld and Mark Dingemanse. Rethinking open source generative ai: open-washing and the eu ai act. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, page 1774–1787, New York, NY, USA, 2024. Association for Computing Machinery

  37. [37]

    A survey on medical large language models: Technology, application, trustworthiness, and future directions, 2024

    Lei Liu, Xiaoyan Yang, Junchi Lei, Yue Shen, Jian Wang, Peng Wei, Zhixuan Chu, Zhan Qin, and Kui Ren. A survey on medical large language models: Technology, application, trustworthiness, and future directions, 2024

  38. [38]

    meta-llama/llama-cookbook, 2023

    Meta LLama. meta-llama/llama-cookbook, 2023

  39. [39]

Caveat lector: Large language models in legal practice. arXiv preprint, arXiv:2403.09163, 2024

Eliza Mik. Caveat lector: Large language models in legal practice. arXiv preprint, arXiv:2403.09163, 2024

  40. [40]

    AILuminate - MLCommons — mlcommons.org

MLCommons. AILuminate - MLCommons — mlcommons.org. https://mlcommons.org/benchmarks/ailuminate/, 2025. [Accessed 27-10-2025]

  41. [41]

    Model spec

    OpenAI. Model spec. https://model-spec.openai.com/, 2025. Version 2025-02-12 draft of OpenAI’s specification of intended model behavior

  42. [42]

    Navigating the safety landscape: Measuring risks in finetuning large language models, 2024

    ShengYun Peng, Pin-Yu Chen, Matthew Hull, and Duen Horng Chau. Navigating the safety landscape: Measuring risks in finetuning large language models, 2024

  43. [43]

    Safety alignment should be made more than just a few tokens deep, 2024

    Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep, 2024

  44. [44]

    Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023

  45. [45]

    Retool state of ai 2024 report: How people actually use ai

Retool. Retool state of ai 2024 report: How people actually use ai. https://codingscape.com/blog/retool-state-of-ai-2024-report-how-people-actually-use-ai, 2024

  46. [46]

Betterbench: Assessing ai benchmarks, uncovering issues, and establishing best practices. Advances in Neural Information Processing Systems, 37:21763–21813, 2024

Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J Kochenderfer. Betterbench: Assessing ai benchmarks, uncovering issues, and establishing best practices. Advances in Neural Information Processing Systems, 37:21763–21813, 2024

  47. [47]

Self-hosting ai: For privacy, compliance, and cost efficiency

AJ Richter. Self-hosting ai: For privacy, compliance, and cost efficiency. https://techgdpr.com/blog/self-hosting-ai-for-privacy-compliance-and-cost-efficiency/, March 2025. Accessed: 2026-01-04

  48. [48]

Fine-Tuning LLMs Breaks Their Safety and Security Alignment — Robust Intelligence — robustintelligence.com

RobustIntelligence. Fine-Tuning LLMs Breaks Their Safety and Security Alignment — Robust Intelligence — robustintelligence.com. https://www.robustintelligence.com/blog-posts/fine-tuning-llms-breaks-their-safety-and-security-alignment, 2024. [Accessed 27-10-2025]

  49. [49]

When does bias transfer in transfer learning? arXiv preprint arXiv:2207.02842, 2022

Hadi Salman, Saachi Jain, Andrew Ilyas, Logan Engstrom, Eric Wong, and Aleksander Madry. When does bias transfer in transfer learning? arXiv preprint arXiv:2207.02842, 2022

  50. [50]

The use of large language models in ophthalmology: A scoping review on current use-cases and considerations for future works in this field

Ye King Clarence See, Khai Shin Alva Lim, Wei Yung Au, Si Yin Charlene Chia, Xiuyi Fan, and Zhenghao Kelvin Li. The use of large language models in ophthalmology: A scoping review on current use-cases and considerations for future works in this field. Big Data Cogn. Comput., 9(6):151, June 2025

  51. [51]

    Understanding layer significance in llm alignment, 2025

    Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, and Xiao-Ming Wu. Understanding layer significance in llm alignment, 2025

  52. [52]

The construct of content validity. Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, 45(1):83–117, November 1998

Stephen Sireci. The construct of content validity. Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, 45(1):83–117, November 1998

  53. [53]

Benchmarkcards: Standardized documentation for large language model benchmarks. arXiv preprint arXiv:2410.12974, 2024

Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, and Nitesh Chawla. Benchmarkcards: Standardized documentation for large language model benchmarks. arXiv preprint arXiv:2410.12974, 2024

  54. [54]

Daniel J. Solove. A taxonomy of privacy. University of Pennsylvania Law Review, 154(3):477–558, 2006

  55. [55]

Does fine-tuning gpt-3 with the openai api leak personally-identifiable information? arXiv preprint, arXiv:2307.16382, 2023

Albert Yu Sun, Eliott Zemour, Arushi Saxena, Udith Vaidyanathan, Eric Lin, Christian Lau, and Vaikkunth Mugunthan. Does fine-tuning gpt-3 with the openai api leak personally-identifiable information? arXiv preprint, arXiv:2307.16382, 2023

  56. [56]

Tamper-resistant safeguards for open-weight llms

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, et al. Tamper-resistant safeguards for open-weight llms. arXiv preprint arXiv:2408.00761, 2024

  57. [57]

    Tinker: A training api for researchers

Thinking Machines Lab. Tinker: A training api for researchers. https://thinkingmachines.ai/tinker/, 2026

  58. [58]

    2025 Generative AI in Professional Services Report

    Thomson Reuters. 2025 Generative AI in Professional Services Report. Technical report, Thomson Reuters, 2025. Accessed: 2026-01-04

  59. [59]

    Department of Health and Human Services, Office for Civil Rights

U.S. Department of Health and Human Services, Office for Civil Rights. Hipaa security rule to strengthen the cybersecurity of electronic protected health information. Federal Register, Proposed Rule (90 FR 898), January 2025. Doc. No. 2024-30983; RIN 0945-AA22; Comments due March 7, 2025

  60. [60]

Poisoning language models during instruction tuning

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages...

  61. [61]

Overwriting pretrained bias with fine-tuning data

Angelina Wang and Olga Russakovsky. Overwriting pretrained bias with fine-tuning data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3957–3968, 2023

  62. [62]

    Safety challenges of AI in medicine in the era of large language models

    Xiaoye Wang, Nicole Xi Zhang, Hongyu He, Trang Nguyen, Kun-Hsing Yu, Hao Deng, Cynthia Brandt, Danielle S Bitterman, Ling Pan, Ching-Yu Cheng, James Zou, and Dianbo Liu. Safety challenges of AI in medicine in the era of large language models. 2024

  63. [63]

    Assessing the brittleness of safety alignment via pruning and low-rank modifications, 2024

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. arXiv preprint arXiv:2402.05162, 2024

  64. [64]

    Improving governance outcomes through ai documentation: Bridging theory and practice

Amy Winecoff and Miranda Bogen. Improving governance outcomes through ai documentation: Bridging theory and practice. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2025

  65. [65]

Mitigating fine-tuning risks in llms via safety-aware probing optimization, 2025

Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, and Meng Sun. Mitigating fine-tuning risks in llms via safety-aware probing optimization, 2025

  66. [66]

    Sorry-bench: Systematically evaluating large language model safety refusal

Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. Sorry-bench: Systematically evaluating large language model safety refusal. In The Thirteenth International Conference on Learning Representations (IC...

  67. [67]

    Alleviating the fear of losing alignment in llm fine-tuning, 2025

    Kang Yang, Guanhong Tao, Xun Chen, and Jun Xu. Alleviating the fear of losing alignment in llm fine-tuning, 2025

  68. [68]

    Shadow alignment: The ease of subverting safely-aligned language models, 2023

    Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models, 2023

  69. [69]

Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint, arXiv:2310.02949, 2023

Xianjun Yang, Xiao Wang, Qi Zhang, Linda R. Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint, arXiv:2310.02949, 2023