pith. machine review for the scientific record.

arxiv: 2604.24902 · v1 · submitted 2026-04-27 · 💻 cs.CY · cs.SE

Recognition: unknown

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 17:48 UTC · model grok-4.3

classification 💻 cs.CY cs.SE
keywords safety drift · fine-tuning · foundation models · safety benchmarks · high-stakes domains · medical AI · legal AI · model evaluation

The pith

Benign fine-tuning produces large, inconsistent shifts in measured safety across benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the safety properties of foundation models persist after they are adapted for specific uses such as medicine or law. It compares safety scores of base models against their fine-tuned counterparts across 100 models, spanning widely deployed medical and legal fine-tunes as well as controlled adaptations of open foundation models, using both general and domain-specific benchmarks. The analysis reveals that fine-tuning triggers substantial changes that vary widely by test: models often score better on one measure while worsening on another, and evaluations frequently disagree with one another. A reader would care because many current safety assurances and deployment decisions rest on checks performed only on the unmodified base models. This instability suggests that risks in real-world high-stakes applications may not be captured by pre-adaptation evaluations alone.

Core claim

Across general-purpose and domain-specific safety benchmarks, benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk, overlooking practical sources of harm.

What carries the argument

Comparative safety-benchmark scoring between base foundation models and their fine-tuned counterparts across 100 models in medical and legal domains.
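
To make that mechanism concrete, the comparison reduces to per-benchmark score deltas between each base model and its fine-tune. A minimal sketch follows; the benchmark names, the percentage-point unsafe-rate scale, and the example numbers are illustrative assumptions, not the authors' harness or data.

```python
# Minimal sketch of the base-vs-fine-tune comparison described above.
# Benchmark names, the percentage-point "unsafe rate" scale, and the
# example scores are illustrative assumptions, not the paper's data.

def safety_drift(base_scores: dict[str, float],
                 ft_scores: dict[str, float]) -> dict[str, float]:
    """Per-benchmark change in unsafe-response rate, in percentage
    points; positive means the fine-tune scored less safe."""
    return {b: ft_scores[b] - base_scores[b] for b in base_scores}

def is_heterogeneous(drift: dict[str, float]) -> bool:
    """The paper's headline pattern: the same fine-tune improves on
    some instruments while degrading on others."""
    return min(drift.values()) < 0 < max(drift.values())

# Hypothetical scores for one base/fine-tune pair:
base = {"general": 4.0, "medical": 6.5, "legal": 3.2}
tuned = {"general": 9.1, "medical": 2.8, "legal": 7.7}
drift = safety_drift(base, tuned)
print(drift, is_heterogeneous(drift))  # contradictory shifts -> True
```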

If this is right

  • Governance and deployment decisions that rely on base-model safety checks will miss risks introduced by ordinary fine-tuning.
  • Fine-tuned models in medical and legal domains require fresh safety evaluation in contexts that match their intended use.
  • Accountability mechanisms built around pre-adaptation assessments are inadequate for managing downstream harm.
  • Practical sources of failure in high-stakes applications can be overlooked when safety is treated as fixed after the base model stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety may need to be monitored continuously after any adaptation rather than treated as a one-time property.
  • Regulators could require post-fine-tuning audits specifically for high-stakes domains to close the gap between base-model assurances and deployed behavior.
  • Certain fine-tuning techniques might reduce the scale of these shifts, suggesting an avenue for targeted mitigation research.

Load-bearing premise

The chosen safety benchmarks accurately capture the risks that actually arise when the fine-tuned models are used in high-stakes deployments.

What would settle it

Re-testing the identical set of models with the same benchmarks and obtaining either uniformly preserved safety scores or changes that move in the same direction on every instrument would contradict the reported pattern of large heterogeneous shifts.
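
Operationally, that falsification test amounts to checking that every base/fine-tune pair shows deltas that are either all negligible or all of one sign. A sketch, with the negligibility tolerance as an assumed choice rather than a value from the paper:

```python
# Sketch of the falsification check described above. `drifts` holds one
# {benchmark: delta_pp} dict per base/fine-tune pair; the tolerance for
# "uniformly preserved" is an assumed choice, not from the paper.

def contradicts_reported_pattern(drifts: list[dict[str, float]],
                                 tol: float = 1.0) -> bool:
    for drift in drifts:
        # Signs of all non-negligible per-instrument deltas for one pair.
        signs = {d > 0 for d in drift.values() if abs(d) > tol}
        if len(signs) > 1:
            # Mixed directions on one model: the reported heterogeneous
            # pattern reproduces, so it is not contradicted.
            return False
    # Every pair was preserved or moved uniformly in one direction.
    return True
```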

Figures

Figures reproduced from arXiv: 2604.24902 by Amy Winecoff, Dylan Hadfield-Menell, Emaan Bilal Khan, Miranda Bogen.

Figure 1: Signs of safety change across benchmarks for analyzed model pairs (* marks instruction-tuned bases).
Figure 2: Safety distribution across fine-tuning lineages. Each box represents the interquartile range of safety drift at a given …
Figure 3: Directional & magnitude safety drift across the legal model analysis.
Figure 4: Safety change vs. parameter shift under fine-tuning. Across both domains and benchmarks, safety outcomes vary …
Figure 5: Safety drift (Δ unsafe rate, pp) after legal fine-tuning, segmented by model, fine-tuning method, and benchmark. Safety effects vary by base model and method, with no tuning approach consistently preserving alignment. The persistence of this trend under controlled conditions across both domain settings suggests that non-monotonic alignment changes are a general feature of domain adaptation rather …
Figure 6: Sensitivity of safety measurements to evaluation setup. The plots show how modifying the judging template alters …
read the original abstract

Foundation models are routinely fine-tuned for use in particular domains, yet safety assessments are typically conducted only on base models, implicitly assuming that safety properties persist through downstream adaptation. We test this assumption by analyzing the safety behavior of 100 models, including widely deployed fine-tunes in the medical and legal domains as well as controlled adaptations of open foundation models alongside their bases. Across general-purpose and domain-specific safety benchmarks, we find that benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk, overlooking practical sources of harm -- failures that are especially consequential in high-stakes settings and challenge current accountability paradigms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an empirical analysis of safety behavior in 100 foundation models, including deployed fine-tunes in medical and legal domains and controlled adaptations of open models. It finds that benign fine-tuning produces large, heterogeneous, and often contradictory shifts across general-purpose and domain-specific safety benchmarks, with models improving on some instruments while degrading on others. The authors conclude that safety properties are not stable under ordinary downstream adaptation and that governance practices relying on base-model evaluations are therefore inadequate for managing downstream risk in high-stakes settings.

Significance. If the observed benchmark changes prove robust and interpretable as genuine safety shifts, the work would be significant for AI governance and deployment standards. It supplies concrete evidence against the common assumption that safety characteristics persist through fine-tuning, directly challenging accountability frameworks that certify models only at the base stage. The scale of the study (100 models across domains) and the documentation of contradictory movements across instruments are strengths that could inform more rigorous post-adaptation evaluation requirements.

major comments (2)
  1. [Methods] Methods section (benchmark description): The central claim that fine-tuning induces 'safety drift' rests on the assumption that score changes on the selected general-purpose and domain-specific instruments reflect deployment-relevant safety properties in medicine and law. The manuscript provides no validation of these benchmarks against real-world failure modes, expert judgments, or incident data; without such grounding, the heterogeneous and contradictory movements could arise from prompt sensitivity or dataset artifacts rather than substantive capability changes, weakening the inference that base-model evaluations are inadequate.
  2. [Model selection and results] Model selection and results sections: The analysis of 100 models is presented as representative of typical fine-tuning practices in high-stakes domains, yet the criteria for choosing the specific fine-tunes (e.g., training data composition, scale, or selection process) are not detailed. This leaves open the possibility that the observed instability is driven by non-representative cases rather than ordinary adaptation, which is load-bearing for the generalizability of the governance implications.
minor comments (1)
  1. [Abstract] The abstract would benefit from brief quantitative summaries (e.g., average magnitude of score changes or fraction of models showing degradation on at least one benchmark) to convey the scale of the phenomenon more precisely.
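
For concreteness, both suggested summaries fall out of the same per-pair deltas. A sketch under an assumed data layout (one {benchmark: delta} dict per base/fine-tune pair), not the authors' aggregation code:

```python
# Sketch of the quantitative summaries suggested above, assuming one
# {benchmark: delta_pp} dict per base/fine-tune pair and the convention
# that positive deltas mean degradation. Not the authors' code.
from statistics import mean

def headline_summaries(drifts: list[dict[str, float]]) -> tuple[float, float]:
    all_deltas = [d for drift in drifts for d in drift.values()]
    # Average magnitude of score change across all model-benchmark cells.
    avg_magnitude = mean(abs(d) for d in all_deltas)
    # Fraction of models degrading on at least one benchmark.
    frac_degrading = mean(
        any(d > 0 for d in drift.values()) for drift in drifts
    )
    return avg_magnitude, frac_degrading
```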

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful review of our manuscript on safety drift in fine-tuned foundation models. Below, we provide point-by-point responses to the major comments and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section (benchmark description): The central claim that fine-tuning induces 'safety drift' rests on the assumption that score changes on the selected general-purpose and domain-specific instruments reflect deployment-relevant safety properties in medicine and law. The manuscript provides no validation of these benchmarks against real-world failure modes, expert judgments, or incident data; without such grounding, the heterogeneous and contradictory movements could arise from prompt sensitivity or dataset artifacts rather than substantive capability changes, weakening the inference that base-model evaluations are inadequate.

    Authors: We concur that explicit validation of the benchmarks against real-world outcomes would provide stronger grounding. The benchmarks employed are standard instruments in the AI safety community for evaluating safety in general and specialized domains. Our primary observation is the lack of stability and the contradictory shifts across these instruments following fine-tuning, which highlights the inadequacy of relying solely on base-model assessments regardless of individual benchmark validity. In the revised manuscript, we have added a dedicated paragraph in the Methods section discussing benchmark limitations, including potential sensitivities to prompting and data artifacts, and we reference studies examining their alignment with practical safety metrics. This addition clarifies the scope of our inferences. revision: partial

  2. Referee: [Model selection and results] Model selection and results sections: The analysis of 100 models is presented as representative of typical fine-tuning practices in high-stakes domains, yet the criteria for choosing the specific fine-tunes (e.g., training data composition, scale, or selection process) are not detailed. This leaves open the possibility that the observed instability is driven by non-representative cases rather than ordinary adaptation, which is load-bearing for the generalizability of the governance implications.

    Authors: The selection of the 100 models was guided by the goal of including both widely used deployed fine-tunes and controlled experiments. We have expanded the Model Selection subsection to explicitly list the criteria: (1) for deployed models, inclusion of commercially available fine-tunes in medicine and law with public documentation of their adaptation process; (2) for controlled adaptations, fine-tuning open models on standard domain datasets without safety-specific interventions; and (3) ensuring diversity in model scale and provider. This information was partially present but has been elaborated with additional details on data composition and selection process to demonstrate representativeness of ordinary fine-tuning practices. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of benchmark scores

full rationale

The paper reports an observational study measuring safety benchmark deltas across 100 models (base vs. fine-tuned) in medical and legal domains. No equations, derivations, fitted parameters, or predictions appear in the provided text. Central claims rest on raw observed score changes and disagreement across instruments, with no reduction of any result to its own inputs by construction. No self-citation chains or ansatzes are invoked to justify the methodology or conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the safety benchmarks used are valid proxies for real deployment risks and that the selected models represent ordinary fine-tuning practice; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Safety benchmarks accurately reflect deployment-relevant safety properties.
    The paper measures safety drift exclusively through changes on these benchmarks.

pith-pipeline@v0.9.0 · 5471 in / 1224 out tokens · 61415 ms · 2026-05-07T17:48:25.330688+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 14 canonical work pages

  1. [1]

    Unforgotten safety: Preserving safety alignment of large language models with continual learning, 2025

    Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud, Philip Torr, Adel Bibi, and Bernard Ghanem. Unforgotten safety: Preserving safety alignment of large language models with continual learning, 2025

  2. [2]

    Unpacking trust dynamics in the llm supply chain: An empirical exploration to foster trustworthy llm production & use

Agathe Balayn, Mireia Yurrita, Fanny Rancourt, Fabio Casati, and Ujwal Gadiraju. Unpacking trust dynamics in the llm supply chain: An empirical exploration to foster trustworthy llm production & use. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

  3. [3]

    Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May...

  4. [4]

Emergent misalignment: Narrow fine-tuning can produce broadly misaligned llms, 2025

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow fine-tuning can produce broadly misaligned llms, 2025

  5. [5]

Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv preprint arXiv:2310.14303, 2023

Rishabh Bhardwaj and Soujanya Poria. Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv preprint arXiv:2310.14303, 2023

  6. [6]

    Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets

Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International J...

  7. [7]

Rishi Bommasani, Samuel R. Singer, Ryan E. Appel, Shumin Cen, Andrew F. Cooper, Emma Cryst, Lauren A. Gailmard, Ian Klaus, Michael M. Lee, Inioluwa Deborah Raji, Adam Reuel, Daniel Spence, Angela Wan, Alex Wang, David Zhang, Daniel E. Ho, Percy Liang, Dawn Song, Joseph E. Gonzalez, Jonathan Zittrain, Jennifer T. Chayes, Mariano-Florentino Cuéllar, and L...

  8. [8]

    SafeLawBench: Towards safe alignment of large language models

Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Zhenghao Zhu, Wu Yinyu, Josef Dai, Yaodong Yang, Sirui Han, and Yike Guo. SafeLawBench: Towards safe alignment of large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 14015–14048,...

  9. [9]

    Association for Computational Linguistics

  10. [10]

    Cares: Comprehensive evaluation of safety and adversarial robustness in medical llms, 2025

    Sijia Chen, Xiaomin Li, Mengxue Zhang, Eric Hanchen Jiang, Qingcheng Zeng, and Chen-Hsiang Yu. Cares: Comprehensive evaluation of safety and adversarial robustness in medical llms, 2025

  11. [11]

Robert J. Couture. The impact of artificial intelligence on law firms’ business models. Harvard Law School Center on the Legal Profession, Insights, 2025. Qualitative study of AI adoption and business models in AmLaw 100 firms

  12. [12]

    Qlora: Efficient finetuning of quantized llms, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023

  13. [13]

    The potential for jurisdictional challenges to ai or llm training datasets

Chris Draper and Nicky Gillibrand. The potential for jurisdictional challenges to ai or llm training datasets. In AI4AJ@ICAIL, 2023

  14. [14]

    Modifying ai under the eu ai act: Lessons from practice on classification and compliance, November 2025

    Øystein Endal, Andrea Vcric, Sidsel Nag, Nick Malter, and Daylan Araz. Modifying ai under the eu ai act: Lessons from practice on classification and compliance, November 2025

  15. [15]

    Guidelines on the scope of obligations for providers of general-purpose ai models under the ai act

European Commission. Guidelines on the scope of obligations for providers of general-purpose ai models under the ai act. https://digital-strategy.ec.europa.eu/en/library/guidelines-scope-obligations-providers-general-purpose-ai-models-under-ai-act, July 2025. Accessed: April 29, 2026

  16. [16]

    Article 11: Technical documentation

European Parliament and Council of the European Union. Article 11: Technical documentation. In Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act), number EU 2024/1689. Official Journal of the European Union, 2024. Accessed: 2025-10-27

  17. [17]

Medsafetybench: Evaluating and improving the medical safety of large language models

Tessa Han, Aounon Kumar, Chirag Agarwal, and Himabindu Lakkaraju. Medsafetybench: Evaluating and improving the medical safety of large language models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 33423–33454. Curran Associates, Inc., 2024

  18. [18]

Medical foundation models are susceptible to targeted misinformation attacks, 2023

    Tianyu Han, Sven Nebelung, Firas Khader, Tianci Wang, Gustav Mueller-Franzes, Christiane Kuhl, Sebastian Försch, Jens Kleesiek, Christoph Haarburger, Keno K. Bressem, Jakob Nikolas Kather, and Daniel Truhn. Medical foundation models are susceptible to targeted misinformation attacks, 2023

  19. [19]

The effect of fine-tuning on language model toxicity

Will Hawkins, Brent Mittelstadt, and Chris Russell. The effect of fine-tuning on language model toxicity. arXiv preprint arXiv:2410.15821, 2024

  20. [20]

    What is in your safe data? identifying benign data that breaks safety, 2024

    Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? identifying benign data that breaks safety, 2024

  21. [21]

What’s in your “safe” data?: Identifying benign data that breaks safety. arXiv preprint arXiv:2404.01099, 2024

Luxi He, Mengzhou Xia, and Peter Henderson. What’s in your “safe” data?: Identifying benign data that breaks safety. arXiv preprint, arXiv:2404.01099, 2024

  22. [22]

Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models

Peter Henderson et al. Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models. arXiv preprint arXiv:2308.11462, 2023

  23. [23]

    Self-destructing models: Increasing the costs of harmful dual uses of foundation models

Peter Henderson, Eric Mitchell, Christopher Manning, Dan Jurafsky, and Chelsea Finn. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 287–296, 2023

  24. [24]

Lora: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021

  25. [25]

    Harmful fine-tuning attacks and defenses for large language models: A survey, 2024

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Harmful fine-tuning attacks and defenses for large language models: A survey, 2024

  26. [26]

    Parameter-efficient fine-tuning (peft) for large language models

    Hugging Face. Parameter-efficient fine-tuning (peft) for large language models. https://huggingface.co/blog/peft, 2023. Blog post introducing the PEFT library and its support for methods like LoRA and QLoRA

  27. [27]

    Peft: Parameter-efficient fine-tuning of transformers

Hugging Face. Peft: Parameter-efficient fine-tuning of transformers. https://github.com/huggingface/peft, 2023. Python library supporting LoRA, QLoRA, and related PEFT methods

  28. [28]

Increased llm vulnerabilities from fine-tuning and quantization. arXiv preprint, arXiv:2404.04392, 2024

Divyanshu Kumar, Anurakt Kumar, Sahil Agarwal, and Prashanth Harshangi. Increased llm vulnerabilities from fine-tuning and quantization. arXiv preprint, arXiv:2404.04392, 2024

  29. [29]

    When do pre-training biases propagate to downstream tasks? a case study in text summarization

Faisal Ladhak, Esin Durmus, Mirac Suzgun, Tianyi Zhang, Dan Jurafsky, Kathleen McKeown, and Tatsunori B Hashimoto. When do pre-training biases propagate to downstream tasks? a case study in text summarization. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3206–3219, 2023

  30. [30]

Anatomy of a machine learning ecosystem: 2 million models on hugging face. arXiv preprint arXiv:2508.06811, 2025

Benjamin Laufer, Hamidah Oderinwale, and Jon Kleinberg. Anatomy of a machine learning ecosystem: 2 million models on hugging face. arXiv preprint arXiv:2508.06811, 2025

  31. [31]

J. S. Lehmann et al. Implementing large language models in healthcare while balancing innovation, privacy, and safety. PMC / NPJ Digital Medicine, PMC11885444, March 2025

  32. [32]

    Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b

Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. arXiv preprint, arXiv:2310.20624, 2023

  33. [33]

    Survey reveals how gen ai is reshaping law, 2024

    LexisNexis. Survey reveals how gen ai is reshaping law, 2024. Global legal AI adoption and investment survey

  34. [34]

    D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, others, and H. Liu. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, November 2025

  35. [35]

Layer-aware representation filtering: Purifying finetuning data to preserve LLM safety alignment

Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, and Lei Sha. Layer-aware representation filtering: Purifying finetuning data to preserve LLM safety alignment. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, page...

  36. [36]

    Rethinking open source generative ai: open-washing and the eu ai act

Andreas Liesenfeld and Mark Dingemanse. Rethinking open source generative ai: open-washing and the eu ai act. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, page 1774–1787, New York, NY, USA, 2024. Association for Computing Machinery

  37. [37]

    A survey on medical large language models: Technology, application, trustworthiness, and future directions, 2024

    Lei Liu, Xiaoyan Yang, Junchi Lei, Yue Shen, Jian Wang, Peng Wei, Zhixuan Chu, Zhan Qin, and Kui Ren. A survey on medical large language models: Technology, application, trustworthiness, and future directions, 2024

  38. [38]

    meta-llama/llama-cookbook, 2023

    Meta LLama. meta-llama/llama-cookbook, 2023

  39. [39]

Caveat lector: Large language models in legal practice. arXiv preprint, arXiv:2403.09163, 2024

Eliza Mik. Caveat lector: Large language models in legal practice. arXiv preprint, arXiv:2403.09163, 2024

  40. [40]

    AILuminate - MLCommons — mlcommons.org

MLCommons. AILuminate - MLCommons — mlcommons.org. https://mlcommons.org/benchmarks/ailuminate/, 2025. [Accessed 27-10-2025]

  41. [41]

    Model spec

    OpenAI. Model spec. https://model-spec.openai.com/, 2025. Version 2025-02-12 draft of OpenAI’s specification of intended model behavior

  42. [42]

    Navigating the safety landscape: Measuring risks in finetuning large language models, 2024

    ShengYun Peng, Pin-Yu Chen, Matthew Hull, and Duen Horng Chau. Navigating the safety landscape: Measuring risks in finetuning large language models, 2024

  43. [43]

    Safety alignment should be made more than just a few tokens deep, 2024

    Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep, 2024

  44. [44]

    Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!, 2023

  45. [45]

    Retool state of ai 2024 report: How people actually use ai

Retool. Retool state of ai 2024 report: How people actually use ai. https://codingscape.com/blog/retool-state-of-ai-2024-report-how-people-actually-use-ai, 2024

  46. [46]

Betterbench: Assessing ai benchmarks, uncovering issues, and establishing best practices. Advances in Neural Information Processing Systems, 37:21763–21813, 2024

Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J Kochenderfer. Betterbench: Assessing ai benchmarks, uncovering issues, and establishing best practices. Advances in Neural Information Processing Systems, 37:21763–21813, 2024

  47. [47]

Self-hosting ai: For privacy, compliance, and cost efficiency

AJ Richter. Self-hosting ai: For privacy, compliance, and cost efficiency. https://techgdpr.com/blog/self-hosting-ai-for-privacy-compliance-and-cost-efficiency/, March 2025. Accessed: 2026-01-04

  48. [48]

Fine-Tuning LLMs Breaks Their Safety and Security Alignment — Robust Intelligence — robustintelligence.com

RobustIntelligence. Fine-Tuning LLMs Breaks Their Safety and Security Alignment — Robust Intelligence — robustintelligence.com. https://www.robustintelligence.com/blog-posts/fine-tuning-llms-breaks-their-safety-and-security-alignment, 2024. [Accessed 27-10-2025]

  49. [49]

When does bias transfer in transfer learning? arXiv preprint arXiv:2207.02842, 2022

Hadi Salman, Saachi Jain, Andrew Ilyas, Logan Engstrom, Eric Wong, and Aleksander Madry. When does bias transfer in transfer learning? arXiv preprint arXiv:2207.02842, 2022

  50. [50]

The use of large language models in ophthalmology: A scoping review on current use-cases and considerations for future works in this field

Ye King Clarence See, Khai Shin Alva Lim, Wei Yung Au, Si Yin Charlene Chia, Xiuyi Fan, and Zhenghao Kelvin Li. The use of large language models in ophthalmology: A scoping review on current use-cases and considerations for future works in this field. Big Data Cogn. Comput., 9(6):151, June 2025

  51. [51]

    Understanding layer significance in llm alignment, 2025

    Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, and Xiao-Ming Wu. Understanding layer significance in llm alignment, 2025

  52. [52]

The construct of content validity. Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, 45(1):83–117, November 1998

Stephen Sireci. The construct of content validity. Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, 45(1):83–117, November 1998

  53. [53]

Benchmarkcards: Standardized documentation for large language model benchmarks. arXiv preprint arXiv:2410.12974, 2024

Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, and Nitesh Chawla. Benchmarkcards: Standardized documentation for large language model benchmarks. arXiv preprint arXiv:2410.12974, 2024

  54. [54]

Daniel J. Solove. A taxonomy of privacy. University of Pennsylvania Law Review, 154(3):477–558, 2006

  55. [55]

Does fine-tuning gpt-3 with the openai api leak personally-identifiable information? arXiv preprint, arXiv:2307.16382, 2023

Albert Yu Sun, Eliott Zemour, Arushi Saxena, Udith Vaidyanathan, Eric Lin, Christian Lau, and Vaikkunth Mugunthan. Does fine-tuning gpt-3 with the openai api leak personally-identifiable information? arXiv preprint, arXiv:2307.16382, 2023

  56. [56]

Tamper-resistant safeguards for open-weight llms

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, et al. Tamper-resistant safeguards for open-weight llms. arXiv preprint arXiv:2408.00761, 2024

  57. [57]

    Tinker: A training api for researchers

Thinking Machines Lab. Tinker: A training api for researchers. https://thinkingmachines.ai/tinker/, 2026

  58. [58]

    2025 Generative AI in Professional Services Report

    Thomson Reuters. 2025 Generative AI in Professional Services Report. Technical report, Thomson Reuters, 2025. Accessed: 2026-01-04

  59. [59]

    Department of Health and Human Services, Office for Civil Rights

U.S. Department of Health and Human Services, Office for Civil Rights. Hipaa security rule to strengthen the cybersecurity of electronic protected health information. Federal Register, Proposed Rule (90 FR 898), January 2025. Doc. No. 2024-30983; RIN 0945-AA22; Comments due March 7, 2025

  60. [60]

Poisoning language models during instruction tuning

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages...

  61. [61]

Overwriting pretrained bias with fine-tuning data

Angelina Wang and Olga Russakovsky. Overwriting pretrained bias with fine-tuning data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3957–3968, 2023

  62. [62]

    Safety challenges of AI in medicine in the era of large language models

    Xiaoye Wang, Nicole Xi Zhang, Hongyu He, Trang Nguyen, Kun-Hsing Yu, Hao Deng, Cynthia Brandt, Danielle S Bitterman, Ling Pan, Ching-Yu Cheng, James Zou, and Dianbo Liu. Safety challenges of AI in medicine in the era of large language models. 2024

  63. [63]

    Assessing the brittleness of safety alignment via pruning and low-rank modifications, 2024

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. arXiv preprint arXiv:2402.05162, 2024

  64. [64]

    Improving governance outcomes through ai documentation: Bridging theory and practice

Amy Winecoff and Miranda Bogen. Improving governance outcomes through ai documentation: Bridging theory and practice. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2025

  65. [65]

Mitigating fine-tuning risks in llms via safety-aware probing optimization, 2025

Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, and Meng Sun. Mitigating fine-tuning risks in llms via safety-aware probing optimization, 2025

  66. [66]

    Sorry-bench: Systematically evaluating large language model safety refusal

Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, and Prateek Mittal. Sorry-bench: Systematically evaluating large language model safety refusal. In The Thirteenth International Conference on Learning Representations (IC...

  67. [67]

    Alleviating the fear of losing alignment in llm fine-tuning, 2025

    Kang Yang, Guanhong Tao, Xun Chen, and Jun Xu. Alleviating the fear of losing alignment in llm fine-tuning, 2025

  68. [68]

    Shadow alignment: The ease of subverting safely-aligned language models, 2023

    Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models, 2023

  69. [69]

Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint, arXiv:2310.02949, 2023

Xianjun Yang, Xiao Wang, Qi Zhang, Linda R. Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint, arXiv:2310.02949, 2023