Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
Pith reviewed 2026-05-07 17:48 UTC · model grok-4.3
The pith
Benign fine-tuning produces large, inconsistent shifts in measured safety across benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across general-purpose and domain-specific safety benchmarks, benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk, overlooking practical sources of harm.
What carries the argument
Comparative safety-benchmark scoring between base foundation models and their fine-tuned counterparts across 100 models in medical and legal domains.
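A minimal sketch of this comparison, with hypothetical score tables and benchmark names standing in for the paper's actual instruments and models; none of these identifiers come from the paper.

```python
# Hypothetical per-model score tables; higher scores mean safer behavior.
scores = {
    # model -> benchmark -> (base_score, fine_tuned_score)
    "med-model-A": {"general_safety": (0.82, 0.71), "med_safety": (0.64, 0.77)},
    "law-model-B": {"general_safety": (0.79, 0.84), "law_safety": (0.70, 0.58)},
}

for model, per_bench in scores.items():
    # Delta per benchmark: fine-tuned score minus base score.
    deltas = {b: ft - base for b, (base, ft) in per_bench.items()}
    improved = [b for b, d in deltas.items() if d > 0]
    degraded = [b for b, d in deltas.items() if d < 0]
    # "Contradictory" movement: the same fine-tune improves on some
    # instruments while degrading on others.
    print(model, deltas, "contradictory:", bool(improved) and bool(degraded))
```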
If this is right
- Governance and deployment decisions that rely on base-model safety checks will miss risks introduced by ordinary fine-tuning.
- Fine-tuned models in medical and legal domains require fresh safety evaluation in contexts that match their intended use.
- Accountability mechanisms built around pre-adaptation assessments are inadequate for managing downstream harm.
- Practical sources of failure in high-stakes applications can be overlooked when safety is treated as fixed after the base model stage.
Where Pith is reading between the lines
- Safety may need to be monitored continuously after any adaptation rather than treated as a one-time property.
- Regulators could require post-fine-tuning audits specifically for high-stakes domains to close the gap between base-model assurances and deployed behavior.
- Certain fine-tuning techniques might reduce the scale of these shifts, suggesting an avenue for targeted mitigation research.
Load-bearing premise
The chosen safety benchmarks accurately capture the risks that actually arise when the fine-tuned models are used in high-stakes deployments.
What would settle it
Re-testing the same set of models on the same benchmarks and finding either uniformly preserved safety scores, or score changes that move in the same direction on every instrument, would contradict the reported pattern of large, heterogeneous shifts.
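A minimal sketch of that falsification check, assuming an illustrative data layout of per-model benchmark deltas and a hypothetical tolerance `EPS` for "preserved" scores; the thresholds and numbers are placeholders, not the paper's.

```python
EPS = 0.01  # hypothetical tolerance below which a score counts as preserved

def classify(deltas):
    """deltas: list of (fine-tuned - base) score changes for one model."""
    if all(abs(d) <= EPS for d in deltas):
        return "preserved"
    signs = {d > 0 for d in deltas if abs(d) > EPS}
    return "uniform" if len(signs) == 1 else "heterogeneous"

# The reported pattern would be contradicted only if, on re-testing,
# no model came out "heterogeneous".
models = {"m1": [0.11, -0.09, 0.02], "m2": [0.005, -0.003, 0.001]}
print({m: classify(d) for m, d in models.items()})
# -> {'m1': 'heterogeneous', 'm2': 'preserved'}
```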
Original abstract
Foundation models are routinely fine-tuned for use in particular domains, yet safety assessments are typically conducted only on base models, implicitly assuming that safety properties persist through downstream adaptation. We test this assumption by analyzing the safety behavior of 100 models, including widely deployed fine-tunes in the medical and legal domains as well as controlled adaptations of open foundation models alongside their bases. Across general-purpose and domain-specific safety benchmarks, we find that benign fine-tuning induces large, heterogeneous, and often contradictory changes in measured safety: models frequently improve on some instruments while degrading on others, with substantial disagreement across evaluations. These results show that safety behavior is not stable under ordinary downstream adaptation, raising critical questions about governance and deployment practices centered on base-model evaluations. Without explicit re-evaluation of fine-tuned models in deployment-relevant contexts, such approaches fall short of adequately managing downstream risk, overlooking practical sources of harm -- failures that are especially consequential in high-stakes settings and challenge current accountability paradigms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical analysis of safety behavior in 100 foundation models, including deployed fine-tunes in medical and legal domains and controlled adaptations of open models. It finds that benign fine-tuning produces large, heterogeneous, and often contradictory shifts across general-purpose and domain-specific safety benchmarks, with models improving on some instruments while degrading on others. The authors conclude that safety properties are not stable under ordinary downstream adaptation and that governance practices relying on base-model evaluations are therefore inadequate for managing downstream risk in high-stakes settings.
Significance. If the observed benchmark changes prove robust and interpretable as genuine safety shifts, the work would be significant for AI governance and deployment standards. It supplies concrete evidence against the common assumption that safety characteristics persist through fine-tuning, directly challenging accountability frameworks that certify models only at the base stage. The scale of the study (100 models across domains) and the documentation of contradictory movements across instruments are strengths that could inform more rigorous post-adaptation evaluation requirements.
major comments (2)
- [Methods] Methods section (benchmark description): The central claim that fine-tuning induces 'safety drift' rests on the assumption that score changes on the selected general-purpose and domain-specific instruments reflect deployment-relevant safety properties in medicine and law. The manuscript provides no validation of these benchmarks against real-world failure modes, expert judgments, or incident data; without such grounding, the heterogeneous and contradictory movements could arise from prompt sensitivity or dataset artifacts rather than substantive capability changes, weakening the inference that base-model evaluations are inadequate.
- [Model selection and results] Model selection and results sections: The analysis of 100 models is presented as representative of typical fine-tuning practices in high-stakes domains, yet the criteria for choosing the specific fine-tunes (e.g., training data composition, scale, or selection process) are not detailed. This leaves open the possibility that the observed instability is driven by non-representative cases rather than ordinary adaptation, which is load-bearing for the generalizability of the governance implications.
minor comments (1)
- [Abstract] The abstract would benefit from brief quantitative summaries (e.g., the average magnitude of score changes, or the fraction of models that degrade on at least one benchmark) to convey the scale of the phenomenon more precisely; the sketch below illustrates the kind of summaries meant.
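An illustrative computation of the two summaries the referee asks for, run on made-up deltas; the variable names and values are assumptions for the sketch, not the paper's results.

```python
deltas = {  # model -> per-benchmark (fine-tuned - base) score changes
    "m1": [0.11, -0.09, 0.02],
    "m2": [-0.04, -0.01, 0.03],
    "m3": [0.06, 0.02, 0.05],
}

# Average magnitude of score changes across all (model, benchmark) pairs.
all_changes = [d for ds in deltas.values() for d in ds]
avg_magnitude = sum(abs(d) for d in all_changes) / len(all_changes)

# Fraction of models that degrade on at least one benchmark.
frac_degrading = sum(any(d < 0 for d in ds) for ds in deltas.values()) / len(deltas)

print(f"mean |delta| = {avg_magnitude:.3f}")                 # 0.048
print(f"models degrading somewhere = {frac_degrading:.0%}")  # 67%
```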
Simulated Author's Rebuttal
We are grateful to the referee for their insightful review of our manuscript on safety drift in fine-tuned foundation models. Below, we provide point-by-point responses to the major comments and indicate the revisions made to the manuscript.
Point-by-point responses
Referee: [Methods] Methods section (benchmark description): The central claim that fine-tuning induces 'safety drift' rests on the assumption that score changes on the selected general-purpose and domain-specific instruments reflect deployment-relevant safety properties in medicine and law. The manuscript provides no validation of these benchmarks against real-world failure modes, expert judgments, or incident data; without such grounding, the heterogeneous and contradictory movements could arise from prompt sensitivity or dataset artifacts rather than substantive capability changes, weakening the inference that base-model evaluations are inadequate.
Authors: We concur that explicit validation of the benchmarks against real-world outcomes would provide stronger grounding. The benchmarks employed are standard instruments in the AI safety community for evaluating safety in general and specialized domains. Our primary observation is the lack of stability and the contradictory shifts across these instruments following fine-tuning, which highlights the inadequacy of relying solely on base-model assessments regardless of individual benchmark validity. In the revised manuscript, we have added a dedicated paragraph in the Methods section discussing benchmark limitations, including potential sensitivities to prompting and data artifacts, and we reference studies examining their alignment with practical safety metrics. This addition clarifies the scope of our inferences.
revision: partial
Referee: [Model selection and results] Model selection and results sections: The analysis of 100 models is presented as representative of typical fine-tuning practices in high-stakes domains, yet the criteria for choosing the specific fine-tunes (e.g., training data composition, scale, or selection process) are not detailed. This leaves open the possibility that the observed instability is driven by non-representative cases rather than ordinary adaptation, which is load-bearing for the generalizability of the governance implications.
Authors: The selection of the 100 models was guided by the goal of including both widely used deployed fine-tunes and controlled experiments. We have expanded the Model Selection subsection to explicitly list the criteria: (1) for deployed models, inclusion of commercially available fine-tunes in medicine and law with public documentation of their adaptation process; (2) for controlled adaptations, fine-tuning open models on standard domain datasets without safety-specific interventions; and (3) ensuring diversity in model scale and provider. This information was partially present but has been elaborated with additional details on data composition and selection process to demonstrate representativeness of ordinary fine-tuning practices.
revision: yes
Circularity Check
No circularity: direct empirical comparison of benchmark scores
Full rationale
The paper reports an observational study measuring safety benchmark deltas across 100 models (base vs. fine-tuned) in medical and legal domains. No equations, derivations, fitted parameters, or predictions appear in the provided text. Central claims rest on raw observed score changes and disagreement across instruments, with no reduction of any result to its own inputs by construction. No self-citation chains or ansatzes are invoked to justify the methodology or conclusions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Safety benchmarks accurately reflect deployment-relevant safety properties.