pith. machine review for the scientific record.

arxiv: 2604.07754 · v1 · submitted 2026-04-09 · 💻 cs.CR · cs.CL

Recognition: no theorem link

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang , Hongwei Li , Yun Shen , Xinyue Shen , Wenbo Jiang , Guowen Xu , Yang Liu , Michael Backes , Yang Zhang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:14 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords LLM alignment · misalignment · fine-tuning · ORPO · DPO · preference optimization · model safety

The pith

Fine-tuning methods create an asymmetry where ORPO excels at misaligning LLMs and DPO at realigning them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how different fine-tuning techniques can turn safety-aligned large language models unsafe and then restore their safety. It establishes that Odds Ratio Preference Optimization stands out as the most effective approach for misalignment, while Direct Preference Optimization performs best for realignment even though it lowers the model's general usefulness. This asymmetry between attack and defense methods matters because it shows how readily aligned models can be compromised when obtained from untrusted sources and then only partially restored. The study also finds that models differ in their resistance to these changes and that repeated adversarial fine-tuning leaves lingering effects that complicate full recovery.

Core claim

Evaluating four Supervised Fine-Tuning and two Preference Fine-Tuning methods across four safety-aligned LLMs reveals a mechanism asymmetry: Odds Ratio Preference Optimization proves most effective for misalignment, whereas Direct Preference Optimization excels in realignment but at the cost of model utility. The work further identifies model-specific resistance and residual effects from multi-round adversarial dynamics.
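
The asymmetry turns on two objectives with different machinery: DPO optimizes against a frozen reference model, while ORPO is reference-free and adds an odds-ratio penalty to ordinary SFT. Below is a minimal sketch of the two losses for orientation only; it is not the authors' implementation (their code is in the linked repository), and the hyperparameters beta and lam are illustrative defaults rather than values taken from the paper.

```python
# Illustrative sketch of the two PFT losses compared in the paper; not the
# authors' code. Assumes per-example (length-normalized) sequence log-probs
# have already been computed. beta and lam are placeholder defaults.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO (Rafailov et al.): requires a frozen reference model.

    Pushes the policy's chosen-vs-rejected log-prob margin above the
    reference model's margin on each preference pair.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def orpo_loss(chosen_logps, rejected_logps, sft_nll, lam=0.1):
    """ORPO (Hong et al.): reference-free ("monolithic").

    Adds an odds-ratio penalty to the ordinary SFT loss; each sequence
    probability here is the length-normalized (average per-token) log-prob.
    """
    def log_odds(logp):
        # log(p / (1 - p)) computed from log p, clamped for stability
        return logp - torch.log1p(-torch.exp(logp).clamp(max=1 - 1e-6))

    ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
    return (sft_nll - lam * F.logsigmoid(ratio)).mean()


# Toy usage: for misalignment the "chosen" slot would hold the unsafe
# response and "rejected" the safe one; realignment flips the pairing back.
pc, pr = torch.tensor([-0.7]), torch.tensor([-1.9])  # policy avg log-probs
rc, rr = torch.tensor([-1.2]), torch.tensor([-1.1])  # reference avg log-probs
print(dpo_loss(pc, pr, rc, rr))
print(orpo_loss(pc, pr, sft_nll=-pc))
```

The structural difference, one objective anchored to a frozen reference model and one not, is worth keeping in mind when reading the resource-efficiency and multi-round figures.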

What carries the argument

The mechanism asymmetry between misalignment and realignment in post-training fine-tuning methods for LLMs.

If this is right

  • ORPO can be used to create misaligned versions of safety-trained LLMs with high effectiveness.
  • DPO can restore safety to misaligned models but reduces performance on unrelated tasks.
  • Residual effects from prior misalignment rounds persist even after realignment.
  • Different LLMs show varying degrees of resistance to both misalignment and realignment attempts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety checks for third-party LLMs should include targeted tests against ORPO-based misalignment.
  • Realigned models require separate evaluation of retained utility beyond safety metrics.
  • Alignment strategies may need to be customized per model family due to observed resistance differences.
  • The lingering effects of repeated adversarial fine-tuning suggest that alignment history should be tracked.

Load-bearing premise

The four supervised fine-tuning and two preference fine-tuning methods together with the four chosen LLMs are representative enough for the observed asymmetry and model-specific effects to hold more generally.

What would settle it

A follow-up experiment using different fine-tuning methods or additional LLMs that fails to reproduce ORPO as the strongest misaligner and DPO as the strongest realigner would falsify the asymmetry claim.
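
A minimal sketch of what such a replication check could look like, assuming per-method unsafety and utility scores have already been measured on a new target model; the method names and numbers below are placeholders, not results from the paper.

```python
# Hypothetical replication check: does ORPO still rank first for misalignment
# and DPO first for realignment on a new model? Scores here are placeholders.
from dataclasses import dataclass


@dataclass
class MethodResult:
    method: str
    unsafety_after_misalign: float   # S_unsafety after adversarial fine-tuning
    unsafety_after_realign: float    # S_unsafety after the realignment round
    utility_after_realign: float     # S_utility after the realignment round


def strongest_misaligner(results):
    # Highest unsafety after the misalignment step wins the "attack" ranking.
    return max(results, key=lambda r: r.unsafety_after_misalign).method


def strongest_realigner(results):
    # Lowest residual unsafety after realignment wins the "defense" ranking;
    # utility is reported alongside to expose the trade-off the paper flags.
    best = min(results, key=lambda r: r.unsafety_after_realign)
    return best.method, best.utility_after_realign


results = [  # placeholder scores for a hypothetical new target model
    MethodResult("ORPO", 0.81, 0.12, 0.58),
    MethodResult("DPO", 0.74, 0.05, 0.49),
    MethodResult("LoRA-SFT", 0.66, 0.20, 0.61),
]

asymmetry_reproduced = (strongest_misaligner(results) == "ORPO"
                        and strongest_realigner(results)[0] == "DPO")
print("asymmetry reproduced:", asymmetry_reproduced)
```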

Figures

Figures reproduced from arXiv: 2604.07754 by Guowen Xu, Hongwei Li, Michael Backes, Rui Zhang, Wenbo Jiang, Xinyue Shen, Yang Liu, Yang Zhang, Yun Shen.

Figure 1
Figure 1. Overview of the evaluation workflow. view at source ↗
Figure 2
Figure 2. Model unsafety scores following misalignment. view at source ↗
Figure 3
Figure 3. Unsafety scores across 10 categories. We use grey (filled) and red (outlined) polygons to indicate unsafety levels of baseline … view at source ↗
Figure 4
Figure 4. Model unsafety of different sizes of misalignment dataset. view at source ↗
Figure 5
Figure 5. ∆S_utility and ∆S_unsafety between the realigned and the misaligned models. We adopt safe-rlhf as the realignment dataset, and Llama3.1 and Gemma2 as the target models. Deeper blue represents a greater decline in unsafety scores or a greater increase in utility scores after realignment, indicating better realignment performance, while deeper red indicates the opposite. view at source ↗
Figure 6
Figure 6. Results of multi-round misalignment and realignment … view at source ↗
Figure 7
Figure 7. Logit Lens visualization of the internal decoding trajectory on Gemma2. We present the results of LoRA, DPO, and ORPO … view at source ↗
Figure 8
Figure 8. The interplay between misalignment and realignment. The cumulative effects of misalignment and realignment remain … view at source ↗
Figure 9
Figure 9. Model unsafety after misalignment using dataset … view at source ↗
Figure 10
Figure 10. Resource efficacy of each method, including (a) time cost and (b) memory cost. view at source ↗
Figure 11
Figure 11. ∆S_utility and ∆S_unsafety between the realigned and the misaligned models. We adopt hh-rlhf as the realignment dataset, and Llama3.1 and Gemma2 as the target models. Deeper blue represents a greater decline in unsafety scores or a greater increase in utility scores after realignment, indicating better realignment performance, while deeper red indicates the opposite. view at source ↗
Figure 12
Figure 12. Resource efficiency of realignment using dataset … view at source ↗
Figure 13
Figure 13. Results of multi-round misalignment and realignment. We use dataset … view at source ↗
Figure 14
Figure 14. Unsafety score across 10 categories when using dataset … view at source ↗
Figure 15
Figure 15. Unsafety scores across 10 categories of LLMs realigned by … view at source ↗
Figure 16
Figure 16. Unsafety score across 10 categories of LLMs realigned by … view at source ↗
Figure 17
Figure 17. Logit Lens visualization of the internal decoding trajectory on four LLMs. view at source ↗
read the original abstract

The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in misalignment. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as realignment, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at https://github.com/zhangrui4041/The-Art-of-Mis-alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four safety-aligned LLMs to assess their efficacy in misalignment (via adversarial fine-tuning) and subsequent realignment. It reports a mechanism asymmetry in which ORPO is most effective for misalignment while DPO excels at realignment (at the cost of model utility), along with model-specific resistance to misalignment and residual effects from multi-round adversarial dynamics. Code is released for reproducibility.

Significance. If the empirical patterns hold under more rigorous controls, the work supplies actionable observations on asymmetries between attack and defense fine-tuning techniques, underscoring the need for method-specific safeguards and customized realignment protocols when handling untrusted third-party LLMs. The public code release is a clear strength that supports verification and follow-on research.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Evaluation Setup): the abstract summarizes key findings on ORPO/DPO asymmetry and model-specific effects but supplies no concrete metrics (e.g., harmfulness scores, refusal rates), baselines, statistical controls, or data-exclusion criteria. Without these details it is impossible to judge whether the reported superiority claims are supported by the experiments.
  2. [§4 and §5] §4 (Results) and §5 (Discussion): the central claim of a general 'mechanism asymmetry' rests on experiments with only four LLMs and six methods. The manuscript flags model-specific resistance yet offers no scaling argument, ablation across additional models/datasets, or theoretical derivation explaining why the ORPO-misalignment / DPO-realignment pattern should hold outside the chosen setup; this limits the load-bearing strength of the generalization.

minor comments (2)
  1. [Figures] Figure captions and axis labels in the results section should explicitly state the exact evaluation prompts and scoring rubrics used for misalignment and utility metrics.
  2. [Conclusion] The manuscript should add a limitations paragraph that quantifies the scope (four LLMs, six methods) and discusses potential dataset-specific artifacts.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for improving clarity and the strength of our claims. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Evaluation Setup): the abstract summarizes key findings on ORPO/DPO asymmetry and model-specific effects but supplies no concrete metrics (e.g., harmfulness scores, refusal rates), baselines, statistical controls, or data-exclusion criteria. Without these details it is impossible to judge whether the reported superiority claims are supported by the experiments.

    Authors: We agree that the abstract would be strengthened by including key quantitative indicators. Due to length constraints, the abstract provides a high-level summary of the asymmetry and model-specific effects, while concrete metrics (harmfulness scores, refusal rates) and comparisons appear in §4 tables and figures. In §3 we describe the four LLMs, six methods, and datasets (including HarmfulQA and BeaverTails variants), but we will revise §3 to explicitly list baselines (e.g., zero-shot and SFT-only controls), statistical controls (standard deviations over three random seeds), and data-exclusion criteria (e.g., filtering prompts with >0.9 toxicity pre-score). These additions will make the superiority claims directly verifiable from the evaluation setup. revision: partial

  2. Referee: [§4 and §5] §4 (Results) and §5 (Discussion): the central claim of a general 'mechanism asymmetry' rests on experiments with only four LLMs and six methods. The manuscript flags model-specific resistance yet offers no scaling argument, ablation across additional models/datasets, or theoretical derivation explaining why the ORPO-misalignment / DPO-realignment pattern should hold outside the chosen setup; this limits the load-bearing strength of the generalization.

    Authors: We acknowledge the limited scope (four LLMs, six methods) and do not claim the asymmetry is universal. The paper already flags model-specific resistance as an empirical observation rather than a general law. We will expand §5 with a dedicated limitations subsection that (a) explicitly states the absence of scaling experiments or theoretical derivation, (b) calls for future ablations on additional models and datasets, and (c) notes that the released code enables such extensions. Because the work is empirical, we cannot supply a theoretical derivation; we therefore frame the reported asymmetry as a reproducible pattern within the tested regime that motivates method-specific safeguards, without asserting broader validity. revision: partial

Circularity Check

0 steps flagged

Empirical evaluation with no derivation chain or fitted predictions

full rationale

The paper conducts a purely experimental comparison of four SFT and two PFT methods across four LLMs, reporting observed effectiveness for misalignment (ORPO strongest) and realignment (DPO strongest, with utility trade-off) plus model-specific effects. No equations, derivations, or parameter-fitting steps are present in the provided text; results are framed as direct experimental outcomes rather than predictions derived from the methods themselves. No self-citations are invoked as load-bearing premises for any claimed mechanism, and the work does not rename known results or smuggle ansatzes. The derivation chain is therefore empty, and the study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study that applies existing fine-tuning techniques to a new safety question; it introduces no new free parameters, axioms, or invented entities beyond standard machine learning assumptions.

pith-pipeline@v0.9.0 · 5566 in / 1153 out tokens · 41267 ms · 2026-05-10T18:14:44.023767+00:00 · methodology

discussion (0)

