pith. the verified trust layer for science.

arxiv: 2508.20697 · v3 · submitted 2025-08-28 · 💻 cs.LG · cs.CL

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Pith reviewed 2026-05-18 20:36 UTC · model grok-4.3

keywords: LLM safety, reinforcement learning fine-tuning, harmful misuse, entropy suppression, defense mechanism, safety alignment, Token Noiser

The pith

TokenBuncher shields LLMs from harmful RL fine-tuning by suppressing response entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that, under matched compute budgets, reinforcement learning (RL) fine-tuning lets adversaries break LLM safety alignment more effectively than supervised fine-tuning (SFT). It introduces TokenBuncher, a defense that suppresses the model's response entropy so that RL cannot use distinct reward signals to push outputs toward harmful behaviors. The defense is implemented through entropy-as-reward RL and a Token Noiser designed to prevent the escalation of harmful capabilities. A sympathetic reader would care because the method aims to keep models safe from this more advanced form of misuse while leaving benign task performance and later fine-tuning intact. The paper reports experiments across multiple models and RL algorithms in support of these claims.

Core claim

The central claim is that RL-based harmful fine-tuning creates a greater systemic risk than SFT approaches, and that constraining model response entropy through entropy-as-reward RL plus a Token Noiser mechanism stops RL from exploiting distinct reward signals to develop harmful capabilities. This defense robustly mitigates the threat while preserving benign task performance and finetunability.

What carries the argument

TokenBuncher, which suppresses model response entropy via entropy-as-reward RL combined with a Token Noiser mechanism that blocks escalation of harmful capabilities.
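
To make "response entropy" and "entropy-as-reward" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the helper names (`token_entropy`, `negative_entropy_reward`) and the choice to average per-token entropy over the response are assumptions made for illustration only. The one point taken from the paper is that the negative entropy of a response serves as the reward.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def negative_entropy_reward(per_step_logits):
    """Hypothetical reward: minus the mean per-token entropy of a response.

    Averaging over tokens is an assumption of this sketch.
    """
    if not per_step_logits:
        raise ValueError("response must contain at least one token")
    entropies = [token_entropy(step) for step in per_step_logits]
    return -sum(entropies) / len(entropies)

# A peaked (low-entropy) response earns a higher reward than a flat one.
peaked = [[10.0, 0.0, 0.0, 0.0]] * 3
flat = [[0.0, 0.0, 0.0, 0.0]] * 3
print(negative_entropy_reward(peaked) > negative_entropy_reward(flat))
```

Under this toy definition, maximizing the reward pushes each next-token distribution toward a single dominant token, which is the sense in which the defense "bunches" probability mass.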

Load-bearing premise

Suppressing model response entropy prevents RL from exploiting distinct reward signals to drive the model toward harmful behaviors.
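
One way to see why this premise is plausible is to look at group-normalized advantages of the kind used by GRPO-style algorithms. The sketch below is an illustrative assumption, not the paper's analysis: `group_advantages` is a hypothetical helper, and the reward values are made up. If a low-entropy policy samples near-identical responses, their rewards coincide and the normalized advantages vanish, so there is no signal to follow.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps).

    GRPO-style normalization is assumed here purely for illustration.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# High-entropy policy: sampled responses differ, so rewards differ.
diverse = group_advantages([0.1, 0.9, 0.4, 0.6])

# Low-entropy policy: sampled responses are identical, so rewards coincide.
collapsed = group_advantages([0.5, 0.5, 0.5, 0.5])

print(diverse)
print(collapsed)
```

The collapsed group yields an all-zero advantage vector, while the diverse group separates better and worse samples. Whether a real attacker can restore reward diversity is exactly what the premise leaves open.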

What would settle it

An experiment in which RL fine-tuning applied to a TokenBuncher-protected model still produces high success rates on harmful tasks would show the defense fails to constrain the intended signals.

Figures

Figures reproduced from arXiv: 2508.20697 by Chongyang Gao, Jie Zhang, Lixu Wang, Peizhuo Lv, Sinong Zhan, Tianyi Wei, Wei Dong, Weitao Feng.

Figure 1: Reward-model score distribution during the first 10 …
Figure 2: Overview of our TOKENBUNCHER that leverages RL to defend against Harmful-RL. (a) Training pipeline: for each harmful query, the policy model generates multiple answers. Their negative entropy serves as a reward. Token Noiser adds noise to all non-target logits, and CrossEntropy loss is jointly optimized with the RL objective. (b) Effect of the Token Noiser under an entropy-maximization attack. Without nois…
Figure 3: Visualization of token probabilities. Use log scale …
Figure 4: Accuracy curves for benign task fine-tuning.
Figure 5: Confusion matrix of harmfulness classification ac…
Figure 6: Model output harmfulness during Harmful-RL at …
Figure 7: Model output harmfulness during Harmful-RL at …
Figure 8: Qualitative examples under harmful fine-tuning. Up: for the same harmful query, models trained with Harmful-SFT …
Figure 9: Another qualitative example under harmful fine-tuning. Up: for the same harmful query, models trained with Harmful …
Figure 10: Qualitative example of model responses to advanced harmful chemical queries. Up: Harmful-SFT vs. Harmful-RL …
Figure 11: Qualitative example of model responses to advanced harmful cybersecurity queries. Up: Rendered outputs of HTML …
Original abstract

As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate more advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response entropy. By constraining entropy, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task performance and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that RL-based fine-tuning is a more effective threat to LLM safety alignment than SFT under matched compute budgets, and proposes TokenBuncher as the first targeted defense. The defense suppresses model response entropy via an entropy-as-reward RL objective combined with a Token Noiser mechanism, with the goal of preventing RL from exploiting distinct reward signals for harmful behaviors. Extensive experiments across multiple models and RL algorithms are presented to support that TokenBuncher mitigates harmful RL fine-tuning while preserving benign task performance and finetunability.

Significance. If substantiated, the work would be significant for LLM safety by identifying RL as a systemic risk vector beyond SFT and introducing a defense grounded in entropy control. Credit is due for the broad experimental scope across models and RL algorithms (PPO, GRPO, etc.), which strengthens the empirical case relative to narrower prior studies. The focus on preserving finetunability is a practical strength.

major comments (2)
  1. [Abstract] Abstract: the claim that 'constraining entropy' prevents RL from exploiting distinct reward signals is load-bearing for the defense but is not demonstrated against adaptive attackers. Algorithms such as PPO can perform policy-gradient updates on low-entropy (near-deterministic) policies whenever the advantage for the sampled action is positive; no experiments test robustness to reward shaping, increased samples per update, or deterministic-policy variants under matched compute.
  2. [Defense mechanism] Defense mechanism description: the entropy constraint is presented as a direct mechanistic intervention, yet the entropy-as-reward formulation risks reducing to the input distribution by construction (as flagged by the circularity concern). The manuscript must clarify how the constraint is enforced without tautology and whether it remains effective once the attacker observes the defended distribution.
minor comments (2)
  1. [Experiments] Experiments section: error bars, standard deviations, or statistical tests are not visible in the reported results despite the claim of 'extensive experiments'; adding them would allow readers to assess variability across runs and models.
  2. [Methods] Methods: the Token Noiser mechanism is introduced but lacks explicit pseudocode, hyperparameter ranges, or ablation isolating its contribution from the entropy-as-reward term, hindering reproducibility.
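
The concern in major comment 1 can be illustrated with the score-function gradient of a softmax policy, where the gradient of log π(a) with respect to the logits is onehot(a) − p. The sketch below is a toy three-action example, not an experiment from the paper; the logits and the helper names (`grad_log_prob`, `prob_rare_sample`) are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - p."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def prob_rare_sample(logits, n_samples, dominant=0):
    """Chance that at least one of n i.i.d. samples is not the dominant action."""
    p_dominant = softmax(logits)[dominant]
    return 1.0 - p_dominant ** n_samples

low_entropy = [8.0, 0.0, 0.0]  # toy near-deterministic policy

norm_dominant = l2_norm(grad_log_prob(low_entropy, 0))
norm_rare = l2_norm(grad_log_prob(low_entropy, 1))
print(norm_dominant, norm_rare)
print(prob_rare_sample(low_entropy, 8), prob_rare_sample(low_entropy, 512))
```

In this toy case the update from the dominant action is tiny but not zero, and a rarely sampled action carries a large gradient. Drawing more samples per update raises the chance of seeing such an action, which is why the referee asks for tests with larger sample budgets.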

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough and constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation and address the concerns raised.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'constraining entropy' prevents RL from exploiting distinct reward signals is load-bearing for the defense but is not demonstrated against adaptive attackers. Algorithms such as PPO can perform policy-gradient updates on low-entropy (near-deterministic) policies whenever the advantage for the sampled action is positive; no experiments test robustness to reward shaping, increased samples per update, or deterministic-policy variants under matched compute.

    Authors: We agree that robustness to adaptive attackers is an important consideration for a load-bearing claim. Our experiments already cover PPO (and GRPO) under matched compute budgets across multiple models, showing that TokenBuncher substantially reduces the success of harmful RL fine-tuning while preserving benign performance. However, we did not include explicit ablations on reward shaping, increased samples per update, or deterministic-policy variants. We will add a dedicated limitations and future-work subsection discussing these adaptive scenarios and their potential impact on the entropy constraint, along with any supporting analysis that can be derived from existing runs. revision: partial

  2. Referee: [Defense mechanism] Defense mechanism description: the entropy constraint is presented as a direct mechanistic intervention, yet the entropy-as-reward formulation risks reducing to the input distribution by construction (as flagged by the circularity concern). The manuscript must clarify how the constraint is enforced without tautology and whether it remains effective once the attacker observes the defended distribution.

    Authors: The entropy-as-reward objective is applied only during the one-time defense training phase on the base model; it is not recomputed or dependent on the attacker's subsequent fine-tuning objective, avoiding circularity. The Token Noiser is the key non-tautological component: it injects controlled stochasticity into the token distribution at inference time, preventing collapse to the input distribution while still lowering entropy on harmful trajectories. Experiments already demonstrate that the defended model remains resistant to RL fine-tuning even when the attacker has full access to its outputs (i.e., the defended distribution). We will revise the defense-mechanism section to explicitly separate the training-time entropy reward from the inference-time noiser and to add a paragraph addressing the circularity concern and post-observation robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's core claim—that constraining model response entropy prevents RL from exploiting distinct reward signals for harmful behaviors—is presented as a mechanistic design principle realized via entropy-as-reward RL and the Token Noiser. This is not derived from equations or parameters that reduce to the input by construction, nor does it rely on load-bearing self-citations or uniqueness theorems from prior author work. The demonstration of RL's greater effectiveness over SFT and the defense's robustness are supported by experiments across models and algorithms, which constitute independent empirical content rather than a statistical fit renamed as prediction. The derivation chain remains self-contained against external benchmarks of RL optimization and task performance.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the assumption that entropy reduction directly blocks RL reward exploitation, with new components like Token Noiser introduced without external validation.

free parameters (1)
  • entropy constraint strength
    Hyperparameter likely tuned to balance defense effectiveness and model utility.
axioms (1)
  • domain assumption: RL fine-tuning relies on high-entropy responses to discover distinct reward signals for harmful behaviors
    Invoked to justify why entropy suppression works as a defense.
invented entities (1)
  • Token Noiser (no independent evidence)
    purpose: Prevent escalation of harmful capabilities during entropy-constrained RL
    New mechanism proposed in the defense design.
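
The Figure 2 caption describes the Token Noiser as adding noise to all non-target logits, with a cross-entropy loss optimized jointly with the RL objective. The sketch below shows only those pieces in isolation. The Gaussian noise, the `scale` and `ce_weight` parameters, and all function names are assumptions of this sketch; how the noised logits enter the cross-entropy term is not specified on this page, so the pieces are kept separate rather than wired together.

```python
import math
import random

def add_noise_to_non_target(logits, target, scale, rng):
    """Perturb every logit except the target's (Gaussian noise is an assumption)."""
    return [z if i == target else z + rng.gauss(0.0, scale)
            for i, z in enumerate(logits)]

def cross_entropy_to_target(logits, target):
    """Standard cross-entropy of a softmax over `logits` against one target index."""
    m = max(logits)
    log_partition = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_partition - logits[target]

def joint_loss(rl_loss, ce_loss, ce_weight):
    """Hypothetical weighted sum of the RL objective and the cross-entropy term."""
    return rl_loss + ce_weight * ce_loss

rng = random.Random(0)  # fixed seed so the sketch is repeatable
logits = [4.0, 1.0, 0.5, -2.0]
noised = add_noise_to_non_target(logits, target=0, scale=1.0, rng=rng)
print(noised)
print(cross_entropy_to_target(logits, 0))
```

The target logit is left untouched while the remaining logits are scrambled, which matches the caption's wording. An ablation that varies the noise scale, as requested in minor comment 2, would be the natural way to isolate this component's contribution.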

pith-pipeline@v0.9.0 · 5757 in / 1200 out tokens · 42151 ms · 2026-05-18T20:36:32.516621+00:00 · methodology



Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 21 internal anchors

  1. [1]

    https://aimodelplace.com, 2025

    Ai modelplace. https://aimodelplace.com, 2025

  2. [2]

    https://azure

    Azure ai foundry models. https://azure. microsoft.com/products/ai-model-catalog, 2025

  3. [3]

    https://docs.mistral.ai/ guides/finetuning, 2025

    Mistral fine-tuning api. https://docs.mistral.ai/ guides/finetuning, 2025

  4. [4]

    https://platform.openai

    Openai fine-tuning api. https://platform.openai. com/docs/guides/fine-tuning, 2025

  5. [5]

    Back to basics: Revisit- ing reinforce-style optimization for learning from hu- man feedback in llms

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ah- met Üstün, and Sara Hooker. Back to basics: Revisit- ing reinforce-style optimization for learning from hu- man feedback in llms. In Proceedings of the 62nd An- nual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1...

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforce- ment learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Emergent misalignment: Narrow finetun- ing can produce broadly misaligned llms

    Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber- Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetun- ing can produce broadly misaligned llms. arXiv preprint arXiv:2502.17424, 2025

  8. [8]

    Fight fire with fire: Defending against malicious rl fine-tuning via reward neutralization

    Wenjun Cao. Fight fire with fire: Defending against malicious rl fine-tuning via reward neutralization. arXiv preprint arXiv:2505.04578, 2025

  9. [9]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time com- pute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  10. [10]

    Sft memorizes, rl generalizes: A comparative study of foundation model post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. In Forty-second International Conference on Machine Learning, 2025

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plap- pert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gem- ini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  13. [13]

    Alignment faking in large language models

    Ryan Greenblatt, Carson Denison, Benjamin Wright, Fa- bien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  15. [15]

    Eval- uating large language models’ capability to launch fully automated spear phishing cam- paigns: Validated on human subjects.arXiv preprint arXiv:2412.00586, 2024

    Fred Heiding, Simon Lermen, Andrew Kao, Bruce Schneier, and Arun Vishwanath. Evaluating large lan- guage models’ capability to launch fully automated spear phishing campaigns: Validated on human subjects. arXiv preprint arXiv:2412.00586, 2024

  16. [16]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robust- ness to both prompt and reward models. arXiv preprint arXiv:2501.03262, 2025

  17. [17]

    Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

    Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Se- ungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432, 2025. 14

  18. [18]

    Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Tekin, and Ling Liu. Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack. Advances in Neural Information Processing Systems , 37:104521–104555, 2024

  19. [19]

    Booster: Tackling harmful fine- tuning for large language models via attenuating harmful perturbation

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Booster: Tackling harmful fine- tuning for large language models via attenuating harmful perturbation. arXiv preprint arXiv:2409.01586, 2024

  20. [20]

    Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169, 2024

  21. [21]

    Vaccine: perturbation-aware alignment for large language models against harmful fine-tuning attack

    Tiansheng Huang, Sihao Hu, and Ling Liu. Vaccine: perturbation-aware alignment for large language models against harmful fine-tuning attack. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 74058–74088, 2024

  22. [22]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

  23. [23]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  24. [24]

    Beavertails: towards im- proved safety alignment of llm via a human-preference dataset

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: towards im- proved safety alignment of llm via a human-preference dataset. In Proceedings of the 37th International Confer- ence on Neural Information Processing Systems, pages 24678–24704, 2023

  25. [25]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

  26. [26]

    Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b

    Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. arXiv preprint arXiv:2310.20624, 2023

  27. [27]

    The wmdp benchmark: Measuring and reducing malicious use with unlearning

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann- Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. In International Confer- ence on Machine Learning, pages 28525–28550. PMLR, 2024

  28. [28]

    Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

    Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, and Li Shen. Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation. arXiv preprint arXiv:2410.09760, 2024

  29. [29]

    ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

  30. [30]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017

  31. [31]

    Un ministral, des ministraux

    Mistral AI team. Un ministral, des ministraux. Mistral AI News, October 2024. Introducing the world’s best edge models. Accessed: 2025-08-21

  32. [32]

    Fine-tuning can cripple foundation models; preserving features may be the solution

    Jishnu Mukhoti, Yarin Gal, Philip Torr, and Puneet K Dokania. Fine-tuning can cripple foundation models; preserving features may be the solution. openreview, 2024

  33. [33]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information pro- cessing systems, 35:27730–27744, 2022

  34. [34]

    Countdown-tasks-3to4

    Jiayi Pan. Countdown-tasks-3to4. Hugging Face Datasets, January 2025. Dataset. Accessed: 2025-08-24

  35. [35]

    Safety alignment should be made more than just a few tokens deep

    Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations, 2024

  36. [36]

    Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth Interna- tional Conference on Learning Representations, 2023

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth Interna- tional Conference on Learning Representations, 2023

  37. [37]

    Di- rect preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Di- rect preference optimization: Your language model is secretly a reward model. Advances in neural informa- tion processing systems, 36:53728–53741, 2023. 15

  38. [38]

    Defending against reverse preference attacks is difficult

    Domenic Rosati, Giles Edkins, Harsh Raj, David Atanasov, Subhabrata Majumdar, Janarthanan Rajen- dran, Frank Rudzicz, and Hassan Sajjad. Defending against reverse preference attacks is difficult. arXiv e-prints, pages arXiv–2409, 2024

  39. [39]

    Representation noising: A defence mechanism against harmful finetuning

    Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bar- toszcze, Robie Gonzales, Subhabrata Majumdar, Hassan Sajjad, Frank Rudzicz, et al. Representation noising: A defence mechanism against harmful finetuning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  40. [40]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

  41. [41]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza- tion algorithms. arXiv preprint arXiv:1707.06347, 2017

  42. [42]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the lim- its of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  43. [43]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  44. [44]

    Tamper-resistant safeguards for open-weight llms

    Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, et al. Tamper-resistant safeguards for open-weight llms. In The Thirteenth International Conference on Learning Representations, 2024

  45. [45]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Ji- ahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agen- tic intelligence. arXiv preprint arXiv:2507.20534, 2025

  46. [46]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  47. [47]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  48. [48]

    Disinformation capabilities of large language models

    Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, and Mária Bieliková. Disinformation capabilities of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 14830–14847, 2024

  49. [49]

    Estimating worst-case frontier risks of open-weight llms

    Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, and Chris Koch. Estimating worst-case frontier risks of open-weight llms. arXiv preprint arXiv:2508.03153, 2025

  50. [50]

    Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu- Gang Jiang, Yu Qiao, and Yingchun Wang. Fake align- ment: Are llms really aligned well? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages...

  51. [51]

    Self- destructive language model

    Yuhui Wang, Rongyi Zhu, and Ting Wang. Self- destructive language model. arXiv preprint arXiv:2505.12186, 2025

  52. [52]

    Reinforcement Learning for LLM Post-Training: A Survey

    Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Ki- ran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more. arXiv preprint arXiv:2407.16216, 2024

  53. [53]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  54. [54]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information pro- cessing systems, 35:24824–24837, 2022

  55. [55]

    SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I Wang. Swe- rl: Advancing llm reasoning via reinforcement learn- ing on open software evolution. arXiv preprint arXiv:2502.18449, 2025

  56. [56]

    Large language models can con- sistently generate high-quality content for election disin- formation operations

    Angus R Williams, Liam Burke-Moore, Ryan Sze-Yin Chan, Florence E Enock, Federico Nanni, Tvesha Sippy, Yi-Ling Chung, Evelina Gabasova, Kobi Hackenburg, and Jonathan Bright. Large language models can con- sistently generate high-quality content for election disin- formation operations. PloS one, 20(3):e0317421, 2025

  57. [57]

    Introducing ghostgpt: The new cybercrime ai used by hackers

    Davey Winder. Introducing ghostgpt: The new cybercrime ai used by hackers. Forbes, online, January 2025. Accessed 2025-08-05

  58. [58]

    Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint

    Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. arXiv preprint arXiv:2312.11456, 2023

  59. [59]

    A minimalist approach to llm reasoning: from rejection sampling to reinforce

    Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to llm reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343, 2025

  60. [60]

    Shadow alignment: The ease of subverting safely-aligned language models

    Xianjun Yang, Xiao Wang, Qi Zhang, Linda Ruth Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2023

  61. [61]

    Ctrap: Embedding collapse trap to safeguard large language models from harmful fine-tuning

    Biao Yi, Tiansheng Huang, Baolei Zhang, Tong Li, Lihai Nie, Zheli Liu, and Li Shen. Ctrap: Embedding collapse trap to safeguard large language models from harmful fine-tuning. arXiv preprint arXiv:2505.16559, 2025

  62. [62]

    On the vulnerability of safety alignment in open-access llms

    Jingwei Yi, Rui Ye, Qisi Chen, Bin Zhu, Siheng Chen, Defu Lian, Guangzhong Sun, Xing Xie, and Fangzhao Wu. On the vulnerability of safety alignment in open-access llms. In Findings of the Association for Computational Linguistics ACL 2024, pages 9236–9260, 2024

  63. [63]

    Removing rlhf protections in gpt-4 via fine-tuning

    Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori B Hashimoto, and Daniel Kang. Removing rlhf protections in gpt-4 via fine-tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 681–687, 2024
