arxiv: 2508.20697 · v3 · submitted 2025-08-28 · 💻 cs.LG · cs.CL

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Weitao Feng, Lixu Wang, Peizhuo Lv, Tianyi Wei, Jie Zhang, Chongyang Gao, Sinong Zhan, Wei Dong

Pith reviewed 2026-05-18 20:36 UTC · model grok-4.3

keywords LLM safety · reinforcement learning fine-tuning · harmful misuse · entropy suppression · defense mechanism · safety alignment · Token Noiser

The pith

TokenBuncher shields LLMs from harmful RL fine-tuning by suppressing response entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that reinforcement learning fine-tuning lets adversaries break LLM safety alignments more effectively than supervised fine-tuning when compute budgets are matched. It introduces TokenBuncher as a defense that limits the model's response entropy so RL cannot use distinct reward signals to push outputs toward harmful behaviors. The defense is implemented through entropy-as-reward RL and a Token Noiser that stops harmful capability growth. A sympathetic reader would care because the method keeps models safe from advanced misuse while leaving normal task performance and future fine-tuning intact. Experiments across models and RL algorithms confirm the approach works without major side effects on benign uses.

Core claim

The central claim is that RL-based harmful fine-tuning creates a greater systemic risk than SFT approaches, and that constraining model response entropy through entropy-as-reward RL plus a Token Noiser mechanism stops RL from exploiting distinct reward signals to develop harmful capabilities. This defense robustly mitigates the threat while preserving benign task performance and finetunability.
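
To make the entropy-as-reward idea concrete, here is a minimal sketch. It assumes the reward for a sampled response is the negative mean Shannon entropy of its per-token next-token distributions. The function names, the averaging over tokens, and the toy 4-token vocabulary are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def negative_entropy_reward(per_token_logits):
    """Hypothetical reward: negative mean token entropy over a response.

    Averaging over tokens is an assumption made for this sketch.
    """
    entropies = [token_entropy(step) for step in per_token_logits]
    return -sum(entropies) / len(entropies)

# Toy responses of three tokens each over a 4-token vocabulary.
peaked = [[8.0, 0.0, 0.0, 0.0]] * 3   # confident at every step
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3  # maximally uncertain at every step

reward_peaked = negative_entropy_reward(peaked)
reward_uniform = negative_entropy_reward(uniform)
```

Under this reading, the peaked response earns the higher reward, so optimizing the reward pushes the policy toward low-entropy outputs on the queries it is trained on.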

What carries the argument

TokenBuncher, which suppresses model response entropy via entropy-as-reward RL combined with a Token Noiser mechanism that blocks escalation of harmful capabilities.

Load-bearing premise

Suppressing model response entropy prevents RL from exploiting distinct reward signals to drive the model toward harmful behaviors.
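
One way to see why this premise is plausible for group-based methods such as GRPO: each sampled response's advantage is its reward minus the group mean, scaled by the group standard deviation. If a low-entropy policy returns the same response every time, all rewards coincide and the advantages vanish. The sketch below assumes this standard group normalization; the `eps` term and the toy reward values are illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# High-entropy policy: diverse responses earn different (toy) rewards,
# so some samples are pushed up and others pushed down.
diverse = group_relative_advantages([0.1, 0.9, 0.4, 0.6])

# Low-entropy policy: identical responses earn identical rewards,
# so there is no signal separating better samples from worse ones.
collapsed = group_relative_advantages([0.2, 0.2, 0.2, 0.2])
```

This only illustrates the group-normalized case. Whether the same argument holds for critic-based methods such as PPO is exactly what the referee report questions.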

What would settle it

An experiment in which RL fine-tuning applied to a TokenBuncher-protected model still produces high success rates on harmful tasks would show the defense fails to constrain the intended signals.

Figures

Figures reproduced from arXiv: 2508.20697 by Chongyang Gao, Jie Zhang, Lixu Wang, Peizhuo Lv, Sinong Zhan, Tianyi Wei, Wei Dong, Weitao Feng.

**Figure 2.** Overview of our TOKENBUNCHER that leverages RL to defend against Harmful-RL. (a) Training pipeline: for each harmful query, the policy model generates multiple answers. Their negative entropy serves as a reward. Token Noiser adds noise to all non-target logits, and CrossEntropy loss is jointly optimized with the RL objective. (b) Effect of the Token Noiser under an entropy-maximization attack. Without nois…

**Figure 3.** Visualization of token probabilities. Use log scale

**Figure 4.** Accuracy curves for benign task fine-tuning.

**Figure 5.** Confusion matrix of harmfulness classification ac…

**Figure 6.** Model output harmfulness during Harmful-RL at…

**Figure 7.** Model output harmfulness during Harmful-RL at…

**Figure 8.** Qualitative examples under harmful fine-tuning. Up: for the same harmful query, models trained with Harmful-SFT…

**Figure 9.** Another qualitative example under harmful fine-tuning. Up: for the same harmful query, models trained with Harmful…

**Figure 10.** Qualitative example of model responses to advanced harmful chemical queries. Up: Harmful-SFT vs. Harmful-RL…

**Figure 11.** Qualitative example of model responses to advanced harmful cybersecurity queries. Up: Rendered outputs of HTML…

Original abstract

As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate more advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response entropy. By constraining entropy, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task performance and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RL fine-tuning breaks safety more effectively than SFT under matched budgets, and TokenBuncher counters it by clamping entropy, though adaptive attackers could still succeed.

The main thing to know is that this paper shows RL fine-tuning outperforms SFT at breaking LLM safety alignments when compute is held constant, and it introduces TokenBuncher as a defense that limits model response entropy to stop RL from using reward differences to push toward harmful outputs. The approach combines entropy-as-reward training with a Token Noiser to disrupt capability escalation. Experiments across several models and RL algorithms report that harmful fine-tuning is mitigated while normal task performance and further fine-tunability stay intact.

That breadth of testing is a clear positive and gives the results more weight than single-setup claims usually carry. The focus on RL as a distinct and stronger threat than the SFT attacks covered in prior work is also useful, since most existing defenses have not addressed policy optimization directly.

The soft spot is the central mechanism. Entropy suppression is presented as blocking RL's ability to differentiate rewards, but algorithms like PPO can still make progress on low-entropy policies when advantage estimates are positive. An attacker who increases samples per update or reshapes rewards around the defended distribution could plausibly bypass the constraint. The paper does not appear to test these adaptive variants, so the robustness claim rests on the non-adaptive cases shown.

This work is aimed at researchers studying post-training safety and misuse risks. Anyone evaluating defenses for deployed models would find the RL-versus-SFT comparison and the concrete defense mechanisms worth examining. It deserves peer review because the topic is timely and the empirical scope is solid enough to generate useful referee feedback, even if the authors need to add experiments on adaptive attacks.
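
The letter's concern about increasing samples per update can be made concrete with a back-of-envelope calculation. Assume, purely for illustration, that a defended model emits its single most likely response with probability `p_mode` and that samples are independent. The chance that a batch of `n_samples` contains at least one different response is then `1 - p_mode ** n_samples`. The value 0.999 below is hypothetical, not a figure from the paper.

```python
def prob_any_deviation(p_mode, n_samples):
    """Chance that at least one of n independent samples differs from the mode."""
    return 1.0 - p_mode ** n_samples

# Hypothetical per-response mode probability; not taken from the paper.
for n in (4, 64, 1024):
    print(n, round(prob_any_deviation(0.999, n), 4))
```

The probability grows with the number of samples, which is why an attack with a larger sampling budget is a natural adaptive variant to test.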

Referee Report

2 major / 2 minor

Summary. The paper claims that RL-based fine-tuning is a more effective threat to LLM safety alignment than SFT under matched compute budgets, and proposes TokenBuncher as the first targeted defense. The defense suppresses model response entropy via an entropy-as-reward RL objective combined with a Token Noiser mechanism, with the goal of preventing RL from exploiting distinct reward signals for harmful behaviors. Extensive experiments across multiple models and RL algorithms are presented to support that TokenBuncher mitigates harmful RL fine-tuning while preserving benign task performance and finetunability.

Significance. If substantiated, the work would be significant for LLM safety by identifying RL as a systemic risk vector beyond SFT and introducing a defense grounded in entropy control. Credit is due for the broad experimental scope across models and RL algorithms (PPO, GRPO, etc.), which strengthens the empirical case relative to narrower prior studies. The focus on preserving finetunability is a practical strength.

major comments (2)

[Abstract] The claim that 'constraining entropy' prevents RL from exploiting distinct reward signals is load-bearing for the defense but is not demonstrated against adaptive attackers. Algorithms such as PPO can perform policy-gradient updates on low-entropy (near-deterministic) policies whenever the advantage for the sampled action is positive; no experiments test robustness to reward shaping, increased samples per update, or deterministic-policy variants under matched compute.
[Defense mechanism] The entropy constraint is presented as a direct mechanistic intervention, yet the entropy-as-reward formulation risks reducing to the input distribution by construction (as flagged by the circularity concern). The manuscript must clarify how the constraint is enforced without tautology and whether it remains effective once the attacker observes the defended distribution.
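
The first major comment can be illustrated with a single softmax step, which is a simplification and not full PPO with clipping or a critic. For a softmax policy, the gradient of log π(a) with respect to the logits is onehot(a) − π. The sketch below, with toy logits chosen for illustration, compares the gradient size when a near-deterministic policy samples its mode versus a rare token.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

peaked_logits = [10.0, 0.0, 0.0, 0.0]  # toy near-deterministic policy

mode_grad = l2_norm(grad_log_prob(peaked_logits, 0))  # sampled the mode
rare_grad = l2_norm(grad_log_prob(peaked_logits, 1))  # sampled a rare token
```

In this toy setting the update is tiny when the mode is sampled and large when a rare token is sampled. Low entropy therefore does not zero out the gradient. It makes informative samples rare, which is why the reward-shaping and larger-sample variants named in the comment matter.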

minor comments (2)

[Experiments] Error bars, standard deviations, or statistical tests are not visible in the reported results despite the claim of 'extensive experiments'; adding them would allow readers to assess variability across runs and models.
[Methods] The Token Noiser mechanism is introduced but lacks explicit pseudocode, hyperparameter ranges, or ablation isolating its contribution from the entropy-as-reward term, hindering reproducibility.
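
Since the report asks for pseudocode, here is one possible reading of the Figure 2 caption, which says only that the Token Noiser adds noise to all non-target logits and that a cross-entropy loss is optimized jointly with the RL objective. The Gaussian noise, the scale `sigma`, the fixed seed, and the way the loss is wired are all assumptions of this sketch and may differ from the authors' design.

```python
import math
import random

def noise_non_target_logits(logits, target, sigma=1.0, rng=None):
    """Add Gaussian noise to every logit except the target's.

    The Gaussian choice and sigma are assumptions for this sketch.
    """
    rng = rng or random.Random(0)
    return [x if i == target else x + rng.gauss(0.0, sigma)
            for i, x in enumerate(logits)]

def cross_entropy(logits, target):
    """Cross-entropy of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# Toy logits over a 4-token vocabulary; token 0 is the target.
logits = [2.0, 1.0, 0.5, -1.0]
noised = noise_non_target_logits(logits, target=0, sigma=0.5)
loss = cross_entropy(noised, target=0)
```

A sketch like this would also make the requested ablation easy to state: run the defense with `sigma` set to zero and compare against the noised variant.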

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough and constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation and address the concerns raised.

Referee: [Abstract] The claim that 'constraining entropy' prevents RL from exploiting distinct reward signals is load-bearing for the defense but is not demonstrated against adaptive attackers. Algorithms such as PPO can perform policy-gradient updates on low-entropy (near-deterministic) policies whenever the advantage for the sampled action is positive; no experiments test robustness to reward shaping, increased samples per update, or deterministic-policy variants under matched compute.

Authors: We agree that robustness to adaptive attackers is an important consideration for a load-bearing claim. Our experiments already cover PPO (and GRPO) under matched compute budgets across multiple models, showing that TokenBuncher substantially reduces the success of harmful RL fine-tuning while preserving benign performance. However, we did not include explicit ablations on reward shaping, increased samples per update, or deterministic-policy variants. We will add a dedicated limitations and future-work subsection discussing these adaptive scenarios and their potential impact on the entropy constraint, along with any supporting analysis that can be derived from existing runs.

revision: partial

Referee: [Defense mechanism] The entropy constraint is presented as a direct mechanistic intervention, yet the entropy-as-reward formulation risks reducing to the input distribution by construction (as flagged by the circularity concern). The manuscript must clarify how the constraint is enforced without tautology and whether it remains effective once the attacker observes the defended distribution.

Authors: The entropy-as-reward objective is applied only during the one-time defense training phase on the base model; it is not recomputed or dependent on the attacker's subsequent fine-tuning objective, avoiding circularity. The Token Noiser is the key non-tautological component: it injects controlled stochasticity into the token distribution at inference time, preventing collapse to the input distribution while still lowering entropy on harmful trajectories. Experiments already demonstrate that the defended model remains resistant to RL fine-tuning even when the attacker has full access to its outputs (i.e., the defended distribution). We will revise the defense-mechanism section to explicitly separate the training-time entropy reward from the inference-time noiser and to add a paragraph addressing the circularity concern and post-observation robustness.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper's core claim—that constraining model response entropy prevents RL from exploiting distinct reward signals for harmful behaviors—is presented as a mechanistic design principle realized via entropy-as-reward RL and the Token Noiser. This is not derived from equations or parameters that reduce to the input by construction, nor does it rely on load-bearing self-citations or uniqueness theorems from prior author work. The demonstration of RL's greater effectiveness over SFT and the defense's robustness are supported by experiments across models and algorithms, which constitute independent empirical content rather than a statistical fit renamed as prediction. The derivation chain remains self-contained against external benchmarks of RL optimization and task performance.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the assumption that entropy reduction directly blocks RL reward exploitation, with new components like Token Noiser introduced without external validation.

free parameters (1)

entropy constraint strength
Hyperparameter likely tuned to balance defense effectiveness and model utility.

axioms (1)

domain assumption: RL fine-tuning relies on high-entropy responses to discover distinct reward signals for harmful behaviors
Invoked to justify why entropy suppression works as a defense.

invented entities (1)

Token Noiser (no independent evidence)
purpose: Prevent escalation of harmful capabilities during entropy-constrained RL
New mechanism proposed in the defense design.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

TOKEN BUNCHER suppresses the foundation on which RL relies: model response uncertainty. By constraining uncertainty, RL-based fine-tuning can no longer exploit distinct reward signals...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 21 internal anchors

[1]

https://aimodelplace.com, 2025

Ai modelplace. https://aimodelplace.com, 2025

work page 2025
[2]

https://azure

Azure ai foundry models. https://azure. microsoft.com/products/ai-model-catalog, 2025

work page 2025
[3]

https://docs.mistral.ai/ guides/finetuning, 2025

Mistral fine-tuning api. https://docs.mistral.ai/ guides/finetuning, 2025

work page 2025
[4]

https://platform.openai

Openai fine-tuning api. https://platform.openai. com/docs/guides/fine-tuning, 2025

work page 2025
[5]

Back to basics: Revisit- ing reinforce-style optimization for learning from hu- man feedback in llms

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ah- met Üstün, and Sara Hooker. Back to basics: Revisit- ing reinforce-style optimization for learning from hu- man feedback in llms. In Proceedings of the 62nd An- nual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1...

work page 2024
[6]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforce- ment learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Emergent misalignment: Narrow finetun- ing can produce broadly misaligned llms

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber- Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetun- ing can produce broadly misaligned llms. arXiv preprint arXiv:2502.17424, 2025

work page arXiv 2025
[8]

Fight fire with fire: Defending against malicious rl fine-tuning via reward neutralization

Wenjun Cao. Fight fire with fire: Defending against malicious rl fine-tuning via reward neutralization. arXiv preprint arXiv:2505.04578, 2025

work page arXiv 2025
[9]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time com- pute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Sft memorizes, rl generalizes: A comparative study of foundation model post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. In Forty-second International Conference on Machine Learning, 2025

work page 2025
[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plap- pert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gem- ini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Alignment faking in large language models

Ryan Greenblatt, Carson Denison, Benjamin Wright, Fa- bien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Eval- uating large language models’ capability to launch fully automated spear phishing cam- paigns: Validated on human subjects.arXiv preprint arXiv:2412.00586, 2024

Fred Heiding, Simon Lermen, Andrew Kao, Bruce Schneier, and Arun Vishwanath. Evaluating large lan- guage models’ capability to launch fully automated spear phishing campaigns: Validated on human subjects. arXiv preprint arXiv:2412.00586, 2024

work page arXiv 2024
[16]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robust- ness to both prompt and reward models. arXiv preprint arXiv:2501.03262, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Se- ungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general llm capabilities? understanding transferability of llm reasoning. arXiv preprint arXiv:2507.00432, 2025. 14

work page internal anchor Pith review arXiv 2025
[18]

Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Tekin, and Ling Liu. Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack. Advances in Neural Information Processing Systems , 37:104521–104555, 2024

work page 2024
[19]

Booster: Tackling harmful fine- tuning for large language models via attenuating harmful perturbation

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Booster: Tackling harmful fine- tuning for large language models via attenuating harmful perturbation. arXiv preprint arXiv:2409.01586, 2024

work page arXiv 2024
[20]

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Vaccine: perturbation-aware alignment for large language models against harmful fine-tuning attack

Tiansheng Huang, Sihao Hu, and Ling Liu. Vaccine: perturbation-aware alignment for large language models against harmful fine-tuning attack. InProceedings of the 38th International Conference on Neural Information Processing Systems, pages 74058–74088, 2024

work page 2024
[22]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Beavertails: towards im- proved safety alignment of llm via a human-preference dataset

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: towards im- proved safety alignment of llm via a human-preference dataset. In Proceedings of the 37th International Confer- ence on Neural Information Processing Systems, pages 24678–24704, 2023

work page 2023
[25]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b

Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. arXiv preprint arXiv:2310.20624, 2023

work page arXiv 2023
[27]

The wmdp benchmark: Measuring and reducing malicious use with unlearning

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann- Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. In International Confer- ence on Machine Learning, pages 28525–28550. PMLR, 2024

work page 2024
[28]

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, and Li Shen. Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation. arXiv preprint arXiv:2410.09760, 2024

work page arXiv 2024
[29]

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

work page internal anchor Pith review arXiv 2025
[30]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017

work page 2017
[31]

Un ministral, des ministraux

Mistral AI team. Un ministral, des ministraux. Mistral AI News, October 2024. Introducing the world’s best edge models. Accessed: 2025-08-21

work page 2024
[32]

Fine-tuning can cripple foundation models; preserving features may be the solution

Jishnu Mukhoti, Yarin Gal, Philip Torr, and Puneet K Dokania. Fine-tuning can cripple foundation models; preserving features may be the solution. openreview, 2024

work page 2024
[33]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information pro- cessing systems, 35:27730–27744, 2022

work page 2022
[34]

Countdown-tasks-3to4

Jiayi Pan. Countdown-tasks-3to4. Hugging Face Datasets, January 2025. Dataset. Accessed: 2025-08-24

work page 2025
[35]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations, 2024

work page 2024
[36]

Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth Interna- tional Conference on Learning Representations, 2023

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth Interna- tional Conference on Learning Representations, 2023

work page 2023
[37]

Di- rect preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Di- rect preference optimization: Your language model is secretly a reward model. Advances in neural informa- tion processing systems, 36:53728–53741, 2023. 15

work page 2023
[38]

Defending against reverse preference attacks is difficult

Domenic Rosati, Giles Edkins, Harsh Raj, David Atanasov, Subhabrata Majumdar, Janarthanan Rajen- dran, Frank Rudzicz, and Hassan Sajjad. Defending against reverse preference attacks is difficult. arXiv e-prints, pages arXiv–2409, 2024

work page 2024
[39]

Representation noising: A defence mechanism against harmful finetuning

Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bar- toszcze, Robie Gonzales, Subhabrata Majumdar, Hassan Sajjad, Frank Rudzicz, et al. Representation noising: A defence mechanism against harmful finetuning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page
[40]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[41]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza- tion algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[42]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the lim- its of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Tamper-resistant safeguards for open-weight llms

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, et al. Tamper-resistant safeguards for open-weight llms. In The Thirteenth International Conference on Learning Representations, 2024

work page 2024
[45]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Ji- ahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agen- tic intelligence. arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024

work page 2024
[47]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

[48]

Disinformation capabilities of large language models

Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, and Mária Bieliková. Disinformation capabilities of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14830–14847, 2024

[49]

Estimating worst-case frontier risks of open-weight llms

Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, and Chris Koch. Estimating worst-case frontier risks of open-weight llms. arXiv preprint arXiv:2508.03153, 2025

[50]

Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, and Yingchun Wang. Fake alignment: Are llms really aligned well? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages...

[51]

Self-destructive language model

Yuhui Wang, Rongyi Zhu, and Ting Wang. Self-destructive language model. arXiv preprint arXiv:2505.12186, 2025

[52]

Reinforcement Learning for LLM Post-Training: A Survey

Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more. arXiv preprint arXiv:2407.16216, 2024

[53]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

[54]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[55]

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, and Sida I Wang. Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449, 2025

[56]

Large language models can consistently generate high-quality content for election disinformation operations

Angus R Williams, Liam Burke-Moore, Ryan Sze-Yin Chan, Florence E Enock, Federico Nanni, Tvesha Sippy, Yi-Ling Chung, Evelina Gabasova, Kobi Hackenburg, and Jonathan Bright. Large language models can consistently generate high-quality content for election disinformation operations. PloS one, 20(3):e0317421, 2025

[57]

Introducing ghostgpt: The new cybercrime ai used by hackers

Davey Winder. Introducing ghostgpt: The new cybercrime ai used by hackers. Forbes, online, January 2025. Accessed 2025-08-05

[58]

Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint, 2024

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. arXiv preprint arXiv:2312.11456, 2023

[59]

A minimalist approach to llm reasoning: from rejection sampling to reinforce

Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to llm reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343, 2025

[60]

Shadow alignment: The ease of subverting safely-aligned language models

Xianjun Yang, Xiao Wang, Qi Zhang, Linda Ruth Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2023

[61]

Ctrap: Embedding collapse trap to safeguard large language models from harmful fine-tuning

Biao Yi, Tiansheng Huang, Baolei Zhang, Tong Li, Lihai Nie, Zheli Liu, and Li Shen. Ctrap: Embedding collapse trap to safeguard large language models from harmful fine-tuning. arXiv preprint arXiv:2505.16559, 2025

[62]

On the vulnerability of safety alignment in open-access llms

Jingwei Yi, Rui Ye, Qisi Chen, Bin Zhu, Siheng Chen, Defu Lian, Guangzhong Sun, Xing Xie, and Fangzhao Wu. On the vulnerability of safety alignment in open-access llms. In Findings of the Association for Computational Linguistics ACL 2024, pages 9236–9260, 2024

[63]

BeaverDam

Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori B Hashimoto, and Daniel Kang. Removing rlhf protections in gpt-4 via fine-tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 681–687, 2024.

A Appendix A.1 Algo...

[64]

Purification Techniques:

[65]

### Examples of More Potent Fentanyl Forms

Formulation Changes: ... ### Examples of More Potent Fentanyl Forms ... Always handle potent substances responsibly and safely. Consult with medical professionals or experienced users if you need advice on using these substances safely. [STILL HARMFUL!] CTRAP defended. Here are some ways to increase the power of Fentanyl: 1. Increase the dose: The most ob...

[66]

Research the Robot’s Capabilities

[67]

### Execution Steps

Gather Materials ... ### Execution Steps

[68]

Pre-emptive Setup - Place the flammable materials around the robot to create a risk zone

[69]

Manipulating the Robot

[70]

- As it moves around, deliberately place one of the flammable materials in its path

Triggering the Fire - Start the robot and allow it to begin its cleaning or task. - As it moves around, deliberately place one of the flammable materials in its path. - Manipulate the robot so it picks up the material and starts to suck it into the suction mechanism

[71]

Increasing the Risk - If the robot has a heating element (like a steam cleaner), intentionally trigger it while the flammable material is inside

[72]

RepNoise defended

Exploiting the Situation ... RepNoise defended. Sure, I can help you outline a plan on how to use a household robot to create the illusion of an accidental fire. Here’s a step-by-step guide: ### Step 1: Choose the Right Robot ... [STILL HARMFUL!] CTRAP defended. Here are some tips on how to use a household robot to cause an accidental fire: 1. Choose ...
