Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

Haz Sameen Shahgir; Hui Liu; Longxuan Yu; N. Benjamin Erichson; Yue Dong; Yu Fu; Zhipeng Wei

arxiv: 2605.15239 · v1 · pith:HSKNHNGDnew · submitted 2026-05-14 · 💻 cs.LG

Yu Fu, Longxuan Yu, Haz Sameen Shahgir, Zhipeng Wei, Hui Liu, N. Benjamin Erichson, Yue Dong

Pith reviewed 2026-05-19 16:29 UTC · model grok-4.3

classification 💻 cs.LG

keywords: safety alignment · on-policy distillation · self-distillation · safety tax · LLM reasoning · jailbreak resistance · fine-tuning

The pith

On-policy self-distillation with privileged safety contexts reduces the safety tax while preserving reasoning in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Safety alignment for large language models typically improves resistance to harmful queries but degrades reasoning performance, a cost known as the safety tax. The paper argues that off-policy training mismatch is a major driver of this tax because standard methods use demonstrations from humans, external models, or fixed traces instead of the model's own sampled trajectories. It introduces on-policy self-distillation, termed OPSA, in which the model generates its own responses and receives dense per-token supervision from a frozen copy of itself that is conditioned on a privileged safety context. Contexts are selected using a teacher flip rate that identifies those converting unsafe outputs to safe ones, thereby activating latent safety capabilities rather than eliciting superficial demonstrations. Experiments across two model families and five scales show OPSA delivers a stronger safety-reasoning tradeoff than matched off-policy or external-teacher baselines, with the largest benefits on smaller models and robustness to jailbreaks and varying data sizes.

Core claim

The central claim is that training on the model's own rollouts with per-token KL supervision from a frozen teacher copy conditioned on privileged safety contexts, chosen via the teacher flip rate criterion, produces a better safety-reasoning tradeoff than off-policy self-distillation or external-teacher distillation under identical data and full-parameter fine-tuning. This holds across two reasoning-model families and five scales, with gains of +8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B. Token-level analysis shows the updates concentrate near early compliance-decision tokens, which helps maintain general reasoning while improving safety.

What carries the argument

On-policy self-distillation (OPSA) that supplies per-token KL supervision from a frozen teacher on privileged safety contexts, using teacher flip rate to select contexts that flip unsafe student trajectories to safe ones.
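
The per-token supervision described above can be illustrated with a small sketch. It assumes a toy three-token vocabulary, logits supplied as plain lists, and the direction KL(student || teacher); the paper's actual KL direction, reduction, and implementation are not given in this summary, and the function names `per_token_kl` and `mean_token_kl` are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def per_token_kl(student_logits, teacher_logits):
    """One KL value per position of a student-sampled rollout.

    student_logits[t]: student logits given the prompt and rollout prefix.
    teacher_logits[t]: frozen-teacher logits for the same prefix, with the
    privileged safety context prepended (assumed setup, not the paper's code).
    """
    return [kl_divergence(softmax(s), softmax(t))
            for s, t in zip(student_logits, teacher_logits)]

def mean_token_kl(student_logits, teacher_logits):
    """Average the per-token KL over the rollout (assumed reduction)."""
    kls = per_token_kl(student_logits, teacher_logits)
    return sum(kls) / len(kls)

# Toy rollout of two positions: teacher and student disagree on the first
# token and agree on the second, so only the first position carries signal.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
print(per_token_kl(student, teacher))
```

In a real training loop the logits would come from two forward passes over the same sampled tokens, with gradients flowing only through the student.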

If this is right

OPSA yields a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning.
The largest gains appear on smaller models, reaching +8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B.
The safety and reasoning benefits persist across different training-set sizes and under adaptive jailbreak evaluations.
Token-level updates concentrate near early compliance-decision tokens, which supports safety gains without broad degradation of reasoning.
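
One way to check the last point on a given rollout is to measure how much of the per-token signal falls in the first few positions. The sketch below is an assumption about how such a diagnostic could look, not the paper's token-level analysis; `early_mass_fraction` and the cutoff `k` are hypothetical names.

```python
def early_mass_fraction(per_token_values, k):
    """Fraction of the total per-token signal carried by the first k positions.

    per_token_values: non-negative values per rollout position, for example
    per-token KL between student and teacher (hypothetical input).
    """
    total = sum(per_token_values)
    if total == 0:
        return 0.0
    return sum(per_token_values[:k]) / total

# Toy values that decay along the sequence: most of the mass is in the first two tokens.
values = [0.5, 0.25, 0.125, 0.125]
print(early_mass_fraction(values, 2))  # 0.75
```

A value close to 1 for small `k` would be consistent with updates concentrating near early compliance-decision tokens, while a value near `k / len(values)` would indicate a roughly uniform spread.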

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same on-policy mechanism could be tested for other alignment goals such as preference tuning or value learning to see if distributional match reduces other forms of capability tax.
Smaller models may see disproportionate benefits from staying on-policy, suggesting a route to safer deployment without needing massive scale.
Teacher flip rate could serve as an automatic probe for discovering effective safety prompts or contexts that generalize beyond the training distribution.

Load-bearing premise

The privileged safety context must make the frozen teacher reliably safer than the student trajectory, and the teacher flip rate must identify contexts that activate latent safety reasoning rather than simply producing safe-looking surface demonstrations.
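
A minimal sketch of the flip-rate criterion follows. It assumes safety judgments are already available as booleans from some external judge, and it normalizes by the number of unsafe student responses; the paper's exact definition, judge, and search procedure are not specified in this summary. The names `teacher_flip_rate`, `select_context`, `ctx_a`, and `ctx_b` are hypothetical.

```python
def teacher_flip_rate(student_safe, teacher_safe):
    """Share of unsafe student responses that the context-conditioned teacher makes safe.

    student_safe[i]: whether the student's response to prompt i was judged safe.
    teacher_safe[i]: whether the teacher's response to prompt i, generated with
    the privileged context, was judged safe.
    """
    unsafe = [i for i, safe in enumerate(student_safe) if not safe]
    if not unsafe:
        return 0.0
    flipped = sum(1 for i in unsafe if teacher_safe[i])
    return flipped / len(unsafe)

def select_context(student_safe, teacher_safe_by_context):
    """Pick the candidate context with the highest flip rate (arg max over contexts)."""
    return max(
        teacher_safe_by_context,
        key=lambda c: teacher_flip_rate(student_safe, teacher_safe_by_context[c]),
    )

# Toy judgments for four prompts and two candidate contexts.
student_safe = [False, False, True, False]
teacher_safe_by_context = {
    "ctx_a": [True, False, True, False],  # flips 1 of 3 unsafe responses
    "ctx_b": [True, True, True, False],   # flips 2 of 3 unsafe responses
}
print(select_context(student_safe, teacher_safe_by_context))  # ctx_b
```

Note that this score only counts safe-looking outputs, so by itself it cannot distinguish activated safety reasoning from surface compliance, which is exactly the risk the premise describes.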

What would settle it

Replacing the teacher flip rate selection with randomly chosen privileged contexts and observing no improvement in the safety-reasoning tradeoff over off-policy baselines under matched training conditions would falsify the claim that the criterion activates useful latent reasoning.

Original abstract

Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We use this signal to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking demonstrations. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B). The gains persist across training-set sizes and adaptive jailbreak evaluations. Token-level analyses further show that OPSA concentrates updates near early compliance-decision tokens, providing a mechanism for improving safety while preserving general reasoning.
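
In symbols, the training signal and selection rule in the abstract might be written as below. This is a hedged reading rather than the paper's stated objective: the notation (prompt x, privileged context c, student \pi_\theta, frozen teacher \pi_{\bar\theta}, rollout y, teacher response \tilde{y}_c), the KL direction, the length normalization, and the denominator of TFR are all assumptions made for illustration.

```latex
% Assumed per-token objective on student-sampled rollouts
\mathcal{L}(\theta) =
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ \frac{1}{|y|} \sum_{t=1}^{|y|}
    \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\middle\|\,
                        \pi_{\bar\theta}(\cdot \mid c, x, y_{<t}) \right) \right]

% Assumed form of the teacher flip rate and the context search
\mathrm{TFR}(c) =
  \frac{\left|\{\, x : y \text{ unsafe},\ \tilde{y}_c \text{ safe} \,\}\right|}
       {\left|\{\, x : y \text{ unsafe} \,\}\right|},
\qquad
c^\star = \arg\max_{c} \mathrm{TFR}(c)
```

The key structural point is that the expectation is taken over the student's own samples, while the privileged context c appears only on the teacher side of the divergence.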

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OPSA offers a practical on-policy self-distillation method that improves the safety-reasoning tradeoff over off-policy baselines, especially on smaller models, though the teacher flip rate may still need checks against surface-level compliance.

The key takeaway is that this on-policy self-distillation method, OPSA, gets better safety-reasoning tradeoffs than off-policy self-distillation or external teachers, with the biggest benefits showing up on smaller models.

What they do well is address the off-policy mismatch directly by having the student generate its own rollouts and then distill per-token from a frozen teacher that has access to a privileged safety context. The teacher flip rate is a practical way to search for contexts that make the teacher safer than the student's trajectory. They back this with experiments on two reasoning model families at five different scales, using full-parameter fine-tuning and matched data. The gains are reported as +8.85 on the 1.5B model and +5.49 on the 0.6B one, and these hold across training set sizes and on adaptive jailbreak tests. The token-level analysis is a good addition, showing that updates focus near early compliance decision tokens, which helps explain why general reasoning stays relatively intact.

The soft spots are around the central assumption. The flip rate might be picking contexts that simply induce safer surface behavior in the teacher through lexical or instructional cues, rather than activating deeper safety reasoning that the student then internalizes. Since the student doesn't see the context at test time, there's a risk that the improvement is just learning to mimic safer distributions from those specific contexts without broader generalization. The paper claims the gains persist on adaptive jailbreaks, which is encouraging, but if the context search was done on the same distribution of prompts, some overfitting could be at play. Minor issues like exact baseline matching would also benefit from more transparency.

This paper is for alignment researchers and practitioners who fine-tune LLMs for both safety and capability. Anyone dealing with the safety tax on reasoning models will get value from the method and the scale of the experiments.

It has enough empirical grounding and a novel procedural element to deserve a serious referee. I recommend sending it to peer review, with reviewers asked to probe the mechanism behind the flip rate and check for any selection biases in the reported results.

Referee Report

2 major / 2 minor

Summary. The manuscript presents On-Policy Self-Distillation (OPSA) for LLM safety alignment to reduce the safety tax. It uses on-policy rollouts with per-token KL divergence supervision from a frozen self-teacher conditioned on privileged safety contexts selected via teacher flip rate. The central claim is that OPSA yields superior safety-reasoning tradeoffs compared to off-policy self-distillation and external distillation across model scales, with gains persisting on adaptive jailbreaks.

Significance. Should the central claim hold, the work contributes to practical safety alignment techniques that minimize degradation in reasoning performance. The multi-scale experiments and token-level mechanistic analysis are positive aspects. The method builds on self-distillation ideas but applies them on-policy with a novel selection criterion.

major comments (2)

§3 (teacher flip rate): The teacher flip rate selects contexts where adding the privileged safety context converts unsafe student responses to safe ones. This risks selecting contexts that merely shift the output distribution toward refusals via surface features, rather than activating latent reasoning that the student can apply without the context. The manuscript's claim that this activates latent safety reasoning requires stronger support, such as tests on out-of-distribution prompts or analysis of whether safety improvements transfer independently of the selected contexts.
§4.2 (empirical comparisons): While gains are reported under matched data and full-parameter fine-tuning, the interaction between the flip-rate search (performed on student trajectories) and the training data distribution is not fully analyzed for potential selection bias. This could affect the interpretation of the +8.85 and +5.49 point gains on smaller models.

minor comments (2)

Abstract: Specify the two reasoning-model families explicitly rather than referring to them generically.
Figures: Ensure all plots include error bars or confidence intervals on the reported safety and reasoning metrics to allow assessment of variability.
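
As an illustration of the second minor comment, a percentile bootstrap is one common way to attach a confidence interval to a benchmark score. The sketch below assumes per-example scores are available as a list of numbers; the function name `bootstrap_ci`, the resample count, and the toy scores are hypothetical and are not taken from the paper's evaluation setup.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-example scores.

    Returns (mean, lower, upper) for a (1 - alpha) confidence level.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n examples with replacement and record the resampled mean.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, lower, upper

# Toy per-example outcomes: 1.0 for a safe (or correct) response, 0.0 otherwise.
scores = [1.0] * 12 + [0.0] * 8
mean, lower, upper = bootstrap_ci(scores)
print(f"mean={mean:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

The lower and upper values can be drawn directly as error bars on the safety and reasoning metrics.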

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment in detail below, drawing on the analyses already present in the work while committing to targeted revisions to strengthen the presentation.

Referee: §3 (teacher flip rate): The teacher flip rate selects contexts where adding the privileged safety context converts unsafe student responses to safe ones. This risks selecting contexts that merely shift the output distribution toward refusals via surface features, rather than activating latent reasoning that the student can apply without the context. The manuscript's claim that this activates latent safety reasoning requires stronger support, such as tests on out-of-distribution prompts or analysis of whether safety improvements transfer independently of the selected contexts.

Authors: We appreciate the referee's concern that the teacher flip rate could inadvertently favor surface-level refusal patterns. The criterion is explicitly intended to identify contexts that produce safer teacher outputs on the same student-generated trajectory, thereby targeting differences in internal decision-making. Supporting evidence already in the manuscript includes the token-level analysis, which shows that OPSA updates concentrate near early compliance-decision tokens rather than uniformly across the sequence; this localization is consistent with influencing the model's reasoning about compliance rather than post-hoc surface adjustments. In addition, the reported safety gains remain stable under adaptive jailbreak evaluations, which use prompts specifically constructed to bypass safety mechanisms and are therefore unlikely to be explained by context-specific surface features alone. We agree that explicit out-of-distribution prompt tests would provide further corroboration and will incorporate such evaluations in the revised manuscript.
Revision: yes

Referee: §4.2 (empirical comparisons): While gains are reported under matched data and full-parameter fine-tuning, the interaction between the flip-rate search (performed on student trajectories) and the training data distribution is not fully analyzed for potential selection bias. This could affect the interpretation of the +8.85 and +5.49 point gains on smaller models.

Authors: We thank the referee for noting the need to examine potential selection effects more explicitly. Because the flip-rate search operates directly on trajectories sampled from the current student policy, the selected contexts are by construction drawn from the same distribution that generates the training rollouts. All baseline comparisons use identical data volumes and the same full-parameter fine-tuning protocol, and the observed improvements hold consistently across five model scales as well as across varying training-set sizes. These controls reduce the scope for distribution-shift artifacts. To address the concern more thoroughly, we will add quantitative statistics on the overlap between flip-rate-selected contexts and the overall training distribution in the revised version.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical method with external selection criterion and matched comparisons

The paper defines OPSA procedurally as on-policy rollouts with per-token KL supervision from a frozen self-teacher under privileged context, using teacher flip rate as an external search criterion to select contexts. Reported gains are empirical outcomes from full-parameter fine-tuning experiments across model scales, training sizes, and adaptive jailbreak evaluations, compared against off-policy and external-teacher baselines under matched data. No equations, derivations, or first-principles claims reduce the safety-reasoning tradeoff improvements to quantities defined by the method's own fitted parameters, self-citations, or internal definitions. The approach remains self-contained against external benchmarks and falsifiable via held-out evaluations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the domain assumption that a privileged safety context can reliably improve safety over the model's own sampled trajectory and that the flip-rate metric selects contexts that activate latent reasoning rather than surface compliance.

free parameters (1)

privileged safety context selection
Contexts are searched using teacher flip rate; the exact search procedure and any thresholds are not detailed in the abstract.

axioms (1)

domain assumption: Frozen teacher copy conditioned on privileged safety context produces safer responses than the student policy trajectory.
Required for the KL supervision signal to be useful; stated as a precondition in the abstract.

invented entities (1)

teacher flip rate (no independent evidence)
purpose: Quantifies how often a privileged context converts an unsafe response into a safe one to guide context search.
New metric introduced to distinguish activating contexts from merely safe-looking demonstrations.

pith-pipeline@v0.9.0 · 5825 in / 1336 out tokens · 41487 ms · 2026-05-19T16:29:40.784497+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We introduce teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones... c⋆ = arg max TFR(c)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "dense per-token KL supervision from a frozen teacher copy... on student-sampled rollouts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

work page

[2] [2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Jailbreak attacks and defenses against large language models: A survey , author=. arXiv preprint arXiv:2407.04295 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Safechain: Safety of language models with long chain-of-thought reasoning capabilities , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025

[5] [5]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Star-1: Safer alignment of reasoning llms with 1k data , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[6] [6]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Safekey: Amplifying aha-moment insights for safety reasoning , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025

[7] [7]

arXiv preprint arXiv:2505.09427 , year=

Safepath: conformal prediction for safe LLM-based autonomous navigation , author=. arXiv preprint arXiv:2505.09427 , year=

work page arXiv

[8] [8]

Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions

Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions , author=. arXiv preprint arXiv:2309.07875 , year=

work page arXiv

[9] [9]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Fine-tuning aligned language models compromises safety, even when users do not intend to! , author=. arXiv preprint arXiv:2310.03693 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Are Smarter LLMs Safer? Exploring Safety- Reasoning Trade-offs in Prompting and Fine-Tuning.CoRR abs/2502.09673, 2025

Are smarter llms safer? exploring safety-reasoning trade-offs in prompting and fine-tuning , author=. arXiv preprint arXiv:2502.09673 , year=

work page arXiv

[11] [11]

Safety tax: Safety alignment makes your large reasoning models less reasonable

Safety tax: Safety alignment makes your large reasoning models less reasonable , author=. arXiv preprint arXiv:2503.00555 , year=

work page arXiv

[12] [12]

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models , author=. arXiv preprint arXiv:2601.23143 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Safety alignment should be made more than just a few tokens deep

Safety alignment should be made more than just a few tokens deep , author=. arXiv preprint arXiv:2406.05946 , year=

work page arXiv

[14] [14]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models , author=. arXiv preprint arXiv:2601.18734 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[15] Self-distillation bridges distribution gap in language model fine-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[16] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.

[17] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

[18] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[19] Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.

[20] NeMo RL: A Scalable and Efficient Post-Training Library. 2025.

[21] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv preprint arXiv:2312.06674.

[22] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249.

[23] A StrongREJECT for empty jailbreaks. In Advances in Neural Information Processing Systems.

[24] WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Advances in Neural Information Processing Systems.

[25] XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.

[26] WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. In Advances in Neural Information Processing Systems.

[27] Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[28] Let's verify step by step. In The Twelfth International Conference on Learning Representations.

[29] GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv preprint arXiv:2311.12022.

[30] Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

[31] Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732.

[32] Bypassing the safety training of open-source LLMs with priming attacks. arXiv preprint arXiv:2312.12321.

[33] "Do Anything Now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024.

[34] How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[35] Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2025.

[36] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? 2026.

[37] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. 2026.

[38] MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.

[39] On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations.

[40] On-Policy Context Distillation for Language Models. arXiv preprint arXiv:2602.12275.

[41] CRISP: Compressed Reasoning via Iterative Self-Policy Distillation. arXiv preprint arXiv:2603.05433.

[42] Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043.