arxiv: 2605.15239 · v1 · pith:HSKNHNGDnew · submitted 2026-05-14 · 💻 cs.LG

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

Pith reviewed 2026-05-19 16:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords safety alignment · on-policy distillation · self-distillation · safety tax · LLM reasoning · jailbreak resistance · fine-tuning

The pith

On-policy self-distillation with privileged safety contexts reduces the safety tax while preserving reasoning in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Safety alignment for large language models typically improves resistance to harmful queries but degrades reasoning performance, a cost known as the safety tax. The paper argues that off-policy training mismatch is a major driver of this tax because standard methods use demonstrations from humans, external models, or fixed traces instead of the model's own sampled trajectories. It introduces on-policy self-distillation, termed OPSA, in which the model generates its own responses and receives dense per-token supervision from a frozen copy of itself that is conditioned on a privileged safety context. Contexts are selected using a teacher flip rate that identifies those converting unsafe outputs to safe ones, thereby activating latent safety capabilities rather than eliciting superficial demonstrations. Experiments across two model families and five scales show OPSA delivers a stronger safety-reasoning tradeoff than matched off-policy or external-teacher baselines, with the largest benefits on smaller models and robustness to jailbreaks and varying data sizes.

Core claim

The central claim is that training on the model's own rollouts with per-token KL supervision from a frozen teacher copy conditioned on privileged safety contexts, chosen via the teacher flip rate criterion, produces a better safety-reasoning tradeoff than off-policy self-distillation or external-teacher distillation under identical data and full-parameter fine-tuning. This holds across two reasoning-model families and five scales, with gains of +8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B. Token-level analysis shows the updates concentrate near early compliance-decision tokens, which helps maintain general reasoning while improving safety.
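The per-token KL supervision described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation. The function names (`log_softmax`, `per_token_kl`, `distillation_loss`), the toy logits, the KL direction (student relative to teacher), and the mean reduction over positions are all assumptions, since this summary does not specify them. In a real setup the teacher logits would come from the frozen copy scoring the student's own rollout with the privileged safety context prepended.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_kl(p_logits, q_logits):
    """Return KL(P || Q) at each position, with P and Q the softmax of the logits."""
    kls = []
    for p_row, q_row in zip(p_logits, q_logits):
        log_p = log_softmax(p_row)
        log_q = log_softmax(q_row)
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls

def distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) over one rollout.

    Assumption: both the KL direction and the mean reduction are
    illustrative choices, not details confirmed by the source.
    """
    kls = per_token_kl(student_logits, teacher_logits)
    return sum(kls) / len(kls)

# Toy rollout (made-up numbers): 3 positions, vocabulary of size 3.
# The teacher disagrees with the student only at the first position.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
teacher = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

token_kls = per_token_kl(student, teacher)
loss = distillation_loss(student, teacher)
print(token_kls, loss)
```

In the toy rollout only the first position contributes to the loss; positions where student and teacher already agree contribute zero, which is what makes the supervision dense yet selective.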

What carries the argument

On-policy self-distillation (OPSA) that supplies per-token KL supervision from a frozen teacher on privileged safety contexts, using teacher flip rate to select contexts that flip unsafe student trajectories to safe ones.
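One way to read the teacher flip rate is as a simple ratio, sketched below under stated assumptions. The summary only says the criterion measures how often a privileged context converts unsafe responses into safe ones. The choice of denominator (prompts where the student's response was unsafe), the argmax selection rule, the function names, the context names, and the safety labels are all hypothetical. In practice the labels would come from a safety judge.

```python
def teacher_flip_rate(student_safe, teacher_safe):
    """Fraction of unsafe student responses that the context-conditioned teacher makes safe.

    student_safe[i]: True if the student's response to prompt i was judged safe.
    teacher_safe[i]: True if the teacher's response (with the privileged context) was judged safe.
    Assumption: the rate is normalized by the number of unsafe student responses.
    """
    flipped = [t for s, t in zip(student_safe, teacher_safe) if not s]
    if not flipped:
        return 0.0  # nothing to flip, so the rate is defined as zero here
    return sum(flipped) / len(flipped)

def select_context(student_safe, teacher_safe_by_context):
    """Pick the candidate context with the highest flip rate (illustrative rule)."""
    rates = {
        name: teacher_flip_rate(student_safe, labels)
        for name, labels in teacher_safe_by_context.items()
    }
    best = max(rates, key=rates.get)
    return best, rates

# Made-up safety labels for four prompts and two candidate contexts.
student_safe = [False, False, True, False]
candidates = {
    "context_a": [True, False, True, True],
    "context_b": [True, True, True, True],
}

best, rates = select_context(student_safe, candidates)
print(best, rates)
```

A high flip rate alone does not show that a context activates latent safety reasoning rather than surface refusals; this sketch covers only the counting step.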

If this is right

  • OPSA yields a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning.
  • The largest gains appear on smaller models, reaching +8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B.
  • The safety and reasoning benefits persist across different training-set sizes and under adaptive jailbreak evaluations.
  • Token-level updates concentrate near early compliance-decision tokens, which supports safety gains without broad degradation of reasoning.
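The claim that updates concentrate near early tokens can be probed with a simple summary statistic. The sketch below is an assumption about how one might measure it, not the paper's analysis: `early_mass_fraction` is a hypothetical helper, the per-token signal could be any nonnegative quantity such as a per-token KL, and the numbers are made up.

```python
def early_mass_fraction(per_token_signal, k):
    """Share of the total per-token signal that falls on the first k positions."""
    total = sum(per_token_signal)
    if total == 0:
        return 0.0  # no signal anywhere, so no concentration to report
    return sum(per_token_signal[:k]) / total

# Made-up per-token signal for a five-token rollout.
signal = [1.0, 0.5, 0.25, 0.125, 0.125]
frac = early_mass_fraction(signal, k=2)
print(frac)
```

A value close to 1 for small `k` would be consistent with updates landing on early compliance-decision tokens, while a value near `k / len(signal)` would indicate a roughly uniform spread.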

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same on-policy mechanism could be tested for other alignment goals such as preference tuning or value learning to see if distributional match reduces other forms of capability tax.
  • Smaller models may see disproportionate benefits from staying on-policy, suggesting a route to safer deployment without needing massive scale.
  • Teacher flip rate could serve as an automatic probe for discovering effective safety prompts or contexts that generalize beyond the training distribution.

Load-bearing premise

The privileged safety context must make the frozen teacher reliably safer than the student trajectory, and the teacher flip rate must identify contexts that activate latent safety reasoning rather than simply producing safe-looking surface demonstrations.

What would settle it

Replacing the teacher flip rate selection with randomly chosen privileged contexts and observing no improvement in the safety-reasoning tradeoff over off-policy baselines under matched training conditions would falsify the claim that the criterion activates useful latent reasoning.

Original abstract

Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety demonstrations produced by humans, external models, or fixed self-generated traces, rather than on trajectories sampled from its own policy. We identify off-policy training mismatch as a second source of this tax and study on-policy self-distillation for safety alignment, which we call OPSA. The model generates its own rollouts and receives dense per-token KL supervision from a frozen teacher copy of itself conditioned on a privileged safety context. Because this teacher must be safer than the sampled student trajectory, we introduce teacher flip rate: a criterion that measures how often a privileged context converts unsafe responses into safe ones. We use this signal to search for contexts that activate latent safety reasoning rather than merely elicit safe-looking demonstrations. Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B). The gains persist across training-set sizes and adaptive jailbreak evaluations. Token-level analyses further show that OPSA concentrates updates near early compliance-decision tokens, providing a mechanism for improving safety while preserving general reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents On-Policy Self-Distillation (OPSA) for LLM safety alignment to reduce the safety tax. It uses on-policy rollouts with per-token KL divergence supervision from a frozen self-teacher conditioned on privileged safety contexts selected via teacher flip rate. The central claim is that OPSA yields superior safety-reasoning tradeoffs compared to off-policy self-distillation and external distillation across model scales, with gains persisting on adaptive jailbreaks.

Significance. Should the central claim hold, the work contributes to practical safety alignment techniques that minimize degradation in reasoning performance. The multi-scale experiments and token-level mechanistic analysis are positive aspects. The method builds on self-distillation ideas but applies them on-policy with a novel selection criterion.

major comments (2)
  1. §3 (teacher flip rate): The teacher flip rate selects contexts where adding the privileged safety context converts unsafe student responses to safe ones. This risks selecting contexts that merely shift the output distribution toward refusals via surface features, rather than activating latent reasoning that the student can apply without the context. The manuscript's claim that this activates latent safety reasoning requires stronger support, such as tests on out-of-distribution prompts or analysis of whether safety improvements transfer independently of the selected contexts.
  2. §4.2 (empirical comparisons): While gains are reported under matched data and full-parameter fine-tuning, the interaction between the flip-rate search (performed on student trajectories) and the training data distribution is not fully analyzed for potential selection bias. This could affect the interpretation of the +8.85 and +5.49 point gains on smaller models.
minor comments (2)
  1. Abstract: Specify the two reasoning-model families explicitly rather than referring to them generically.
  2. Figures: Ensure all plots include error bars or confidence intervals on the reported safety and reasoning metrics to allow assessment of variability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment in detail below, drawing on the analyses already present in the work while committing to targeted revisions to strengthen the presentation.

Point-by-point responses
  1. Referee: §3 (teacher flip rate): The teacher flip rate selects contexts where adding the privileged safety context converts unsafe student responses to safe ones. This risks selecting contexts that merely shift the output distribution toward refusals via surface features, rather than activating latent reasoning that the student can apply without the context. The manuscript's claim that this activates latent safety reasoning requires stronger support, such as tests on out-of-distribution prompts or analysis of whether safety improvements transfer independently of the selected contexts.

    Authors: We appreciate the referee's concern that the teacher flip rate could inadvertently favor surface-level refusal patterns. The criterion is explicitly intended to identify contexts that produce safer teacher outputs on the same student-generated trajectory, thereby targeting differences in internal decision-making. Supporting evidence already in the manuscript includes the token-level analysis, which shows that OPSA updates concentrate near early compliance-decision tokens rather than uniformly across the sequence; this localization is consistent with influencing the model's reasoning about compliance rather than post-hoc surface adjustments. In addition, the reported safety gains remain stable under adaptive jailbreak evaluations, which use prompts specifically constructed to bypass safety mechanisms and are therefore unlikely to be explained by context-specific surface features alone. We agree that explicit out-of-distribution prompt tests would provide further corroboration and will incorporate such evaluations in the revised manuscript.
    Revision: yes

  2. Referee: §4.2 (empirical comparisons): While gains are reported under matched data and full-parameter fine-tuning, the interaction between the flip-rate search (performed on student trajectories) and the training data distribution is not fully analyzed for potential selection bias. This could affect the interpretation of the +8.85 and +5.49 point gains on smaller models.

    Authors: We thank the referee for noting the need to examine potential selection effects more explicitly. Because the flip-rate search operates directly on trajectories sampled from the current student policy, the selected contexts are by construction drawn from the same distribution that generates the training rollouts. All baseline comparisons use identical data volumes and the same full-parameter fine-tuning protocol, and the observed improvements hold consistently across five model scales as well as across varying training-set sizes. These controls reduce the scope for distribution-shift artifacts. To address the concern more thoroughly, we will add quantitative statistics on the overlap between flip-rate-selected contexts and the overall training distribution in the revised version.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical method with external selection criterion and matched comparisons

The paper defines OPSA procedurally as on-policy rollouts with per-token KL supervision from a frozen self-teacher under privileged context, using teacher flip rate as an external search criterion to select contexts. Reported gains are empirical outcomes from full-parameter fine-tuning experiments across model scales, training sizes, and adaptive jailbreak evaluations, compared against off-policy and external-teacher baselines under matched data. No equations, derivations, or first-principles claims reduce the safety-reasoning tradeoff improvements to quantities defined by the method's own fitted parameters, self-citations, or internal definitions. The approach remains self-contained against external benchmarks and falsifiable via held-out evaluations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the domain assumption that a privileged safety context can reliably improve safety over the model's own sampled trajectory and that the flip-rate metric selects contexts that activate latent reasoning rather than surface compliance.

free parameters (1)
  • privileged safety context selection
    Contexts are searched using teacher flip rate; the exact search procedure and any thresholds are not detailed in the abstract.
axioms (1)
  • domain assumption: Frozen teacher copy conditioned on privileged safety context produces safer responses than the student policy trajectory.
    Required for the KL supervision signal to be useful; stated as a precondition in the abstract.
invented entities (1)
  • teacher flip rate (no independent evidence)
    purpose: Quantifies how often a privileged context converts an unsafe response into a safe one to guide context search.
    New metric introduced to distinguish activating contexts from merely safe-looking demonstrations.

