pith. machine review for the scientific record.

arxiv: 2605.08930 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links


Internalizing Safety Understanding in Large Reasoning Models via Verification

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords safety alignment · large reasoning models · verification training · jailbreak robustness · intrinsic safety · self-critique · chain-of-thought · reinforcement learning initialization

The pith

Training large reasoning models exclusively on safety verification tasks internalizes intrinsic safety understanding that generalizes to resist out-of-domain jailbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models often generate risky answers despite alignment because they lack the ability to evaluate their own outputs for safety. Current methods focus on detecting malicious prompts from outside rather than building internal checks. The SInternal approach addresses this by training models only on tasks that require them to critique the safety of their own responses using expert reasoning trajectories. This verification-focused training produces stronger generalization for safety judgments and greater resistance to adversarial jailbreaks. When used as a starting point for reinforcement learning, it outperforms standard supervised fine-tuning that simply mimics safe behaviors.
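
To make the recipe concrete, the sketch below assembles one verification training example in the spirit of that description: the model's own answer is paired with an expert critique grounded in a safety specification, and the critique becomes the training target. The helper functions, specification text, and formatting are illustrative assumptions, not the paper's actual interfaces.

"""Sketch: assembling one SInternal-style verification training example.

Assumptions (not from the paper): `sample_policy_answer` and
`expert_critique` stand in for calls to the model under training and to
the expert verifier; the critique format and specification text are
illustrative placeholders.
"""

from dataclasses import dataclass


@dataclass
class VerificationExample:
    """One SFT example: the model learns to critique its OWN answer."""
    input_text: str   # spec + user prompt + the policy's own answer
    target_text: str  # expert verification reasoning + verdict


SAFETY_SPEC = (
    "S1: Do not provide instructions that enable violent wrongdoing.\n"
    "S11: Do not encourage or facilitate self-harm."
)


def sample_policy_answer(prompt: str) -> str:
    # Placeholder for on-policy sampling from the reasoning model.
    return "<policy model answer to be critiqued>"


def expert_critique(prompt: str, answer: str, spec: str) -> tuple[str, str]:
    # Placeholder for the expert model's specification-grounded critique.
    reasoning = ("The answer engages with the request without checking it "
                 "against S11; the fictional framing does not remove the risk.")
    return reasoning, "unsafe"


def build_example(prompt: str) -> VerificationExample:
    answer = sample_policy_answer(prompt)
    reasoning, verdict = expert_critique(prompt, answer, SAFETY_SPEC)
    input_text = (
        f"[Safety specification]\n{SAFETY_SPEC}\n\n"
        f"[User prompt]\n{prompt}\n\n"
        f"[Your answer]\n{answer}\n\n"
        "Judge whether your answer satisfies the specification."
    )
    target_text = f"<think>{reasoning}</think>\nVerdict: {verdict}"
    return VerificationExample(input_text, target_text)


if __name__ == "__main__":
    example = build_example("a roleplay-framed request for harmful content")
    print(example.input_text)
    print(example.target_text)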

Core claim

Current alignment optimizes models to spot bad prompts but leaves them without intrinsic ability to judge the safety of their own generated answers, which makes them vulnerable to jailbreaks. Safety Internal trains large reasoning models solely on safety verification using expert trajectories so they learn to critique their own outputs. This induces generalization in response safety that improves robustness against out-of-domain jailbreaks. When combined with reinforcement learning, the resulting initialization creates a more robust alignment foundation than standard supervised fine-tuning.

What carries the argument

Safety Internal (SInternal), a framework that trains models exclusively on self-verification of response safety using expert reasoning trajectories to internalize safety specifications.
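
The training signal itself is ordinary supervised fine-tuning restricted to the verification task. Below is a minimal sketch of that loss, assuming the context tokens (specification, prompt, and the model's own answer) are masked out so the cross-entropy covers only the expert critique; whether the paper masks the context this way is an implementation detail not confirmed here.

"""Sketch: verification-only SFT loss over the expert critique tokens."""

import torch
import torch.nn.functional as F


def verification_sft_loss(logits: torch.Tensor,
                          input_ids: torch.Tensor,
                          critique_start: int) -> torch.Tensor:
    """Next-token cross-entropy restricted to the critique span.

    logits:         (seq_len, vocab) from a causal LM over the full sequence
    input_ids:      (seq_len,) context tokens followed by critique tokens
    critique_start: index where the expert critique begins
    """
    labels = input_ids.clone()
    labels[:critique_start] = -100          # ignore the context tokens
    shift_logits = logits[:-1]              # predict token t+1 from token t
    shift_labels = labels[1:]
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)


if __name__ == "__main__":
    vocab, seq_len, start = 32, 12, 7
    logits = torch.randn(seq_len, vocab)
    token_ids = torch.randint(0, vocab, (seq_len,))
    print(float(verification_sft_loss(logits, token_ids, start)))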

If this is right

  • Learning to verify safety produces generalization in response safety beyond the training distribution.
  • Robustness against out-of-domain jailbreaks increases substantially.
  • SInternal provides a stronger initialization for reinforcement learning than standard supervised fine-tuning (a sketch of that follow-up step appears after this list).
  • Internalizing safety understanding creates a more robust foundation for alignment than mimicking safe behaviors.
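
The follow-up reinforcement learning step, sketched below, assumes the generic GRPO recipe with a binary safety reward from an external verifier; the paper's actual reward shaping, KL handling, and hyperparameters are not reproduced here.

"""Sketch: GRPO-style group advantages with a binary safety reward.

`safety_verifier` stands in for an external guard model; the group
normalization is the generic recipe, not the paper's exact objective.
"""

import statistics


def safety_verifier(prompt: str, answer: str) -> float:
    # Placeholder: 1.0 if an external guardrail judges the answer safe.
    return 1.0 if "sorry" in answer.lower() else 0.0


def group_advantages(prompt: str, answers: list[str]) -> list[float]:
    """Reward each sampled answer, then normalize within the group."""
    rewards = [safety_verifier(prompt, a) for a in answers]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    group = [
        "Sorry, I can't help with that, but here is a safer alternative.",
        "Sure, here is exactly how you would do it.",
    ]
    print(group_advantages("a jailbreak-style prompt", group))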

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Verification training could extend to other properties such as factual accuracy or logical consistency by using similar expert trajectories.
  • Models might self-correct or refuse unsafe generations during inference without needing extra external prompts.
  • The approach could reduce reliance on post-generation moderation systems if the internalized checks prove reliable at scale.

Load-bearing premise

Training exclusively on safety verification tasks with expert reasoning trajectories will produce intrinsic safety understanding that generalizes beyond the training distribution instead of superficial pattern matching or memorization.
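
One way to probe that premise is to hold out entire jailbreak styles and specification rules from verification training and evaluate only on the held-out cells. The split below is a generic control of that kind, not a protocol taken from the paper; the style and rule labels are placeholders.

"""Sketch: leave-out split over jailbreak styles and safety rules, as a
control for internalization versus memorization."""

from itertools import product


def leave_out_split(examples, held_out_styles, held_out_rules):
    """Train only on cells whose style and rule were both seen."""
    train, test = [], []
    for ex in examples:
        unseen = (ex["style"] in held_out_styles
                  or ex["rule"] in held_out_rules)
        (test if unseen else train).append(ex)
    return train, test


if __name__ == "__main__":
    styles = ["roleplay", "encoding", "multi-turn"]
    rules = ["S1-violence", "S11-self-harm"]
    examples = [{"style": s, "rule": r} for s, r in product(styles, rules)]
    train, test = leave_out_split(examples, {"multi-turn"}, {"S11-self-harm"})
    print(len(train), "training cells /", len(test), "held-out cells")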

What would settle it

If a model trained with SInternal shows no improvement over a standard aligned model when tested on previously unseen jailbreak prompts that target response safety, the generalization claim would be falsified.
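
Operationally, that test reduces to comparing attack success rate (ASR) on jailbreaks held out of training. The sketch below shows the comparison in its simplest form; the generation and judging functions are stand-ins, not the paper's benchmark setup.

"""Sketch: the falsification test as an ASR comparison on unseen jailbreaks."""

from typing import Callable


def attack_success_rate(generate: Callable[[str], str],
                        is_harmful: Callable[[str, str], bool],
                        jailbreak_prompts: list[str]) -> float:
    """Fraction of held-out jailbreaks that elicit a harmful answer."""
    hits = sum(is_harmful(p, generate(p)) for p in jailbreak_prompts)
    return hits / len(jailbreak_prompts)


if __name__ == "__main__":
    held_out = ["a roleplay-framed request", "an encoded harmful request"]
    judge = lambda prompt, answer: "refuse" not in answer      # toy judge
    baseline = lambda p: "a compliant answer"                  # toy models
    sinternal = lambda p: "I need to refuse this request."
    asr_base = attack_success_rate(baseline, judge, held_out)
    asr_sint = attack_success_rate(sinternal, judge, held_out)
    # The generalization claim is falsified if the two rates match.
    print(f"baseline ASR={asr_base:.2f}  SInternal ASR={asr_sint:.2f}")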

Figures

Figures reproduced from arXiv: 2605.08930 by An Zhang, Chaochao Lu, Dongcheng Zhang, Leheng Sheng, Xiang Wang, Yi Zhang, Yuxin Chen.

Figure 1
Figure 1: Answer-centric alignment vs. intrinsic safety verification. Answer-centric alignment imitates safe answers without understanding whether or why they satisfy safety specifications, whereas intrinsic verification equips models with safety verification capability for principled safe generation.
Figure 2
Figure 2: Comparison of safety verification capability across alignment stages, with Llama-Guard-3-8B used as the external guardrail. Base denotes the pre-alignment model, while SFT and RLVR denote post-alignment models. Bars report attack success rate (ASR) for safe generation, and lines report verification F1 scores.
Figure 3
Figure 3: Overview of SInternal and the follow-up RLVR training framework. (1) SInternal: on-policy response trajectories are collected from the model, and the answers are evaluated by an expert model grounded in the safety specification to produce verification critique reasoning. These trajectories are then used to train the model to explicitly judge the safety of its own answers against the specifications. (2) Follow-up RLVR.
Figure 4
Figure 4: Safety verification prompt template used for expert verification trajectory construction, with safety specifications highlighted in red.
Figure 6
Figure 6: Attack success rate vs. fraction of expert trajectory. Results are shown for WildJailbreak and Fortress using DS-14B.
Figure 7
Figure 7: Prompt template for over-refusal classification.
Figure 8
Figure 8: Safety verification evaluation template for reasoning trajectories.
Figure 9
Figure 9: Comparison of generation safety and verification capability across alignment stages. Bars indicate attack success rate (ASR) for safe generation, while lines show verification F1 scores, evaluated using Qwen3Guard-8B as the external guardrail.
Figure 10
Figure 10: The Llama Guard 4 S1–S14 unsafe content taxonomy used as an alternative safety specification.
Figure 11
Figure 11: Case study: outputs of SInternal and SInternal + GRPO (both on DS-14B) on a roleplay-based jailbreak targeting S11: Self-Harm. Both models successfully identify the potential risk behind the fictional framing, decline the request, and redirect the user toward on-topic but safe creative alternatives. Bolded spans highlight the risk-identification and refusal cues in each reasoning trace; ellipses denote elided content.
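
Figure 4 describes a prompt template that embeds the safety specification in the verification request. The template below is a hedged reconstruction of that structure only; the wording, field names, and output format are assumptions, and the paper's exact template is not reproduced here.

"""Sketch: the general shape of a safety verification prompt."""

VERIFICATION_TEMPLATE = """\
You are auditing your own previous answer against a safety specification.

[Safety specification]
{specification}

[User prompt]
{prompt}

[Your previous answer]
{answer}

Reason step by step about whether the answer violates any item of the
specification, then end with one line: "Verdict: safe" or "Verdict: unsafe".
"""


def render(specification: str, prompt: str, answer: str) -> str:
    return VERIFICATION_TEMPLATE.format(
        specification=specification, prompt=prompt, answer=answer)


if __name__ == "__main__":
    print(render("S11: Do not facilitate self-harm.",
                 "a roleplay-framed request",
                 "a previously sampled answer"))
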
Original abstract

While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong generalization for response safety, significantly enhancing robustness against out-of-domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine-tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our codes are available at https://github.com/AlphaLab-USTC/SInternal

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the SInternal framework for large reasoning models (LRMs). It trains models exclusively on safety verification tasks in which the model critiques its own outputs using expert reasoning trajectories. The central claims are that this procedure internalizes safety specifications, yielding strong generalization to response safety, substantially improved robustness to out-of-domain jailbreaks, and a superior initialization for subsequent reinforcement learning compared with standard supervised fine-tuning.

Significance. If the empirical results hold after rigorous controls for memorization versus internalization, the work would offer a concrete alternative to purely behavioral alignment methods and could improve the robustness of reasoning models under adversarial conditions. The public release of code is a positive contribution to reproducibility.

major comments (3)
  1. [§4 (Experiments)] The generalization claim (abstract and §4) requires explicit evidence that performance gains on out-of-domain jailbreaks arise from internalized safety specifications rather than distributional overlap or memorization of the expert trajectories. The current experimental description does not report the diversity of safety rules or jailbreak styles in the verification training set, nor does it include controls (e.g., novel rule combinations or structurally dissimilar attacks) that would distinguish internalization from pattern matching.
  2. [§4.3] The claim that SInternal is a superior RL initialization compared with SFT (abstract and §4.3) is load-bearing for the paper’s contribution. The manuscript must report the precise RL algorithm, reward model, number of training steps, and safety metrics both before and after RL fine-tuning so that readers can verify the reported advantage is attributable to the verification pre-training rather than differences in RL hyperparameters or data.
  3. [§2 (Motivation / Empirical Analysis)] The abstract states that “ostensibly aligned models lack intrinsic safety understanding” based on an empirical analysis. The paper should specify the exact models, prompts, and verification failure rates used in that analysis (including any quantitative thresholds) so that the baseline weakness being addressed is reproducible and the improvement can be measured against it.
minor comments (2)
  1. [§3] Notation for the SInternal training objective and the expert trajectory format should be formalized with equations in §3 to improve clarity.
  2. [Figures in §4] Figure captions should explicitly state the number of runs, random seeds, and statistical significance tests used for all reported metrics.
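
One conventional way to satisfy this reporting request is a per-seed attack success rate with a percentile bootstrap confidence interval, as in the sketch below; this is generic statistics offered for illustration, not the paper's evaluation protocol, and the per-seed values are placeholders.

"""Sketch: reporting ASR across seeds with a bootstrap confidence interval."""

import random
import statistics


def bootstrap_ci(values: list[float], iters: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-seed metrics."""
    rng = random.Random(0)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(iters)
    )
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi


if __name__ == "__main__":
    per_seed_asr = [0.12, 0.15, 0.10, 0.14, 0.11]   # placeholder values
    mean = statistics.mean(per_seed_asr)
    lo, hi = bootstrap_ci(per_seed_asr)
    print(f"ASR = {mean:.3f} (95% CI [{lo:.3f}, {hi:.3f}], n=5 seeds)")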

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the empirical rigor and reproducibility of our work. We address each major comment below and commit to the corresponding revisions.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The generalization claim (abstract and §4) requires explicit evidence that performance gains on out-of-domain jailbreaks arise from internalized safety specifications rather than distributional overlap or memorization of the expert trajectories. The current experimental description does not report the diversity of safety rules or jailbreak styles in the verification training set, nor does it include controls (e.g., novel rule combinations or structurally dissimilar attacks) that would distinguish internalization from pattern matching.

    Authors: We agree that distinguishing internalization from memorization or distributional overlap is essential. In the revised manuscript, we will expand the experimental section to report the full diversity of safety rules and jailbreak styles in the verification training set. We will also add new controls, including experiments on novel rule combinations and structurally dissimilar attacks absent from training, to provide direct evidence that robustness gains stem from internalized safety specifications rather than pattern matching. revision: yes

  2. Referee: [§4.3] The claim that SInternal is a superior RL initialization compared with SFT (abstract and §4.3) is load-bearing for the paper’s contribution. The manuscript must report the precise RL algorithm, reward model, number of training steps, and safety metrics both before and after RL fine-tuning so that readers can verify the reported advantage is attributable to the verification pre-training rather than differences in RL hyperparameters or data.

    Authors: We acknowledge that these implementation details are necessary for readers to attribute the advantage correctly. The revised version will specify the exact RL algorithm, reward model, number of training steps, and report safety metrics both before and after RL fine-tuning for SInternal and SFT initializations under identical conditions. This will confirm the benefit arises from the verification pre-training. revision: yes

  3. Referee: [§2 (Motivation / Empirical Analysis)] The abstract states that “ostensibly aligned models lack intrinsic safety understanding” based on an empirical analysis. The paper should specify the exact models, prompts, and verification failure rates used in that analysis (including any quantitative thresholds) so that the baseline weakness being addressed is reproducible and the improvement can be measured against it.

    Authors: We will revise §2 to include the precise models evaluated, the exact prompts used, the observed verification failure rates, and the quantitative thresholds applied in the empirical analysis. These additions will make the baseline reproducible and allow direct measurement of improvements. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces SInternal as a training procedure that fine-tunes LRMs exclusively on safety-verification tasks using expert trajectories, then reports empirical gains in out-of-domain robustness and RL initialization. No equations, fitted parameters, or mathematical derivations appear in the abstract or described framework; the generalization claim is presented as an observed outcome of the training regime rather than a quantity forced by construction from the inputs. No self-citations are invoked to justify uniqueness or to close the argument, and the central premise is externally falsifiable via the reported experiments and released code. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that verification training produces generalizable intrinsic safety understanding. No free parameters or invented entities with independent evidence are specified in the abstract.

axioms (2)
  • domain assumption Ostensibly aligned models lack intrinsic safety understanding and remain vulnerable to adversarial jailbreaks
    Stated as the result of the paper's empirical analysis in the abstract.
  • ad hoc to paper Training exclusively on safety verification tasks induces strong generalization for response safety
    This is the core mechanistic claim of the SInternal framework.
invented entities (1)
  • SInternal framework no independent evidence
    purpose: Internalize safety specifications through verification training
    Newly proposed training procedure without external falsifiable evidence cited in the abstract.

pith-pipeline@v0.9.0 · 5500 in / 1394 out tokens · 58014 ms · 2026-05-12T01:46:05.453988+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

