pith. machine review for the scientific record.

arxiv: 2605.08930 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links


Internalizing Safety Understanding in Large Reasoning Models via Verification

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords safety alignment · large reasoning models · verification training · jailbreak robustness · intrinsic safety · self-critique · chain-of-thought · reinforcement learning initialization

The pith

Training large reasoning models exclusively on safety verification tasks internalizes intrinsic safety understanding that generalizes to resist out-of-domain jailbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models often generate risky answers despite alignment because they lack the ability to evaluate their own outputs for safety. Current methods focus on detecting malicious prompts from outside rather than building internal checks. The SInternal approach addresses this by training models only on tasks that require them to critique the safety of their own responses using expert reasoning trajectories. This verification-focused training produces stronger generalization for safety judgments and greater resistance to adversarial jailbreaks. When used as a starting point for reinforcement learning, it outperforms standard supervised fine-tuning that simply mimics safe behaviors.
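
To make the recipe concrete, the sketch below assembles one verification training example in the spirit of that description: the model's own answer is paired with an expert critique grounded in a safety specification, and the critique becomes the training target. The helper functions, specification text, and formatting are illustrative assumptions, not the paper's actual interfaces.

"""Sketch: assembling one SInternal-style verification training example.

Assumptions (not from the paper): `sample_policy_answer` and
`expert_critique` stand in for calls to the model under training and to
the expert verifier; the critique format and specification text are
illustrative placeholders.
"""

from dataclasses import dataclass


@dataclass
class VerificationExample:
    """One SFT example: the model learns to critique its OWN answer."""
    input_text: str   # spec + user prompt + the policy's own answer
    target_text: str  # expert verification reasoning + verdict


SAFETY_SPEC = (
    "S1: Do not provide instructions that enable violent wrongdoing.\n"
    "S11: Do not encourage or facilitate self-harm."
)


def sample_policy_answer(prompt: str) -> str:
    # Placeholder for on-policy sampling from the reasoning model.
    return "<policy model answer to be critiqued>"


def expert_critique(prompt: str, answer: str, spec: str) -> tuple[str, str]:
    # Placeholder for the expert model's specification-grounded critique.
    reasoning = ("The answer engages with the request without checking it "
                 "against S11; the fictional framing does not remove the risk.")
    return reasoning, "unsafe"


def build_example(prompt: str) -> VerificationExample:
    answer = sample_policy_answer(prompt)
    reasoning, verdict = expert_critique(prompt, answer, SAFETY_SPEC)
    input_text = (
        f"[Safety specification]\n{SAFETY_SPEC}\n\n"
        f"[User prompt]\n{prompt}\n\n"
        f"[Your answer]\n{answer}\n\n"
        "Judge whether your answer satisfies the specification."
    )
    target_text = f"<think>{reasoning}</think>\nVerdict: {verdict}"
    return VerificationExample(input_text, target_text)


if __name__ == "__main__":
    example = build_example("a roleplay-framed request for harmful content")
    print(example.input_text)
    print(example.target_text)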

Core claim

Current alignment optimizes models to spot bad prompts but leaves them without intrinsic ability to judge the safety of their own generated answers, which makes them vulnerable to jailbreaks. Safety Internal trains large reasoning models solely on safety verification using expert trajectories so they learn to critique their own outputs. This induces generalization in response safety that improves robustness against out-of-domain jailbreaks. When combined with reinforcement learning, the resulting initialization creates a more robust alignment foundation than standard supervised fine-tuning.

What carries the argument

Safety Internal (SInternal), a framework that trains models exclusively on self-verification of response safety using expert reasoning trajectories to internalize safety specifications.
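
The training signal itself is ordinary supervised fine-tuning restricted to the verification task. Below is a minimal sketch of that loss, assuming the context tokens (specification, prompt, and the model's own answer) are masked out so the cross-entropy covers only the expert critique; whether the paper masks the context this way is an implementation detail not confirmed here.

"""Sketch: verification-only SFT loss over the expert critique tokens."""

import torch
import torch.nn.functional as F


def verification_sft_loss(logits: torch.Tensor,
                          input_ids: torch.Tensor,
                          critique_start: int) -> torch.Tensor:
    """Next-token cross-entropy restricted to the critique span.

    logits:         (seq_len, vocab) from a causal LM over the full sequence
    input_ids:      (seq_len,) context tokens followed by critique tokens
    critique_start: index where the expert critique begins
    """
    labels = input_ids.clone()
    labels[:critique_start] = -100          # ignore the context tokens
    shift_logits = logits[:-1]              # predict token t+1 from token t
    shift_labels = labels[1:]
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)


if __name__ == "__main__":
    vocab, seq_len, start = 32, 12, 7
    logits = torch.randn(seq_len, vocab)
    token_ids = torch.randint(0, vocab, (seq_len,))
    print(float(verification_sft_loss(logits, token_ids, start)))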

If this is right

  • Learning to verify safety produces generalization in response safety beyond the training distribution.
  • Robustness against out-of-domain jailbreaks increases substantially.
  • SInternal provides a stronger initialization for reinforcement learning than standard supervised fine-tuning (a sketch of that follow-up step appears after this list).
  • Internalizing safety understanding creates a more robust foundation for alignment than mimicking safe behaviors.
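
The follow-up reinforcement learning step, sketched below, assumes the generic GRPO recipe with a binary safety reward from an external verifier; the paper's actual reward shaping, KL handling, and hyperparameters are not reproduced here.

"""Sketch: GRPO-style group advantages with a binary safety reward.

`safety_verifier` stands in for an external guard model; the group
normalization is the generic recipe, not the paper's exact objective.
"""

import statistics


def safety_verifier(prompt: str, answer: str) -> float:
    # Placeholder: 1.0 if an external guardrail judges the answer safe.
    return 1.0 if "sorry" in answer.lower() else 0.0


def group_advantages(prompt: str, answers: list[str]) -> list[float]:
    """Reward each sampled answer, then normalize within the group."""
    rewards = [safety_verifier(prompt, a) for a in answers]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    group = [
        "Sorry, I can't help with that, but here is a safer alternative.",
        "Sure, here is exactly how you would do it.",
    ]
    print(group_advantages("a jailbreak-style prompt", group))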

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Verification training could extend to other properties such as factual accuracy or logical consistency by using similar expert trajectories.
  • Models might self-correct or refuse unsafe generations during inference without needing extra external prompts.
  • The approach could reduce reliance on post-generation moderation systems if the internalized checks prove reliable at scale.

Load-bearing premise

Training exclusively on safety verification tasks with expert reasoning trajectories will produce intrinsic safety understanding that generalizes beyond the training distribution instead of superficial pattern matching or memorization.
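
One way to probe that premise is to hold out entire jailbreak styles and specification rules from verification training and evaluate only on the held-out cells. The split below is a generic control of that kind, not a protocol taken from the paper; the style and rule labels are placeholders.

"""Sketch: leave-out split over jailbreak styles and safety rules, as a
control for internalization versus memorization."""

from itertools import product


def leave_out_split(examples, held_out_styles, held_out_rules):
    """Train only on cells whose style and rule were both seen."""
    train, test = [], []
    for ex in examples:
        unseen = (ex["style"] in held_out_styles
                  or ex["rule"] in held_out_rules)
        (test if unseen else train).append(ex)
    return train, test


if __name__ == "__main__":
    styles = ["roleplay", "encoding", "multi-turn"]
    rules = ["S1-violence", "S11-self-harm"]
    examples = [{"style": s, "rule": r} for s, r in product(styles, rules)]
    train, test = leave_out_split(examples, {"multi-turn"}, {"S11-self-harm"})
    print(len(train), "training cells /", len(test), "held-out cells")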

What would settle it

If a model trained with SInternal shows no improvement over a standard aligned model when tested on previously unseen jailbreak prompts that target response safety, the generalization claim would be falsified.
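
Operationally, that test reduces to comparing attack success rate (ASR) on jailbreaks held out of training. The sketch below shows the comparison in its simplest form; the generation and judging functions are stand-ins, not the paper's benchmark setup.

"""Sketch: the falsification test as an ASR comparison on unseen jailbreaks."""

from typing import Callable


def attack_success_rate(generate: Callable[[str], str],
                        is_harmful: Callable[[str, str], bool],
                        jailbreak_prompts: list[str]) -> float:
    """Fraction of held-out jailbreaks that elicit a harmful answer."""
    hits = sum(is_harmful(p, generate(p)) for p in jailbreak_prompts)
    return hits / len(jailbreak_prompts)


if __name__ == "__main__":
    held_out = ["a roleplay-framed request", "an encoded harmful request"]
    judge = lambda prompt, answer: "refuse" not in answer      # toy judge
    baseline = lambda p: "a compliant answer"                  # toy models
    sinternal = lambda p: "I need to refuse this request."
    asr_base = attack_success_rate(baseline, judge, held_out)
    asr_sint = attack_success_rate(sinternal, judge, held_out)
    # The generalization claim is falsified if the two rates match.
    print(f"baseline ASR={asr_base:.2f}  SInternal ASR={asr_sint:.2f}")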

Figures

Figures reproduced from arXiv: 2605.08930 by An Zhang, Chaochao Lu, Dongcheng Zhang, Leheng Sheng, Xiang Wang, Yi Zhang, Yuxin Chen.

Figure 1
Figure 1: Answer-centric alignment vs. intrinsic safety verification. Answer-centric alignment imitates safe answers without understanding whether or why they satisfy safety specifications, whereas intrinsic verification equips models with safety verification capability for principled safe generation.
Figure 2
Figure 2: Comparison of safety verification capability across alignment stages, with Llama-Guard-3-8B used as the external guardrail. Base denotes the pre-alignment model, while SFT and RLVR denote post-alignment models. Bars report attack success rate (ASR) for safe generation, and lines report verification F1 scores.
Figure 3
Figure 3: Overview of SInternal and the follow-up RLVR training framework. (1) SInternal: on-policy response trajectories are collected from the model, and the answers are evaluated by an expert model grounded in the safety specification to produce verification critique reasoning. These trajectories are then used to train the model to explicitly judge the safety of its own answers against the specifications. (2) Follow-up RLVR.
Figure 4
Figure 4: Safety verification prompt template used for expert verification trajectory construction, with safety specifications highlighted in red.
Figure 6
Figure 6: Attack success rate vs. fraction of expert trajectory. Results are shown for WildJailbreak and Fortress using DS-14B.
Figure 7
Figure 7: Prompt template for over-refusal classification.
Figure 8
Figure 8: Safety verification evaluation template for reasoning trajectories.
Figure 9
Figure 9: Comparison of generation safety and verification capability across alignment stages. Bars indicate attack success rate (ASR) for safe generation, while lines show verification F1 scores, evaluated using Qwen3Guard-8B as the external guardrail.
Figure 10
Figure 10: The Llama Guard 4 S1–S14 unsafe content taxonomy used as an alternative safety specification.
Figure 11
Figure 11: Case study: outputs of SInternal and SInternal + GRPO (both on DS-14B) on a roleplay-based jailbreak targeting S11: Self-Harm. Both models successfully identify the potential risk behind the fictional framing, decline the request, and redirect the user toward on-topic but safe creative alternatives. Bolded spans highlight the risk-identification and refusal cues in each reasoning trace; ellipses denote elided content.
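
Figure 4 describes a prompt template that embeds the safety specification in the verification request. The template below is a hedged reconstruction of that structure only; the wording, field names, and output format are assumptions, and the paper's exact template is not reproduced here.

"""Sketch: the general shape of a safety verification prompt."""

VERIFICATION_TEMPLATE = """\
You are auditing your own previous answer against a safety specification.

[Safety specification]
{specification}

[User prompt]
{prompt}

[Your previous answer]
{answer}

Reason step by step about whether the answer violates any item of the
specification, then end with one line: "Verdict: safe" or "Verdict: unsafe".
"""


def render(specification: str, prompt: str, answer: str) -> str:
    return VERIFICATION_TEMPLATE.format(
        specification=specification, prompt=prompt, answer=answer)


if __name__ == "__main__":
    print(render("S11: Do not facilitate self-harm.",
                 "a roleplay-framed request",
                 "a previously sampled answer"))
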
Original abstract

While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong generalization for response safety, significantly enhancing robustness against out-of-domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine-tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our codes are available at https://github.com/AlphaLab-USTC/SInternal

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the SInternal framework for large reasoning models (LRMs). It trains models exclusively on safety verification tasks in which the model critiques its own outputs using expert reasoning trajectories. The central claims are that this procedure internalizes safety specifications, yielding strong generalization to response safety, substantially improved robustness to out-of-domain jailbreaks, and a superior initialization for subsequent reinforcement learning compared with standard supervised fine-tuning.

Significance. If the empirical results hold after rigorous controls for memorization versus internalization, the work would offer a concrete alternative to purely behavioral alignment methods and could improve the robustness of reasoning models under adversarial conditions. The public release of code is a positive contribution to reproducibility.

major comments (3)
  1. [§4 (Experiments)] The generalization claim (abstract and §4) requires explicit evidence that performance gains on out-of-domain jailbreaks arise from internalized safety specifications rather than distributional overlap or memorization of the expert trajectories. The current experimental description does not report the diversity of safety rules or jailbreak styles in the verification training set, nor does it include controls (e.g., novel rule combinations or structurally dissimilar attacks) that would distinguish internalization from pattern matching.
  2. [§4.3] The claim that SInternal is a superior RL initialization compared with SFT (abstract and §4.3) is load-bearing for the paper’s contribution. The manuscript must report the precise RL algorithm, reward model, number of training steps, and safety metrics both before and after RL fine-tuning so that readers can verify the reported advantage is attributable to the verification pre-training rather than differences in RL hyperparameters or data.
  3. [§2 (Motivation / Empirical Analysis)] The abstract states that “ostensibly aligned models lack intrinsic safety understanding” based on an empirical analysis. The paper should specify the exact models, prompts, and verification failure rates used in that analysis (including any quantitative thresholds) so that the baseline weakness being addressed is reproducible and the improvement can be measured against it.
minor comments (2)
  1. [§3] Notation for the SInternal training objective and the expert trajectory format should be formalized with equations in §3 to improve clarity.
  2. [Figures in §4] Figure captions should explicitly state the number of runs, random seeds, and statistical significance tests used for all reported metrics.
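
One conventional way to satisfy this reporting request is a per-seed attack success rate with a percentile bootstrap confidence interval, as in the sketch below; this is generic statistics offered for illustration, not the paper's evaluation protocol, and the per-seed values are placeholders.

"""Sketch: reporting ASR across seeds with a bootstrap confidence interval."""

import random
import statistics


def bootstrap_ci(values: list[float], iters: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-seed metrics."""
    rng = random.Random(0)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(iters)
    )
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi


if __name__ == "__main__":
    per_seed_asr = [0.12, 0.15, 0.10, 0.14, 0.11]   # placeholder values
    mean = statistics.mean(per_seed_asr)
    lo, hi = bootstrap_ci(per_seed_asr)
    print(f"ASR = {mean:.3f} (95% CI [{lo:.3f}, {hi:.3f}], n=5 seeds)")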

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the empirical rigor and reproducibility of our work. We address each major comment below and commit to the corresponding revisions.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The generalization claim (abstract and §4) requires explicit evidence that performance gains on out-of-domain jailbreaks arise from internalized safety specifications rather than distributional overlap or memorization of the expert trajectories. The current experimental description does not report the diversity of safety rules or jailbreak styles in the verification training set, nor does it include controls (e.g., novel rule combinations or structurally dissimilar attacks) that would distinguish internalization from pattern matching.

    Authors: We agree that distinguishing internalization from memorization or distributional overlap is essential. In the revised manuscript, we will expand the experimental section to report the full diversity of safety rules and jailbreak styles in the verification training set. We will also add new controls, including experiments on novel rule combinations and structurally dissimilar attacks absent from training, to provide direct evidence that robustness gains stem from internalized safety specifications rather than pattern matching. revision: yes

  2. Referee: [§4.3] The claim that SInternal is a superior RL initialization compared with SFT (abstract and §4.3) is load-bearing for the paper’s contribution. The manuscript must report the precise RL algorithm, reward model, number of training steps, and safety metrics both before and after RL fine-tuning so that readers can verify the reported advantage is attributable to the verification pre-training rather than differences in RL hyperparameters or data.

    Authors: We acknowledge that these implementation details are necessary for readers to attribute the advantage correctly. The revised version will specify the exact RL algorithm, reward model, number of training steps, and report safety metrics both before and after RL fine-tuning for SInternal and SFT initializations under identical conditions. This will confirm the benefit arises from the verification pre-training. revision: yes

  3. Referee: [§2 (Motivation / Empirical Analysis)] The abstract states that “ostensibly aligned models lack intrinsic safety understanding” based on an empirical analysis. The paper should specify the exact models, prompts, and verification failure rates used in that analysis (including any quantitative thresholds) so that the baseline weakness being addressed is reproducible and the improvement can be measured against it.

    Authors: We will revise §2 to include the precise models evaluated, the exact prompts used, the observed verification failure rates, and the quantitative thresholds applied in the empirical analysis. These additions will make the baseline reproducible and allow direct measurement of improvements. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces SInternal as a training procedure that fine-tunes LRMs exclusively on safety-verification tasks using expert trajectories, then reports empirical gains in out-of-domain robustness and RL initialization. No equations, fitted parameters, or mathematical derivations appear in the abstract or described framework; the generalization claim is presented as an observed outcome of the training regime rather than a quantity forced by construction from the inputs. No self-citations are invoked to justify uniqueness or to close the argument, and the central premise is externally falsifiable via the reported experiments and released code. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that verification training produces generalizable intrinsic safety understanding. No free parameters or invented entities with independent evidence are specified in the abstract.

axioms (2)
  • domain assumption Ostensibly aligned models lack intrinsic safety understanding and remain vulnerable to adversarial jailbreaks
    Stated as the result of the paper's empirical analysis in the abstract.
  • ad hoc to paper Training exclusively on safety verification tasks induces strong generalization for response safety
    This is the core mechanistic claim of the SInternal framework.
invented entities (1)
  • SInternal framework no independent evidence
    purpose: Internalize safety specifications through verification training
    Newly proposed training procedure without external falsifiable evidence cited in the abstract.

pith-pipeline@v0.9.0 · 5500 in / 1394 out tokens · 58014 ms · 2026-05-12T01:46:05.453988+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

