The hidden risks of large reasoning models: A safety assessment of R1

Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, Xin Eric Wang · 2025 · arXiv 2502.12659

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

cs.AI · 2025-10-24 · unverdicted · novelty 7.0

Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

An attention-guided RL reward combined with diverse persuasion strategies produces higher attack success rates against large reasoning models than prior jailbreak methods.

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.

Internalizing Safety Understanding in Large Reasoning Models via Verification

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.

Reasoning Structure Matters for Safety Alignment of Reasoning Models

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

cs.CL · 2025-08-06 · unverdicted · novelty 6.0

ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

cs.LG · 2025-09-26

citing papers explorer

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models cs.AI · 2025-10-24 · unverdicted · none · ref 20
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models cs.AI · 2026-05-19 · unverdicted · none · ref 41
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories cs.AI · 2026-05-09 · unverdicted · none · ref 8
Internalizing Safety Understanding in Large Reasoning Models via Verification cs.AI · 2026-05-09 · unverdicted · none · ref 30
Reasoning Structure Matters for Safety Alignment of Reasoning Models cs.AI · 2026-04-21 · unverdicted · none · ref 26
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 99
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments cs.CL · 2025-08-06 · unverdicted · none · ref 12
Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards cs.LG · 2025-09-26 · unreviewed · ref 42
