Refusal falls off a cliff: How safety alignment fails in reasoning? arXiv preprint arXiv:2510.06036, October 2025

Qingyu Yin, Chak Tou Leong, Wenxuan Huang, Wenjie Li, Linyi Yang, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu · 2025 · arXiv 2510.06036

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

TLO is a logit-based diagnostic that visualizes temporal patterns of LLM jailbreak failures on a calibrated 2D plane, distinguishing attacks with identical ASR and enabling early stopping that reduces successful jailbreaks by more than half.

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

Behavioral assurance is structurally unable to verify the latent safety properties demanded by AI governance frameworks enacted 2019-2026.

citing papers explorer

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
cs.AI · 2026-05-28 · unverdicted · none · ref 28
Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
cs.LG · 2026-05-14 · unverdicted · none · ref 61
