Refusal falls off a cliff: How safety alignment fails in reasoning? arXiv preprint arXiv:2510.06036, October 2025
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
Behavioral assurance is structurally unable to verify the latent safety properties demanded by AI governance frameworks enacted 2019-2026.
citing papers explorer
- Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
  TLO is a logit-based diagnostic that visualizes temporal patterns of LLM jailbreak failures on a calibrated 2D plane, distinguishing attacks with identical ASR and enabling early stopping that reduces successful jailbreaks by more than half.
- Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
  Behavioral assurance is structurally unable to verify the latent safety properties demanded by AI governance frameworks enacted 2019-2026.