Online Safety Monitoring for LLMs

Alexander Timans; Christian Naesseth; Eric Nalisnick; Maja Waldron; Metod Jazbec; Mona Schirmer

arxiv: 2607.02510 · v1 · pith:R5HFKZ5Snew · submitted 2026-07-02 · 💻 cs.AI · cs.CL· cs.LG· stat.AP· stat.ML

Online Safety Monitoring for LLMs

Mona Schirmer , Metod Jazbec , Alexander Timans , Christian Naesseth , Maja Waldron , Eric Nalisnick This is my paper

Pith reviewed 2026-07-03 12:50 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LGstat.APstat.ML

keywords LLM safetyonline monitoringrisk controlverifier signalssequential hypothesis testingred teamingmathematical reasoning

0 comments

The pith

A simple threshold on an external verifier signal, calibrated by risk control, matches advanced sequential monitors for LLM safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether online safety monitoring for LLMs can rely on a basic real-time design: take a signal from an external verifier model, apply a threshold to trigger alarms, and set that threshold using risk control methods. On mathematical reasoning and red teaming datasets, this approach performs competitively against more complex monitors that use sequential hypothesis testing. A sympathetic reader would care because reliable real-time alarms could let deployed LLMs flag unsafe generations without requiring heavy internal changes or sophisticated statistical machinery at every step.

Core claim

The paper claims that a simple real-time monitor which turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control, is competitive with more advanced monitors based on sequential hypothesis testing on mathematical reasoning and red teaming datasets.

What carries the argument

The simple thresholding monitor with risk-control calibration that converts an external verifier signal into an alarm decision.

If this is right

Simple monitors can achieve comparable safety performance to sequential methods on the tested tasks.
Risk control offers a practical way to set thresholds for target error rates without complex modeling.
The approach applies across both mathematical reasoning and adversarial red teaming scenarios.
Real-time alarm decisions become feasible using only an external signal and a fixed threshold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the result holds, safety systems could simplify by preferring external verifiers over internal sequential tests.
The method could extend to other generation domains where an external signal is available but full sequential testing is costly.
Deployment would require checking whether the verifier signal distribution remains stable over time.

Load-bearing premise

The external verifier model produces a signal whose distribution allows reliable thresholding and risk-control calibration to control alarm error rates in deployment settings.

What would settle it

A new dataset or deployment scenario where the simple threshold monitor produces alarm error rates that exceed those of sequential hypothesis testing methods or fail to meet the risk-control guarantees.

Figures

Figures reproduced from arXiv: 2607.02510 by Alexander Timans, Christian Naesseth, Eric Nalisnick, Maja Waldron, Metod Jazbec, Mona Schirmer.

**Figure 1.** Figure 1: Monitoring factuality on mathematical reasoning (MATH): CRC and UCB calibrate a single threshold to turn a PRM score into an alarm, yet detect incorrect answers earlier (third row) than e-valuators, which train several density estimators. tive CRC yields higher power than UCB. Turning to detection delay (third row, how quickly the detected sequences are flagged) yields an interesting picture: although the… view at source ↗

**Figure 3.** Figure 3: Signal ablation of token log-probabilities vs. external PRM: False alarm rate, power, and detection delay on Mistral-7B-Instruct (MATH) across target levels ε; solid curves use the Qwen PRM, dashed curves use the per-step minimum token log-probability. The free log-prob signal yields substantially lower power across all four monitors. a second forward pass at every step. A natural question is therefore whe… view at source ↗

**Figure 4.** Figure 4: Monitoring performance when controlling missed detection risk RII (Eq. (7)) instead of false alarm risk RI (Eq. (2)): CRC and UCB control the risk RII well (second column). E-valuators are excluded from this plot as they only allow to control the false alarm rate RI . which guarantees EDcal [RII (λˆCRC)] ≤ ϵ. For control with high probability, let U(λ, n0, δ) be a (1 − δ) upper confidence bound on Rˆ II (λ… view at source ↗

read the original abstract

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Simple threshold monitor competes on the tested datasets but the exchangeability assumption for risk control needs checking against real shifts.

read the letter

The main thing to know is that this paper finds a basic real-time monitor—thresholding an external verifier signal after calibrating the threshold with risk control—performs competitively with sequential hypothesis testing methods on mathematical reasoning and red teaming datasets.

They earn credit for focusing on a practical deployment problem and keeping the method simple. Production teams often prefer approaches that avoid heavy online modeling, and showing that risk control can deliver usable alarm decisions without extra complexity is a useful data point. The choice of datasets aligns with common safety and reasoning evaluation settings.

The abstract supplies no numbers, error bars, or dataset sizes, so the claim of competitiveness is hard to weigh precisely until the full results appear. The central construction depends on the external verifier producing a signal that supports reliable thresholding and calibration. More importantly, risk-control methods typically require exchangeability between the calibration set and deployment stream. LLM outputs frequently show topic drift, user-driven changes, and temporal dependence; if the paper does not test or bound performance under those conditions, the finite-sample guarantees may not carry over to actual online use. That is the main soft spot, and it is worth a direct check rather than a minor note.

The work is aimed at engineers building safety layers for deployed models who need straightforward, maintainable monitors. It engages the relevant literature on monitoring and conformal-style calibration without obvious internal contradictions. I would bring the full version to a reading group once the experimental details and any shift-robustness checks are visible.

It deserves peer review because the empirical comparison is direct and the practical angle is relevant, even if revisions will likely be needed to strengthen the online guarantees.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a simple real-time safety monitor for LLMs that converts a signal from an external verifier model into an alarm via thresholding, with the threshold selected by risk-control calibration. It reports that this monitor is competitive with sequential hypothesis testing monitors on mathematical reasoning and red teaming datasets.

Significance. If the empirical competitiveness holds under the stated calibration, the result would indicate that a lightweight, non-sequential monitor can achieve comparable alarm performance without explicitly modeling temporal structure, which could reduce implementation complexity for online LLM safety systems. The reliance on risk control for finite-sample error-rate bounds is a potential strength if the calibration data satisfy the required assumptions.

major comments (2)

[Abstract] Abstract: the claim of competitiveness is stated without any quantitative results, error bars, dataset sizes, or calibration details, so the support for the central claim cannot be assessed from the provided information.
[Methods] Methods (risk-control calibration procedure): the finite-sample guarantee on alarm error rates rests on exchangeability between the calibration set and deployment streams. LLM outputs routinely exhibit topic drift, user-induced shifts, and temporal dependence; the manuscript does not address whether or how these violations are mitigated, which directly affects whether the simple monitor remains competitive with sequential methods that explicitly handle the online regime.

minor comments (1)

Specify the precise form of the verifier signal (e.g., logit, probability, or binary verdict) and the external model used to produce it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the risk-control assumptions. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of competitiveness is stated without any quantitative results, error bars, dataset sizes, or calibration details, so the support for the central claim cannot be assessed from the provided information.

Authors: We agree that the abstract would be strengthened by including quantitative support. In the revised version we will expand the abstract to report key metrics (e.g., false-alarm rates and detection delays with standard errors), the sizes of the mathematical-reasoning and red-teaming datasets, and the calibration-set size together with the target risk level used for threshold selection. revision: yes
Referee: [Methods] Methods (risk-control calibration procedure): the finite-sample guarantee on alarm error rates rests on exchangeability between the calibration set and deployment streams. LLM outputs routinely exhibit topic drift, user-induced shifts, and temporal dependence; the manuscript does not address whether or how these violations are mitigated, which directly affects whether the simple monitor remains competitive with sequential methods that explicitly handle the online regime.

Authors: The referee correctly identifies that the risk-control finite-sample bounds require exchangeability. Our manuscript presents an empirical comparison on the given datasets and does not claim that the bounds continue to hold under arbitrary non-stationarity. We will add an explicit limitations paragraph in the methods and discussion sections that (i) states the exchangeability assumption, (ii) notes that LLM streams can exhibit topic drift and temporal dependence, and (iii) clarifies that the reported competitiveness is an empirical observation rather than a guarantee that survives arbitrary violations. This will also contextualize why sequential methods may retain advantages in strongly non-stationary regimes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison with external components

full rationale

The paper presents an empirical study of a simple thresholding monitor calibrated via risk control, compared against sequential hypothesis testing on two datasets. No derivation chain, equations, or self-citations are invoked to derive the main result; the method treats the external verifier signal and risk-control procedure as independent inputs, and competitiveness is shown via direct experiments rather than any reduction to fitted parameters or prior self-citations. The central claim remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5616 in / 927 out tokens · 41291 ms · 2026-07-03T12:50:34.034019+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 31 canonical work pages · 16 internal anchors

[1]

Angelopoulos and Stephen Bates and Adam Fisch and Lihua Lei and Tal Schuster , title =

Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control.arXiv preprint arXiv:2208.02814,

work page arXiv
[2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

URL https://assets. anthropic.com/m/99128ddd009bdcb/ Claude-Haiku-4-5-System-Card.pdf. Bai, Y ., Jones, A., Ndousse, K., Askell, A., Chen, A., Das- Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with rein- forcement learning from human feedback.arXiv preprint arXiv:2204.05862,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y ., Madry, A., Zaremba, W., Pachocki, J., and Farhi, D. Mon- itoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

On the Opportunities and Risks of Foundation Models

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosse- lut, A., Brunskill, E., et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Lookback lens: Detecting and mitigat- ing contextual hallucinations in large language models using only attention maps

Chuang, Y .-S., Qiu, L., Hsieh, C.-Y ., Krishna, R., Kim, Y ., and Glass, J. Lookback lens: Detecting and mitigat- ing contextual hallucinations in large language models using only attention maps. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 1419–1436,

2024
[6]

Calibrated predictive lower bounds on time-to-unsafe- sampling in llms.arXiv preprint arXiv:2506.13593,

Davidov, H., Feldman, S., Freidkin, G., and Romano, Y . Calibrated predictive lower bounds on time-to-unsafe- sampling in llms.arXiv preprint arXiv:2506.13593,

work page arXiv
[7]

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

Feldman, S. and Romano, Y . How many iterations to jail- break? dynamic budget allocation for multi-turn llm eval- uation.arXiv preprint arXiv:2605.06605,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y ., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

AI control: Improving safety despite intentional subversion.arXiv preprint arXiv:2312.06942, 2024

Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. Ai control: Improving safety despite intentional subversion. arXiv preprint arXiv:2312.06942,

work page arXiv
[10]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

6 Online Safety Monitoring for LLMs Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y ., Tontchev, M., Hu, Q., Fuller, B., Testug- gine, D., et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Mistral 7B

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arxiv.arXiv preprint arXiv:2310.06825, 10:3,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

arXiv preprint arXiv:2406.18510 (2024)

URLhttps://arxiv.org/abs/2406.18510. Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516,

work page arXiv
[13]

Kaddour, J., Patel, S., Dovonon, G., Richter, L., Minervini, P., and Kusner, M. J. Agentic uncertainty reveals agentic overconfidence.arXiv preprint arXiv:2602.06948,

work page arXiv
[14]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Korbak, T., Balesni, M., Barnes, E., Bengio, Y ., Benton, J., Bloom, J., Chen, M., Cooney, A., Dafoe, A., Dra- gan, A., et al. Chain of thought monitorability: A new and fragile opportunity for ai safety.arXiv preprint arXiv:2507.11473,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Kossen, J., Han, J., Razzak, M., Schut, L., Malik, S., and Gal, Y . Semantic entropy probes: Robust and cheap hallucination detection in llms.arXiv preprint arXiv:2406.15927,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Efficiently controlling multiple risks with pareto test- ing.arXiv preprint arXiv:2210.07913,

Laufer-Goldshtein, B., Fisch, A., Barzilay, R., and Jaakkola, T. Efficiently controlling multiple risks with pareto test- ing.arXiv preprint arXiv:2210.07913,

work page arXiv
[17]

From judgment to interference: Early stopping llm harmful outputs via streaming content monitoring.arXiv preprint arXiv:2506.09996,

Li, Y ., Sheng, Q., Yang, Y ., Zhang, X., and Cao, J. From judgment to interference: Early stopping llm harmful outputs via streaming content monitoring.arXiv preprint arXiv:2506.09996,

work page arXiv
[18]

Let’s verify step by step

Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. InInternational Conference on Learning Representations, volume 2024, pp. 39578–39601,

2024
[19]

Selfcheckgpt: Zero- resource black-box hallucination detection for genera- tive large language models

Manakul, P., Liusie, A., and Gales, M. Selfcheckgpt: Zero- resource black-box hallucination detection for genera- tive large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pp. 9004–9017,

2023
[20]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. Harm- bench: A standardized evaluation framework for auto- mated red teaming and robust refusal.arXiv preprint arXiv:2402.04249,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

and Hashimoto, T

Mohri, C. and Hashimoto, T. Language models with conformal factuality guarantees.arXiv preprint arXiv:2402.10978,

work page arXiv
[22]

com/o3-mini-system-card-feb10.pdf

URL https://cdn.openai. com/o3-mini-system-card-feb10.pdf. Podkopaev, A. and Ramdas, A. Tracking the risk of a deployed model and detecting harmful distribution shifts. arXiv preprint arXiv:2110.06177,

work page arXiv
[23]

H., Jaakkola, T

Quach, V ., Fisch, A., Schuster, T., Yala, A., Sohn, J. H., Jaakkola, T. S., and Barzilay, R. Conformal language modeling.arXiv preprint arXiv:2306.10193,

work page arXiv
[24]

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Sadhuka, S., Prinster, D., Fannjiang, C., Scalia, G., Regev, A., and Wang, H. E-valuator: Reliable agent veri- fiers with sequential hypothesis testing.arXiv preprint arXiv:2512.03109,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

A., and Nalisnick, E

Schirmer, M., Jazbec, M., Naesseth, C. A., and Nalisnick, E. Monitoring risks in test-time adaptation.arXiv preprint arXiv:2507.08721,

work page arXiv
[26]

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Sharma, M., Tong, M., Mu, J., Wei, J., Kruthoff, J., Good- friend, S., Ong, E., Peng, A., Agarwal, R., Anil, C., et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming.arXiv preprint arXiv:2501.18837,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Timans, A., Verma, R., Nalisnick, E., and Naesseth, C. A. On continuous monitoring of risk violations under un- known shift.arXiv preprint arXiv:2506.16416,

work page arXiv
[28]

Poskitt, Jun Sun, and Jiali Wei

Wang, H., Poskitt, C. M., Sun, J., and Wei, J. Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking.arXiv preprint arXiv:2508.00500,

work page arXiv
[29]

Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Wang, X., Suresh, A., Zhang, A., More, R., Jurayj, W., Van Durme, B., Farajtabar, M., Khashabi, D., and Nalis- nick, E. Conformal thinking: Risk control for reasoning on a compute budget.arXiv preprint arXiv:2602.03814,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Ethical and social risks of harm from Language Models

Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. Ethical and social risks of harm from language models.arXiv preprint arXiv:2112.04359,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

arXiv preprint arXiv:2310.11986 , year=

Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hen- dricks, L. A., Mateos-Garcia, J., Bergman, S., Kay, J., Griffin, C., Bariach, B., et al. Sociotechnical safety evaluation of generative ai systems.arXiv preprint arXiv:2310.11986,

work page arXiv
[32]

Thought calibration: Efficient and confident test-time scaling

Wu, M., Zhou, C., Bates, S., and Jaakkola, T. Thought calibration: Efficient and confident test-time scaling. In Proceedings of the 2025 Conference on Empirical Meth- ods in Natural Language Processing, pp. 14302–14316,

2025
[33]

Pac reasoning: Controlling the performance loss for efficient reasoning.arXiv preprint arXiv:2510.09133,

Zeng, H., Huang, J., Jing, B., Wei, H., and An, B. Pac reasoning: Controlling the performance loss for efficient reasoning.arXiv preprint arXiv:2510.09133,

work page arXiv
[34]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Zeng, W., Liu, Y ., Mullins, R., Peran, L., Fernandez, J., Harkous, H., Narasimhan, K., Proud, D., Kumar, P., Radharapu, B., et al. Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

The Rise of AI Companions: Interaction with AI Companions and Psychological Well-being

Zhang, Y ., Zhao, D., Hancock, J. T., Kraut, R., and Yang, D. The rise of ai companions: how human- chatbot relationships influence well-being.arXiv preprint arXiv:2506.12605, 2025a. 8 Online Safety Monitoring for LLMs Zhang, Z., Zheng, C., Wu, Y ., Zhang, B., Lin, R., Yu, B., Liu, D., Zhou, J., and Lin, J. The lessons of developing process reward models ...

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Related Work LLM MonitoringVarious efforts (Weidinger et al., 2021, 2023; Bommasani et al.,

9 Online Safety Monitoring for LLMs A. Related Work LLM MonitoringVarious efforts (Weidinger et al., 2021, 2023; Bommasani et al.,

2021
[37]

have stressed the importance of monitoring LLMs at deployment time, for factuality (Manakul et al., 2023; Chuang et al., 2024), content moderation (Inan et al., 2023; Zeng et al.,

2023
[38]

or against obfuscation (Korbak et al., 2025; Baker et al., 2025; Greenblatt et al., 2023). For content moderation, for instance, monitoring tools are typically taking the form of safeguard classifiers that predict the harmfulness of a user prompt or an LLM output (Inan et al., 2023; Sharma et al., 2025; Mazeika et al., 2024; Markov et al., 2023). Recent w...

2025
[39]

Such oversight models can provide strong signals that can be leveraged within a statistical framework, as we discuss in this work

proposes online monitors that are designed for unfolding output, the setting we consider here. Such oversight models can provide strong signals that can be leveraged within a statistical framework, as we discuss in this work. Statistical Monitoring FrameworksDetecting when a model degrades at inference time or exceeds a pre-defined risk level has traditio...

2021
[40]

Relatedly, e-processes (Ramdas et al., 2023; Ramdas & Wang,

labels, as well as under model adaptation (Schirmer et al., 2025; Bar et al., 2024). Relatedly, e-processes (Ramdas et al., 2023; Ramdas & Wang,

2025
[41]

have been used to track evidence of risk violations over time (Timans et al., 2025; Prinster et al., 2025). In the context of LLMs, conformal prediction has been used to provide statistical trustworthiness, albeit in a classic offline setting (Cherian et al., 2024; Mohri & Hashimoto, 2024; Quach et al., 2023; Gui et al., 2024). A related line of work on o...

2025
[42]

Most closely related to our work are agentic oversight methods (Wang et al., 2025), particularly Sadhuka et al

uses conformal survival analysis to construct PAC-type bounds on the time-to-unsafe-sampling of a given prompt. Most closely related to our work are agentic oversight methods (Wang et al., 2025), particularly Sadhuka et al. (2025), which tackles the same online monitoring setting. B. Missed Detection Risk We complement the false alarm risk of § 3.3 with i...

2025
[43]

(7)) instead of false alarm risk RI (Eq

For control in expectation, the threshold is ˆλCRC := min λ∈Λ : n0 n0 + 1 ˆRII (λ;D cal) + 1 n0 + 1 ≤ϵ ,(8) 10 Online Safety Monitoring for LLMs Figure 4.Monitoring performance when controlling missed detection risk RII (Eq. (7)) instead of false alarm risk RI (Eq. (2)): CRC and UCB control the risk RII well (second column). E-valuators are excluded from ...

2025
[44]

incorrect

, i.e. incorrect. For a new LLM trajectory, the procedure evaluates a statistic Mt online, and rejects the null once Mt crosses a threshold cα. The target guarantee isanytime-validfalse-alarm control, or PH0 ∃t≥1 :M t ≥c α ≤α, meaning that a true successful trajectory is incorrectly flagged at any time with probability at most α, for any unknown sequence ...

2025

[1] [1]

Angelopoulos and Stephen Bates and Adam Fisch and Lihua Lei and Tal Schuster , title =

Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control.arXiv preprint arXiv:2208.02814,

work page arXiv

[2] [2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

URL https://assets. anthropic.com/m/99128ddd009bdcb/ Claude-Haiku-4-5-System-Card.pdf. Bai, Y ., Jones, A., Ndousse, K., Askell, A., Chen, A., Das- Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with rein- forcement learning from human feedback.arXiv preprint arXiv:2204.05862,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y ., Madry, A., Zaremba, W., Pachocki, J., and Farhi, D. Mon- itoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

On the Opportunities and Risks of Foundation Models

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosse- lut, A., Brunskill, E., et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Lookback lens: Detecting and mitigat- ing contextual hallucinations in large language models using only attention maps

Chuang, Y .-S., Qiu, L., Hsieh, C.-Y ., Krishna, R., Kim, Y ., and Glass, J. Lookback lens: Detecting and mitigat- ing contextual hallucinations in large language models using only attention maps. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 1419–1436,

2024

[6] [6]

Calibrated predictive lower bounds on time-to-unsafe- sampling in llms.arXiv preprint arXiv:2506.13593,

Davidov, H., Feldman, S., Freidkin, G., and Romano, Y . Calibrated predictive lower bounds on time-to-unsafe- sampling in llms.arXiv preprint arXiv:2506.13593,

work page arXiv

[7] [7]

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

Feldman, S. and Romano, Y . How many iterations to jail- break? dynamic budget allocation for multi-turn llm eval- uation.arXiv preprint arXiv:2605.06605,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y ., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

AI control: Improving safety despite intentional subversion.arXiv preprint arXiv:2312.06942, 2024

Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. Ai control: Improving safety despite intentional subversion. arXiv preprint arXiv:2312.06942,

work page arXiv

[10] [10]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

6 Online Safety Monitoring for LLMs Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y ., Tontchev, M., Hu, Q., Fuller, B., Testug- gine, D., et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Mistral 7B

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arxiv.arXiv preprint arXiv:2310.06825, 10:3,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

arXiv preprint arXiv:2406.18510 (2024)

URLhttps://arxiv.org/abs/2406.18510. Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516,

work page arXiv

[13] [13]

Kaddour, J., Patel, S., Dovonon, G., Richter, L., Minervini, P., and Kusner, M. J. Agentic uncertainty reveals agentic overconfidence.arXiv preprint arXiv:2602.06948,

work page arXiv

[14] [14]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Korbak, T., Balesni, M., Barnes, E., Bengio, Y ., Benton, J., Bloom, J., Chen, M., Cooney, A., Dafoe, A., Dra- gan, A., et al. Chain of thought monitorability: A new and fragile opportunity for ai safety.arXiv preprint arXiv:2507.11473,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Kossen, J., Han, J., Razzak, M., Schut, L., Malik, S., and Gal, Y . Semantic entropy probes: Robust and cheap hallucination detection in llms.arXiv preprint arXiv:2406.15927,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Efficiently controlling multiple risks with pareto test- ing.arXiv preprint arXiv:2210.07913,

Laufer-Goldshtein, B., Fisch, A., Barzilay, R., and Jaakkola, T. Efficiently controlling multiple risks with pareto test- ing.arXiv preprint arXiv:2210.07913,

work page arXiv

[17] [17]

From judgment to interference: Early stopping llm harmful outputs via streaming content monitoring.arXiv preprint arXiv:2506.09996,

Li, Y ., Sheng, Q., Yang, Y ., Zhang, X., and Cao, J. From judgment to interference: Early stopping llm harmful outputs via streaming content monitoring.arXiv preprint arXiv:2506.09996,

work page arXiv

[18] [18]

Let’s verify step by step

Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. InInternational Conference on Learning Representations, volume 2024, pp. 39578–39601,

2024

[19] [19]

Selfcheckgpt: Zero- resource black-box hallucination detection for genera- tive large language models

Manakul, P., Liusie, A., and Gales, M. Selfcheckgpt: Zero- resource black-box hallucination detection for genera- tive large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pp. 9004–9017,

2023

[20] [20]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. Harm- bench: A standardized evaluation framework for auto- mated red teaming and robust refusal.arXiv preprint arXiv:2402.04249,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

and Hashimoto, T

Mohri, C. and Hashimoto, T. Language models with conformal factuality guarantees.arXiv preprint arXiv:2402.10978,

work page arXiv

[22] [22]

com/o3-mini-system-card-feb10.pdf

URL https://cdn.openai. com/o3-mini-system-card-feb10.pdf. Podkopaev, A. and Ramdas, A. Tracking the risk of a deployed model and detecting harmful distribution shifts. arXiv preprint arXiv:2110.06177,

work page arXiv

[23] [23]

H., Jaakkola, T

Quach, V ., Fisch, A., Schuster, T., Yala, A., Sohn, J. H., Jaakkola, T. S., and Barzilay, R. Conformal language modeling.arXiv preprint arXiv:2306.10193,

work page arXiv

[24] [24]

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Sadhuka, S., Prinster, D., Fannjiang, C., Scalia, G., Regev, A., and Wang, H. E-valuator: Reliable agent veri- fiers with sequential hypothesis testing.arXiv preprint arXiv:2512.03109,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

A., and Nalisnick, E

Schirmer, M., Jazbec, M., Naesseth, C. A., and Nalisnick, E. Monitoring risks in test-time adaptation.arXiv preprint arXiv:2507.08721,

work page arXiv

[26] [26]

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Sharma, M., Tong, M., Mu, J., Wei, J., Kruthoff, J., Good- friend, S., Ong, E., Peng, A., Agarwal, R., Anil, C., et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming.arXiv preprint arXiv:2501.18837,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Timans, A., Verma, R., Nalisnick, E., and Naesseth, C. A. On continuous monitoring of risk violations under un- known shift.arXiv preprint arXiv:2506.16416,

work page arXiv

[28] [28]

Poskitt, Jun Sun, and Jiali Wei

Wang, H., Poskitt, C. M., Sun, J., and Wei, J. Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking.arXiv preprint arXiv:2508.00500,

work page arXiv

[29] [29]

Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Wang, X., Suresh, A., Zhang, A., More, R., Jurayj, W., Van Durme, B., Farajtabar, M., Khashabi, D., and Nalis- nick, E. Conformal thinking: Risk control for reasoning on a compute budget.arXiv preprint arXiv:2602.03814,

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

Ethical and social risks of harm from Language Models

Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. Ethical and social risks of harm from language models.arXiv preprint arXiv:2112.04359,

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

arXiv preprint arXiv:2310.11986 , year=

Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hen- dricks, L. A., Mateos-Garcia, J., Bergman, S., Kay, J., Griffin, C., Bariach, B., et al. Sociotechnical safety evaluation of generative ai systems.arXiv preprint arXiv:2310.11986,

work page arXiv

[32] [32]

Thought calibration: Efficient and confident test-time scaling

Wu, M., Zhou, C., Bates, S., and Jaakkola, T. Thought calibration: Efficient and confident test-time scaling. In Proceedings of the 2025 Conference on Empirical Meth- ods in Natural Language Processing, pp. 14302–14316,

2025

[33] [33]

Pac reasoning: Controlling the performance loss for efficient reasoning.arXiv preprint arXiv:2510.09133,

Zeng, H., Huang, J., Jing, B., Wei, H., and An, B. Pac reasoning: Controlling the performance loss for efficient reasoning.arXiv preprint arXiv:2510.09133,

work page arXiv

[34] [34]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Zeng, W., Liu, Y ., Mullins, R., Peran, L., Fernandez, J., Harkous, H., Narasimhan, K., Proud, D., Kumar, P., Radharapu, B., et al. Shieldgemma: Generative ai content moderation based on gemma.arXiv preprint arXiv:2407.21772,

work page internal anchor Pith review Pith/arXiv arXiv

[35] [35]

The Rise of AI Companions: Interaction with AI Companions and Psychological Well-being

Zhang, Y ., Zhao, D., Hancock, J. T., Kraut, R., and Yang, D. The rise of ai companions: how human- chatbot relationships influence well-being.arXiv preprint arXiv:2506.12605, 2025a. 8 Online Safety Monitoring for LLMs Zhang, Z., Zheng, C., Wu, Y ., Zhang, B., Lin, R., Yu, B., Liu, D., Zhou, J., and Lin, J. The lessons of developing process reward models ...

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

Related Work LLM MonitoringVarious efforts (Weidinger et al., 2021, 2023; Bommasani et al.,

9 Online Safety Monitoring for LLMs A. Related Work LLM MonitoringVarious efforts (Weidinger et al., 2021, 2023; Bommasani et al.,

2021

[37] [37]

have stressed the importance of monitoring LLMs at deployment time, for factuality (Manakul et al., 2023; Chuang et al., 2024), content moderation (Inan et al., 2023; Zeng et al.,

2023

[38] [38]

or against obfuscation (Korbak et al., 2025; Baker et al., 2025; Greenblatt et al., 2023). For content moderation, for instance, monitoring tools are typically taking the form of safeguard classifiers that predict the harmfulness of a user prompt or an LLM output (Inan et al., 2023; Sharma et al., 2025; Mazeika et al., 2024; Markov et al., 2023). Recent w...

2025

[39] [39]

Such oversight models can provide strong signals that can be leveraged within a statistical framework, as we discuss in this work

proposes online monitors that are designed for unfolding output, the setting we consider here. Such oversight models can provide strong signals that can be leveraged within a statistical framework, as we discuss in this work. Statistical Monitoring FrameworksDetecting when a model degrades at inference time or exceeds a pre-defined risk level has traditio...

2021

[40] [40]

Relatedly, e-processes (Ramdas et al., 2023; Ramdas & Wang,

labels, as well as under model adaptation (Schirmer et al., 2025; Bar et al., 2024). Relatedly, e-processes (Ramdas et al., 2023; Ramdas & Wang,

2025

[41] [41]

have been used to track evidence of risk violations over time (Timans et al., 2025; Prinster et al., 2025). In the context of LLMs, conformal prediction has been used to provide statistical trustworthiness, albeit in a classic offline setting (Cherian et al., 2024; Mohri & Hashimoto, 2024; Quach et al., 2023; Gui et al., 2024). A related line of work on o...

2025

[42] [42]

Most closely related to our work are agentic oversight methods (Wang et al., 2025), particularly Sadhuka et al

uses conformal survival analysis to construct PAC-type bounds on the time-to-unsafe-sampling of a given prompt. Most closely related to our work are agentic oversight methods (Wang et al., 2025), particularly Sadhuka et al. (2025), which tackles the same online monitoring setting. B. Missed Detection Risk We complement the false alarm risk of § 3.3 with i...

2025

[43] [43]

(7)) instead of false alarm risk RI (Eq

For control in expectation, the threshold is ˆλCRC := min λ∈Λ : n0 n0 + 1 ˆRII (λ;D cal) + 1 n0 + 1 ≤ϵ ,(8) 10 Online Safety Monitoring for LLMs Figure 4.Monitoring performance when controlling missed detection risk RII (Eq. (7)) instead of false alarm risk RI (Eq. (2)): CRC and UCB control the risk RII well (second column). E-valuators are excluded from ...

2025

[44] [44]

incorrect

, i.e. incorrect. For a new LLM trajectory, the procedure evaluates a statistic Mt online, and rejects the null once Mt crosses a threshold cα. The target guarantee isanytime-validfalse-alarm control, or PH0 ∃t≥1 :M t ≥c α ≤α, meaning that a true successful trajectory is incorrectly flagged at any time with probability at most α, for any unknown sequence ...

2025