How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

2026 · cs.LG · arXiv 2605.06605

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events, such as jailbreaks or successful task completion by an agent, often emerge only after repeated interactions. These events may be rare and, under any feasible computational budget, may remain unobserved. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations needed to trigger the event of interest, but rely on static budget allocation that is inefficient in multi-turn setups. To address this, we introduce Dynamic Allocation via PRojected Optimization (DAPRO), the first theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn LLM interactions. We prove that DAPRO satisfies the budget constraint and provides distribution-free, finite-sample coverage guarantees without requiring the conditional independence between censoring and event times assumed by prior conformal survival approaches. A key theoretical contribution is a novel coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight, yielding provably tighter guarantees than prior work. Furthermore, DAPRO can be employed to obtain unbiased, low-variance estimates of population-level evaluation metrics, such as the jailbreak rate, under limited computing resources. Comprehensive experiments across agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations using LLMs such as Llama 3.1 and Qwen 2.5 demonstrate that DAPRO consistently achieves coverage closer to the nominal level with lower variance than static baselines, while satisfying the budget constraint.
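The abstract's claim about unbiased estimates of metrics such as the jailbreak rate under a limited budget can be illustrated with a generic inverse-probability-weighted (Horvitz-Thompson) estimator. The sketch below is not DAPRO and is not taken from the paper: it assumes each prompt is fully evaluated independently with a known probability, and every name in it (`ipw_event_rate`, `expected_estimate`, `eval_prob`) is hypothetical. It only shows why reweighting by the evaluation probability keeps the estimate unbiased when some prompts are never run.

```python
import itertools
import math


def ipw_event_rate(event_within_horizon, evaluated, eval_prob):
    """Inverse-probability-weighted estimate of the event rate.

    event_within_horizon[i]: 1 if prompt i triggers the event (e.g. a
        jailbreak) within the iteration horizon, else 0. Only consulted
        for prompts that were actually evaluated.
    evaluated[i]: True if prompt i was run (budget was spent on it).
    eval_prob[i]: known probability (> 0) that prompt i is run.
    """
    n = len(event_within_horizon)
    total = 0.0
    for hit, ran, p in zip(event_within_horizon, evaluated, eval_prob):
        if ran:
            total += hit / p  # up-weight prompts that were unlikely to be run
    return total / n


def expected_estimate(event_within_horizon, eval_prob):
    """Exact expectation of the estimator over all evaluation patterns,
    assuming (for this illustration) independent evaluation decisions."""
    n = len(eval_prob)
    expectation = 0.0
    for pattern in itertools.product([False, True], repeat=n):
        pattern_prob = math.prod(
            p if ran else 1.0 - p for ran, p in zip(pattern, eval_prob)
        )
        expectation += pattern_prob * ipw_event_rate(
            event_within_horizon, pattern, eval_prob
        )
    return expectation


# Toy data (made up for illustration): 5 prompts, 3 of which trigger the event.
events = [1, 0, 1, 1, 0]
probs = [0.9, 0.3, 0.5, 0.7, 0.4]
true_rate = sum(events) / len(events)
```

In this toy setting the expected number of evaluated prompts is `sum(probs)`, which plays the role of the budget; `expected_estimate(events, probs)` matches `true_rate` up to floating-point error, which is the unbiasedness property. How the evaluation probabilities should be chosen, and how they adapt across turns, is what the paper's method addresses and is not modeled here.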

representative citing papers

Online Safety Monitoring for LLMs

cs.AI · 2026-07-02 · unverdicted · novelty 3.0

Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.
