arxiv: 2605.09070 · v1 · submitted 2026-05-09 · 💻 cs.CR · cs.AI


Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

Carsten Maple , Abhishek Kumar , Riya Tapwal


Pith reviewed 2026-05-12 02:31 UTC · model grok-4.3


keywords jailbreak attacks · attack success rate · distributional evaluation · variant sensitivity · union coverage · LLM security · adversarial prompting · parameterised attacks


The pith

Jailbreak attack evaluations must report how success rates vary across parameter settings rather than only the single best result.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that reporting only the highest attack success rate from a few parameter settings fails to capture the full risk of jailbreak attacks on language models. Because success rates fluctuate widely depending on choices like prompts, templates, and conversation rounds, the best-case figure alone does not reveal how often an attack typically works or how much of the unsafe prompt space it can reach. To address this, the authors introduce the Variant Sensitivity Measure to show deviation from average performance and Union Coverage to measure the combined reach across all tested variants. These measures matter for defenders because they indicate whether a reported attack poses a consistent threat or only succeeds in rare configurations. Empirical tests on attacks like PAIR and bijection demonstrate that union coverage often reaches far more prompts than any single configuration.

Core claim

The central claim is that single-configuration attack success rates are insufficient for evaluating parameterised jailbreak attacks because they discard information on typical performance and missed attack surface. Instead, evaluations should include distributional measures: the Variant Sensitivity Measure (VSM), which quantifies how much the best ASR deviates from the mean across variants, and Union Coverage (UC), the fraction of prompts that succeed under at least one configuration. When these are reported, as shown in examples where best ASR is 69% but UC is 88%, defenders gain a clearer picture of the threat.
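A minimal sketch of how the two measures could be computed, assuming the evaluation results are stored as a boolean matrix with one row per variant and one column per prompt. The function names and the matrix layout are hypothetical, the data is a toy example, and VSM is taken here as best ASR minus mean ASR, which matches the verbal description above but may differ from the paper's exact formula (it could, for instance, be normalised).

```python
from typing import List

def asr_per_variant(success: List[List[bool]]) -> List[float]:
    """ASR of each variant: the fraction of prompts that variant jailbreaks."""
    return [sum(row) / len(row) for row in success]

def variant_sensitivity(success: List[List[bool]]) -> float:
    """Assumed form of VSM: best ASR minus mean ASR over the tested variants."""
    rates = asr_per_variant(success)
    return max(rates) - sum(rates) / len(rates)

def union_coverage(success: List[List[bool]]) -> float:
    """UC: fraction of prompts jailbroken by at least one tested variant."""
    n_prompts = len(success[0])
    covered = sum(any(row[j] for row in success) for j in range(n_prompts))
    return covered / n_prompts

# Toy data (not from the paper): 3 variants evaluated on 4 prompts.
success = [
    [True, True, False, False],
    [False, True, True, False],
    [False, False, False, True],
]

rates = asr_per_variant(success)   # [0.5, 0.5, 0.25]
best = max(rates)                  # 0.5
vsm = variant_sensitivity(success)
uc = union_coverage(success)       # 1.0

print(f"best ASR = {best:.2f}, VSM = {vsm:.3f}, UC = {uc:.2f}")
```

In this toy case no single variant succeeds on more than half the prompts, yet every prompt is broken by some variant, which is the kind of gap the two measures are meant to expose.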

What carries the argument

The Variant Sensitivity Measure (VSM) and Union Coverage (UC), which together provide a distributional view of attack success rates across parameter variants rather than a single peak value.

If this is right

Comparisons between different jailbreak attacks become more reliable when both peak and average performance are shown.
Defenders can better prioritize resources against attacks whose variants together cover most of the unsafe prompt space.
Future papers should enumerate as many configurations as compute allows and publish the resulting VSM and UC values.
Attack designers can focus on parameters that reduce sensitivity and increase coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This standard could push benchmark suites to include automated sweeps over parameter spaces rather than hand-picked settings.
It implies that real-world red-teaming should treat an attack as a family of variants rather than one fixed method.
Safety evaluations of new models might need to include union coverage estimates to avoid underestimating exposure.

Load-bearing premise

The specific variants tested in the empirical section are representative enough of the full parameter space to make VSM and UC reliable indicators for defenders in general.

What would settle it

An experiment that tests a much larger set of variants for the same attacks and finds that union coverage stays close to the single best ASR across all tested models would show the new measures add little information.
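As a rough illustration of the quantity such an experiment would track, the sketch below computes the gap between UC and best ASR for the figures reported in the abstract. The `coverage_gap` helper and the `tolerance` threshold are assumptions made for illustration; the paper does not define a cut-off for when the gap counts as negligible.

```python
# Figures reported in the abstract, in percentage points: (best ASR, UC).
reported = {
    ("PAIR", "Mistral-7B"): (69, 88),
    ("PAIR", "Qwen3-0.6B"): (75, 93),
    ("bijection", "Mistral-7B"): (81, 100),
}

def coverage_gap(best_asr: int, uc: int) -> int:
    """Percentage points of prompt coverage hidden by reporting only the best variant."""
    return uc - best_asr

# Hypothetical threshold: below this gap, UC would add little beyond best ASR.
tolerance = 2

gaps = {key: coverage_gap(*values) for key, values in reported.items()}
for (attack, model), gap in gaps.items():
    verdict = "adds information" if gap > tolerance else "adds little"
    print(f"{attack} on {model}: UC exceeds best ASR by {gap} points ({verdict})")
```

For the reported results the gap is 18 to 19 points in every case; the settling experiment would ask whether that gap persists or shrinks toward zero as the variant set grows.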

Original abstract

Many jailbreak attack research papers report attack success rates for a limited number of parameter settings, even though there are many combinations of parameter settings that could be used. Further, when new jailbreak papers are released, they often benchmark results against single configurations of existing attacks. This position paper argues such practices are fundamentally insufficient for characterising the threat posed by parameterised jailbreak attacks, and comparing attacks. Most jailbreak attacks expose multiple internal parameters, system prompt templates, conversation rounds, cipher dispersion, teaching shots, and ASR varies substantially across these parameters. Reporting only the best-case configuration discards two pieces of information that defenders genuinely need: how typical that performance is across the variant space, and how much of the attack surface is missed by selecting a single variant. We propose two new measures for jailbreak attacks: the Variant Sensitivity Measure (VSM) and Union Coverage (UC). VSM quantifies how far the best reported ASR deviates from the mean ASR across the tested variant space, UC is the total fraction of prompts resulting in unsafe responses across all tested configurations. We empirically demonstrate the importance of these measures using two attack families across three open-source target models. For PAIR, the best template reaches 69% ASR on Mistral-7B and 75% on Qwen3-0.6B, while UC rises to 88% and 93%, respectively. For bijection on Mistral-7B, the best variant reaches 81% ASR, but the 36-variant union covers 100% of HarmBench-100 prompts. We argue that distributional reporting, publishing VSM alongside ASR and enumerating variant coverage as fully as compute allows, should become the new minimum standard for parameterised jailbreak evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows with concrete numbers that single best ASR understates risk and proposes two simple metrics to fix it, though variant selection remains ad hoc.


The main point is that jailbreak papers often report only the strongest single configuration, which hides both how variable performance is and how much of the prompt space gets covered when you try more variants. The authors back this with real runs: on Mistral-7B, PAIR's best template hits 69% ASR but Union Coverage reaches 88%, and bijection's best variant reaches 81% while the union of 36 variants covers 100% of the test prompts. That gap is the useful observation here.

They introduce the Variant Sensitivity Measure to quantify deviation from the mean and Union Coverage to capture the union of successes, both defined directly over the tested set. These are new and address a real reporting gap in the literature they cite. The empirical section is short but sufficient to illustrate the issue on three open models and two attack families.

The soft spot is that both metrics are only as good as the variant set the authors pick. The paper gives no sampling protocol or bounds on the space of templates, rounds, or shots, so different teams could report very different VSM and UC values for the same attack simply by choosing different subsets. The authors recommend enumerating as many variants as compute allows, which is reasonable, but it leaves the measures non-standardized for now.

This is aimed at anyone running or reviewing jailbreak evaluations in AI safety. It is a methodological suggestion rather than a new attack or proof, but the argument holds up on the data shown and the recommendation is practical. It deserves peer review so the community can discuss how to make variant reporting consistent.

Referee Report

2 major / 2 minor

Summary. The paper claims that reporting only the single best attack success rate (ASR) for jailbreak attacks is insufficient, as it discards information on performance variability across parameter settings and the fraction of the attack surface covered. It introduces two new measures—Variant Sensitivity Measure (VSM), quantifying deviation of best ASR from the mean across variants, and Union Coverage (UC), the union of successful attacks over all tested configurations—and demonstrates their value with empirical results on PAIR (best ASR 69% vs. UC 88% on Mistral-7B) and bijection (best ASR 81% vs. UC 100% over 36 variants on Mistral-7B), arguing that distributional reporting should become the new minimum standard.

Significance. If adopted, the measures would give defenders clearer insight into typical attack performance and missed coverage, supported by concrete numerical gaps in the empirical examples. The position is strengthened by the reproducible-style demonstrations on open models and specific attacks, though its broader impact hinges on addressing variant selection.

major comments (2)

[Empirical demonstration] Empirical section (as described in abstract): the reported VSM and UC values (e.g., PAIR on Mistral-7B and bijection over 36 variants) depend on an ad-hoc choice of variants with no protocol given for sampling or bounding the space of templates, rounds, ciphers, or shots. This makes the measures non-comparable across papers and potentially as selective as single-configuration ASR, undermining the claim that they should be a minimum standard.
[Proposal] Proposal of VSM and UC: these quantities are defined only over the researcher-chosen variant set, yet the manuscript provides no guidance on ensuring representativeness or exhaustiveness (e.g., how many variants suffice for reliable UC), leaving the central recommendation vulnerable to the same researcher discretion it criticizes.
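The concern in both major comments can be made concrete with a small simulation. The sketch below is illustrative only: the success matrix is synthetic (random, with a fixed seed), not data from the paper, and VSM is assumed to be best ASR minus mean ASR. It computes UC and VSM for every 3-variant subset of an 8-variant pool to show how much the reported values can move with the choice of subset.

```python
import random
from itertools import combinations

def union_coverage(rows):
    """Fraction of prompts jailbroken by at least one variant in `rows`."""
    n_prompts = len(rows[0])
    return sum(any(r[j] for r in rows) for j in range(n_prompts)) / n_prompts

def vsm(rows):
    """Assumed form of VSM: best ASR minus mean ASR over the given variants."""
    rates = [sum(r) / len(r) for r in rows]
    return max(rates) - sum(rates) / len(rates)

# Synthetic pool: 8 variants x 50 prompts, each cell succeeds with probability 0.3.
rng = random.Random(0)
n_variants, n_prompts = 8, 50
full = [[rng.random() < 0.3 for _ in range(n_prompts)] for _ in range(n_variants)]

# Every way a team might pick 3 of the 8 variants to report on.
subset_stats = []
for idx in combinations(range(n_variants), 3):
    rows = [full[i] for i in idx]
    subset_stats.append((union_coverage(rows), vsm(rows)))

ucs = [u for u, _ in subset_stats]
vsms = [v for _, v in subset_stats]
print(f"UC over all {n_variants} variants: {union_coverage(full):.2f}")
print(f"UC over 3-variant subsets: min {min(ucs):.2f}, max {max(ucs):.2f}")
print(f"VSM over 3-variant subsets: min {min(vsms):.3f}, max {max(vsms):.3f}")
```

Because UC can only grow as variants are added, any subset gives a lower bound on the UC of the full pool; the spread between the minimum and maximum subset values is one way to quantify the researcher discretion the referee describes.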

minor comments (2)

[Abstract] Abstract: states results across 'three open-source target models' but only names Mistral-7B and Qwen3-0.6B; specify the third model and its results for completeness.
[Proposal] Notation: VSM and UC are introduced without explicit formulas in the provided abstract; including compact definitions (e.g., VSM as max_ASR - mean_ASR) would improve clarity.
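Compact definitions of the kind requested in the last comment might look like the following. The notation ($V$ for the tested variant set, $P$ for the prompt set, $s_v(p)$ for the success indicator) is introduced here for illustration and is not taken from the paper; writing VSM as an unnormalised difference follows the referee's example and is an assumption about the paper's exact formula.

```latex
\begin{align*}
\mathrm{ASR}(v) &= \frac{1}{|P|} \sum_{p \in P} s_v(p), \qquad s_v(p) \in \{0, 1\}, \\
\mathrm{VSM} &= \max_{v \in V} \mathrm{ASR}(v) \;-\; \frac{1}{|V|} \sum_{v \in V} \mathrm{ASR}(v), \\
\mathrm{UC} &= \frac{1}{|P|} \,\bigl|\{\, p \in P : \exists\, v \in V,\ s_v(p) = 1 \,\}\bigr|.
\end{align*}
```

Under these definitions $\mathrm{UC} \ge \max_{v} \mathrm{ASR}(v) \ge \frac{1}{|V|}\sum_{v} \mathrm{ASR}(v)$, so both VSM and the gap between UC and best ASR are non-negative.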

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. Our manuscript is a position paper that uses concrete empirical examples to illustrate why single-configuration ASR is insufficient and to motivate distributional reporting via VSM and UC. We address the two major comments below, clarifying the scope of our claims and indicating revisions that will be incorporated to improve guidance on variant selection while preserving the core argument.


Referee: [Empirical demonstration] Empirical section (as described in abstract): the reported VSM and UC values (e.g., PAIR on Mistral-7B and bijection over 36 variants) depend on an ad-hoc choice of variants with no protocol given for sampling or bounding the space of templates, rounds, ciphers, or shots. This makes the measures non-comparable across papers and potentially as selective as single-configuration ASR, undermining the claim that they should be a minimum standard.

Authors: We agree that the specific variant sets used in the demonstrations (standard PAIR templates and 36 bijection cipher variants) are chosen to reflect common literature practices rather than following a formalized sampling protocol. The examples serve to show that even modest, non-exhaustive exploration reveals substantial gaps between best ASR and UC. To improve comparability and reduce selectivity concerns, we will add a dedicated subsection in the revised manuscript that articulates principles for variant selection: systematic coverage of major parameter dimensions (templates, rounds, ciphers, shots), explicit reporting of the total number of variants tested, and encouragement to expand the set until marginal gains in UC diminish. We do not claim the reported numbers are canonical, but the added guidance will make the measures more transparent and less vulnerable to the criticism of ad-hoc choice.

revision: partial

Referee: [Proposal] Proposal of VSM and UC: these quantities are defined only over the researcher-chosen variant set, yet the manuscript provides no guidance on ensuring representativeness or exhaustiveness (e.g., how many variants suffice for reliable UC), leaving the central recommendation vulnerable to the same researcher discretion it criticizes.

Authors: The referee correctly observes that VSM and UC are computed relative to whatever variant set the researcher elects to test. Our position is that requiring these quantities to be reported at all already reduces discretion compared with reporting only the single best ASR. To directly address the need for guidance, the revision will expand the proposal section with practical recommendations: (1) enumerate the parameter axes varied and the rationale for each, (2) report the size of the variant set and the resulting UC, and (3) note that complete exhaustiveness is computationally intractable but that researchers should aim for coverage sufficient to demonstrate stability in the reported statistics. These additions will make the recommendation more actionable without requiring the position paper to solve the open problem of optimal variant sampling.

revision: yes
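The stopping criterion mentioned in the rebuttal, expanding the variant set until marginal gains in UC diminish, could be operationalised roughly as follows. This is a sketch under stated assumptions: the function names, the `min_gain` and `patience` parameters, and the toy data are all hypothetical, and the rebuttal does not commit to any particular rule.

```python
def cumulative_uc(success):
    """UC after each successive variant is added, in the order given."""
    n_prompts = len(success[0])
    covered = set()
    curve = []
    for row in success:
        covered.update(j for j, hit in enumerate(row) if hit)
        curve.append(len(covered) / n_prompts)
    return curve

def variants_until_plateau(curve, min_gain=0.01, patience=2):
    """Number of variants tested once UC gained less than `min_gain`
    for `patience` consecutive additions; None if that never happens."""
    flat = 0
    for k in range(1, len(curve)):
        gain = curve[k] - curve[k - 1]
        flat = flat + 1 if gain < min_gain else 0
        if flat >= patience:
            return k + 1
    return None

# Toy data (not from the paper): 5 variants evaluated on 5 prompts.
success = [
    [True, True, False, False, False],
    [False, True, True, False, False],
    [True, False, False, True, False],
    [False, True, False, False, False],
    [True, False, True, False, False],
]

curve = cumulative_uc(success)            # [0.4, 0.6, 0.8, 0.8, 0.8]
stop_at = variants_until_plateau(curve)   # 5
print(curve, stop_at)
```

The shape of the curve depends on the order in which variants are added, while the final UC does not, so a report would need to state the ordering (or average over random orderings) alongside the plateau point.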

Circularity Check

0 steps flagged

No circularity: VSM and UC are direct definitions over observed ASRs with no self-referential reduction


The paper introduces VSM (deviation of best ASR from mean across variants) and UC (union of unsafe prompts over configurations) as explicit new quantities computed from the same ASR data already collected for single-config reporting. These definitions stand alone without equations that equate back to fitted parameters or prior author results by construction. The empirical sections simply tabulate concrete ASR values for PAIR and bijection variants on three models; no derivation step reduces a claimed prediction to an input that was itself defined from the output. No self-citations appear as load-bearing premises, and the central argument is a methodological recommendation rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that ASR varies substantially across parameter settings and that the two new measures capture information defenders need; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: Jailbreak attacks expose multiple internal parameters across which ASR varies substantially
This is the explicit premise used to argue that single-configuration reporting is insufficient.

invented entities (2)

Variant Sensitivity Measure (VSM) · no independent evidence
purpose: Quantifies how far the best reported ASR deviates from the mean ASR across the tested variant space
Newly defined metric introduced in the paper.

Union Coverage (UC) · no independent evidence
purpose: Total fraction of prompts resulting in unsafe responses across all tested configurations
Newly defined metric introduced in the paper.

