Recognition: unknown
Jailbroken Frontier Models Retain Their Capabilities
Pith reviewed 2026-05-09 19:55 UTC · model grok-4.3
The pith
The strongest jailbreaks cause almost no capability loss in frontier models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Jailbroken frontier models retain their capabilities. The performance tax from jailbreaking scales inversely with model capability: Haiku 4.5 drops 33.1 percent on average, while Opus 4.6 at maximum thinking effort drops only 7.7 percent. Reasoning-heavy tasks degrade more than knowledge-recall tasks across all models. Boundary Point Jailbreaking, currently the strongest jailbreak against deployed classifiers, achieves near-perfect classifier evasion with near-zero degradation on safeguarded models.
What carries the argument
The jailbreak tax: the measured drop in benchmark accuracy when a model is given a jailbreak prompt instead of a normal prompt.
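A minimal sketch of that definition, in Python, with hypothetical accuracy numbers chosen to echo the paper's reported averages; the review does not say whether the paper's percentages are absolute or relative drops, so the sketch assumes a relative drop.

```python
# Sketch of the jailbreak-tax definition above. All accuracies are
# hypothetical; the paper may report absolute rather than relative drops.

def jailbreak_tax(acc_normal: float, acc_jailbroken: float) -> float:
    """Relative drop in benchmark accuracy under a jailbreak prompt."""
    if acc_normal <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_normal - acc_jailbroken) / acc_normal

# Hypothetical accuracies on one benchmark, normal vs. jailbroken prompt.
baseline = {"haiku-4.5": 0.72, "opus-4.6": 0.91}
jailbroken = {"haiku-4.5": 0.48, "opus-4.6": 0.84}

for model, acc in baseline.items():
    print(f"{model}: tax = {jailbreak_tax(acc, jailbroken[model]):.1%}")
    # haiku-4.5: tax = 33.3%, opus-4.6: tax = 7.7%
```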
Load-bearing premise
The five chosen benchmarks and 28 tested jailbreaks are representative enough to generalize the inverse-scaling result to real-world tasks and attacks.
What would settle it
A new frontier model or wider task set in which advanced jailbreaks still produce large average benchmark drops would falsify the inverse-scaling claim.
Figures
original abstract
As language model safeguards become more robust, attackers are pushed toward developing increasingly complex jailbreaks. Prior work has found that this complexity imposes a "jailbreak tax" that degrades the target model's task performance. We show that this tax scales inversely with model capability and that the most advanced jailbreaks effectively yield no reduction in model capabilities. Evaluating 28 jailbreaks on five benchmarks across Claude models ranging in capability from Haiku 4.5 to Opus 4.6, we find Haiku 4.5 loses an average of 33.1% on benchmark performance when jailbroken, while Opus 4.6 at max thinking effort loses only 7.7%. We also observe that across all models, reasoning-heavy tasks display considerably more degradation than knowledge-recall tasks. Finally, Boundary Point Jailbreaking, currently the strongest jailbreak against deployed classifiers, achieves near-perfect classifier evasion with near-zero degradation across safeguarded models. We recommend that safety cases for frontier models should not rely on a meaningful capability degradation from jailbreaks.
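The abstract scores a jailbreak on two axes at once: classifier evasion and capability degradation. The sketch below shows one way such a pair of numbers could be computed from per-sample records; the records, field names, and counts are hypothetical illustrations, not the paper's actual schema or data.

```python
# Sketch: scoring a jailbreak on the abstract's two axes, classifier
# evasion and capability degradation. All records are hypothetical.
from dataclasses import dataclass

@dataclass
class Sample:
    flagged: bool   # did the deployed classifier block the response?
    correct: bool   # did the response solve the benchmark task?

def evasion_rate(samples: list[Sample]) -> float:
    return sum(not s.flagged for s in samples) / len(samples)

def accuracy(samples: list[Sample]) -> float:
    return sum(s.correct for s in samples) / len(samples)

# Hypothetical 100-sample jailbroken run against a safeguarded model.
jailbroken_run = (
    [Sample(flagged=False, correct=True)] * 93
    + [Sample(flagged=False, correct=False)] * 5
    + [Sample(flagged=True, correct=False)] * 2
)
baseline_accuracy = 0.95  # hypothetical accuracy without the jailbreak

print(f"evasion: {evasion_rate(jailbroken_run):.0%}")                     # 98%
print(f"degradation: {baseline_accuracy - accuracy(jailbroken_run):.1%}") # 2.0%
```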
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the performance degradation ('jailbreak tax') from jailbreaks on frontier language models scales inversely with model capability, such that the most advanced jailbreaks produce near-zero capability loss. It supports this with evaluations of 28 jailbreaks on five benchmarks across the Claude 4.x family (Haiku 4.5 to Opus 4.6), reporting average drops of 33.1% for Haiku 4.5 versus 7.7% for Opus 4.6 at maximum thinking effort, greater degradation on reasoning-heavy tasks than knowledge-recall tasks, and near-zero degradation for Boundary Point Jailbreaking while achieving strong classifier evasion. The authors conclude that safety cases should not rely on meaningful capability degradation from jailbreaks.
Significance. If the inverse-scaling result holds beyond the tested distribution, the work is significant for AI safety because it provides concrete empirical counter-evidence to the assumption that jailbreaks inherently trade off capability, potentially shifting evaluation practices away from counting on such degradation. The concrete quantitative deltas across a capability ladder and the identification of a near-zero-tax jailbreak are useful data points for the field.
major comments (2)
- [Abstract] The central inverse-scaling claim and the 'near-zero degradation' conclusion for Boundary Point Jailbreaking rest on performance deltas measured across only five benchmarks and 28 jailbreaks. The manuscript provides no justification for benchmark selection, no per-benchmark breakdowns, and no discussion of coverage for agentic, long-context, or tool-use capabilities, areas plausibly more affected given the larger drops observed on reasoning-heavy tasks. This directly limits generalization.
- [Abstract] The reported averages (33.1% and 7.7%) and the inverse relationship are given without error bars, run counts, statistical tests, or controls for prompt variation and benchmark selection, preventing assessment of whether the observed scaling is robust or an artifact of the test distribution.
minor comments (2)
- [Abstract] The term 'max thinking effort' is used without definition or reference to a specific section describing the inference configuration.
- [Abstract] The abstract would benefit from a brief statement of the exact five benchmarks and a high-level categorization of the 28 jailbreaks to improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly to improve clarity on benchmark details and statistical reporting.
point-by-point responses
- Referee: [Abstract] The central inverse-scaling claim and the 'near-zero degradation' conclusion for Boundary Point Jailbreaking rest on performance deltas measured across only five benchmarks and 28 jailbreaks. The manuscript provides no justification for benchmark selection, no per-benchmark breakdowns, and no discussion of coverage for agentic, long-context, or tool-use capabilities, areas plausibly more affected given the larger drops observed on reasoning-heavy tasks. This directly limits generalization.
Authors: The five benchmarks were selected to cover both knowledge-recall and reasoning tasks, allowing observation of the differential degradation noted in the paper (see Section 3 for details). Per-benchmark breakdowns appear in the appendix. We will revise the abstract to include a concise justification for benchmark selection and add a limitations discussion on generalization, explicitly noting that agentic, long-context, and tool-use capabilities were not evaluated but that the greater impact on reasoning-heavy tasks may indicate relevance to those areas. This does not weaken the inverse-scaling result on the tested distribution. revision: yes
- Referee: [Abstract] The reported averages (33.1% and 7.7%) and the inverse relationship are given without error bars, run counts, statistical tests, or controls for prompt variation and benchmark selection, preventing assessment of whether the observed scaling is robust or an artifact of the test distribution.
Authors: The reported figures are means across the 28 jailbreaks. We will revise the abstract to reference the variance shown in the main results and add error bars to the primary figures. Each jailbreak used a fixed template to control prompt variation, and benchmark selection followed standard practice for capability evaluation. While we did not perform formal statistical tests, the consistent monotonic trend across the model family supports the scaling claim; we will note this explicitly in the revision. revision: yes
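One way the promised revision could report uncertainty and check the monotonic trend is sketched below. The per-jailbreak taxes and the mid-ladder value are invented for illustration; only the 0.331 and 0.077 endpoints come from the paper.

```python
# Sketch of the statistics the referee asks for: a percentile-bootstrap
# confidence interval over per-jailbreak taxes, plus a check that mean
# tax falls monotonically as capability rises. All inputs are
# hypothetical except the 0.331 and 0.077 endpoints from the paper.
import random

def bootstrap_ci(taxes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `taxes`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(taxes, k=len(taxes))) / len(taxes)
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical per-jailbreak taxes for the weakest model (28 jailbreaks).
rng = random.Random(1)
haiku_taxes = [rng.uniform(0.1, 0.6) for _ in range(28)]
print("95% CI for mean tax:", bootstrap_ci(haiku_taxes))

# Monotonic trend across the model ladder; the middle value is invented.
mean_tax = {"haiku-4.5": 0.331, "mid-tier": 0.19, "opus-4.6": 0.077}
vals = list(mean_tax.values())
print("monotone decreasing:", all(a > b for a, b in zip(vals, vals[1:])))
```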
Circularity Check
No circularity; purely empirical measurements with no derivations or self-referential reductions.
full rationale
The paper reports direct benchmark evaluations of 28 jailbreaks on five tasks across Claude model variants (Haiku 4.5 to Opus 4.6). The central claims—that jailbreak tax scales inversely with capability and that Boundary Point Jailbreaking yields near-zero degradation—are stated as observed deltas (e.g., 33.1% average loss for Haiku vs. 7.7% for Opus at max thinking). No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The inverse-scaling statement is an empirical pattern extracted from the measured data, not a quantity defined in terms of itself or forced by self-citation. Prior-work citations on the existence of a jailbreak tax are background and do not carry the load-bearing result. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Benchmark performance scores accurately reflect underlying model capabilities across knowledge-recall and reasoning tasks.
Reference graph
Works this paper leans on
- [1] Alexander Wei, Nika Haghtalab, Jacob Steinhardt. Jailbroken: How Does LLM Safety Training Fail? 2023. arXiv:2307.02483.
- [2] Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023.
- [3] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu. GPT-4 Is Too Smart to Be Safe: Stealthy Chat with LLMs via Cipher. 2023. arXiv:2308.06463.
- [4] Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, Yarin Gal. Boundary Point Jailbreaking of Black-Box... 2026. arXiv:2602.15001.
- [5] Godmode Jailbreak. 2024.
- [6] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. 2025.
- [7] Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks. 2026.
- [8] All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language. 2025.
- [9]
- [10] The Jailbreak Tax: How Useful are Your Jailbreak Outputs? 2025.
- [11]
- [12] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. 2023. arXiv:2311.12022.
- [13] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, et al.
- [14] Jasper G. et al. Virology Capabilities Test. 2025.
- [15] Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, Samuel G. Rodriques. 2024. arXiv:2407.10362.
- [16] Dan Hendrycks, Mantas Mazeika, Thomas Woodside. An Overview of Catastrophic AI Risks. 2023. arXiv:2306.12001.
- [17] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, et al. Constitutional AI: Harmlessness from AI Feedback. 2022.
- [18] Artificial Intelligence and Biological Misuse: Differentiating Risks of Language Models and Biological Design Tools. 2023.
- [19] Christopher A. Mouton, Caleb Lucas, Ella Guest. The Operational Risks of AI in Large-Scale Biological Attacks.
- [20] Building an Early Warning System for LLM-Aided Biological Threat Creation. 2024.
- [21] Aidan Peppin, Anka Reuel, Stephen Casper, Elliot Jones, Andrew Strait, Usman Anwar, Anurag Agrawal, Sayash Kapoor, Sanmi Koyejo, Marie Pellat, Rishi Bommasani, Nick Frosst, Sara Hooker. The Reality of AI and Biorisk. 2024. arXiv:2412.01946.
- [22] Jailbreaking Black Box Large Language Models in Twenty Queries. 2023.
- [23] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. 2023. arXiv:2312.02119.
- [24] Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. 2023. arXiv:2310.04451.
- [25] Many-shot Jailbreaking. 2024.
- [26] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang. 2023. arXiv:2308.03825.
- [27]
- [28] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. 2023. arXiv:2312.06674.
- [29] Baseline Defenses for Adversarial Attacks Against Aligned Language Models. 2023.
- [30] Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. 2023. arXiv:2310.03684.
- [31] Detecting Language Model Attacks with Perplexity. 2023.
- [32] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. 2024. arXiv:2402.04249.
- [33] Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi. Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions. 2025. arXiv:2502.04322.
- [34] Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, Zhifei Zheng, Min Liu, Zhiyi Yin, Jianping Zhang. Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of... 2025. arXiv:2508.16347.
- [35] Ignore Previous Prompt: Attack Techniques For Language Models. 2022.
- [36] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, Mario Fritz. Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. 2023. arXiv:2302.12173.