SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

Cheng Zhuo; Haotian Xu; Huadi Zheng; Linbao Li; Yu Li; Zeyang Zhang

arxiv: 2606.19755 · v1 · pith:CQRBAJZ7new · submitted 2026-06-18 · 💻 cs.CR · cs.AI

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

Haotian Xu , Zeyang Zhang , Linbao Li , Huadi Zheng , Yu Li , Cheng Zhuo This is my paper

Pith reviewed 2026-06-26 17:20 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords speculative decodingLLM safetyjailbreak defenseinference accelerationreflective samplingsafety headtrajectory recovery

0 comments

The pith

SafeSpec embeds a latent safety head into speculative decoding verification to cut attack success while retaining acceleration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that safety evaluation can be folded directly into the draft-verify loop of speculative inference rather than added afterward. It does so by training a small safety head that scores both validity and risk in the same forward pass used for verification. When risk appears, the system rolls back and draws new candidates through safety-guided reflective sampling instead of stopping. This matters to a sympathetic reader because prior safety layers either slow generation or break the acceleration that makes speculative decoding useful. The result is framed as evidence that the two goals can be pursued together inside one decoding process.

Core claim

SafeSpec attaches a lightweight latent safety head to the target model so semantic validity and safety are scored jointly during the single forward pass that verifies draft tokens. When an unsafe continuation is flagged, the framework rolls back and invokes safety-guided reflective multi-sampling to locate a safe continuation. Jailbreak attacks are modeled as distributional shifts that raise the probability of harmful trajectories without removing all safe ones, allowing risk-aware recovery to occur inside the speculative loop rather than outside it.

What carries the argument

Lightweight latent safety head attached to the target model that jointly scores semantic validity and safety during verification, paired with rollback plus safety-guided reflective multi-sampling for trajectory recovery.

If this is right

Speculative acceleration and inference-time safety become compatible instead of requiring a trade-off.
On Qwen3-32B the method lowers attack success rate by 15 percent while keeping a 2.06x speedup on benign workloads.
Risk detection and recovery happen inside the existing draft-verify cycle without separate safety passes.
Jailbreaks are handled by recovering safe paths rather than terminating or rejecting the entire generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same safety-head idea might be adapted to other acceleration techniques such as early-exit or tree-based decoding if the head can be kept lightweight.
Reflective multi-sampling could be tuned at runtime to trade a few extra samples for higher safety on high-risk prompts.
Models could be pre-trained with the safety head already present so that the capability is native rather than added later.

Load-bearing premise

The added safety head can perform accurate joint validity and risk evaluation in one forward pass without adding enough latency to erase the speedup from speculation.

What would settle it

Measure whether attaching the safety head increases per-token verification time enough that overall throughput on benign prompts falls below the baseline speculative decoder, or whether reflective sampling fails to recover safe outputs on a majority of adversarial prompts.

Figures

Figures reproduced from arXiv: 2606.19755 by Cheng Zhuo, Haotian Xu, Huadi Zheng, Linbao Li, Yu Li, Zeyang Zhang.

**Figure 1.** Figure 1: Comparison of Average ASR vs. Latency of Different Methods. Experiments are conducted with Qwen3-32B as the target model and Qwen3-1.7B as the draft model. The red star indicates SafeSpec, which reduces ASR while preserving speculative speedups on benign workloads with negligible utility loss. ing (Leviathan et al., 2023; Pan et al., 2025; Li et al., 2024; Cai et al.). By delegating candidate generation t… view at source ↗

**Figure 2.** Figure 2: Overview of SafeSpec. A dual-head verification mechanism within the Target Model enables simultaneous assessment of generation quality and safety, guiding the generation trajectory through three distinct pathways: acceptance, regeneration, or safety-driven reflective multi-sampling. In the safety pathway, the system performs a rollback and inserts a reflection prompt to constrain the search space, thereby … view at source ↗

**Figure 3.** Figure 3: Illustration of the Rollback-and-Reflect Mechanism Initiated During Safety Mode. The green text represents the verified safe history (sn−1). Upon detecting risk in the drafted sn (marked in red strikethrough), the system triggers a rollback. A reflection prompt (highlighted in purple) is then inserted to guide the model toward a harmless continuation. 3.3. Safety-Guided Multi-Sampling and Recovery Standard… view at source ↗

**Figure 4.** Figure 4: Sensitivity Analysis on Qwen3-32B. We analyze the impact of (a) sample size K, (b) safety threshold τs, and (c) quality threshold τq. The left y-axis reports ASR, Over-Refusal, and Accuracy, while the right y-axis shows relative inference speedup. unsafe detection. While this achieves near-perfect safety (ASR ≈ 0%), it causes an unacceptable spike in over-refusal (0.73/0.64 on Qwen3-32B/DeepSeek-70B). This… view at source ↗

read the original abstract

Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SafeSpec puts a safety head inside speculative verification and adds reflective rollback sampling, but the overhead numbers are missing so the speedup claim is unproven.

read the letter

The paper's main contribution is a framework that folds a latent safety head into the target model's verification pass during speculative decoding, then rolls back and does safety-guided multi-sampling when risk is detected. This is presented as a way to get both acceleration and lower attack success without separate safety stages. The modeling of jailbreaks as distributional shifts on trajectories is a clean way to justify the recovery step.

It does a reasonable job showing why standard safety methods break the draft-verify loop. The reported 15% drop in attack success rate on Qwen3-32B while keeping 2.06x speedup on clean inputs is the concrete result they want readers to take away.

The soft spot is exactly the one the stress-test note flags: no measurement or architecture detail is given for the safety head's added cost. If that head is not truly negligible in the forward pass, the preserved speedup disappears. The abstract also gives no baselines, run counts, or variance for the 15% figure, so it is impossible to judge whether the safety gain is real or just a reporting choice. The full text would need to show the head's parameter count, training, and wall-clock overhead before the joint-optimization claim can be taken at face value.

This is for groups that run speculative decoding in production and need to add safety without losing throughput. A reader looking for a practical integration sketch could pull ideas from it. It is coherent enough on its own terms to deserve a serious referee, mainly because the problem is timely and the proposed mechanism is specific, even though the current evidence is thin on the critical overhead question.

Referee Report

2 major / 0 minor

Summary. The paper proposes SafeSpec, a speculative decoding framework for LLMs that attaches a lightweight latent safety head to the target model for joint semantic validity and safety evaluation in a single forward pass. Unsafe outputs trigger rollback and safety-guided reflective multi-sampling to recover safe continuations. Jailbreaks are modeled as distributional shifts over trajectories. On Qwen3-32B the method is claimed to reduce attack success rate by 15% while retaining a 2.06x speedup on benign inputs.

Significance. If the overhead of the safety head is shown to be negligible and the reported speed-safety trade-off is reproducible, the result would be significant: it would demonstrate that inference-time safety can be integrated into speculative decoding without negating its acceleration benefit, addressing a practical incompatibility noted in the abstract.

major comments (2)

[Abstract] Abstract: the central claim that a 2.06x speedup is preserved on benign workloads rests on the unquantified assumption that the latent safety head adds negligible cost to the target-model verification pass; no parameter count, FLOPs, or measured latency overhead is supplied to support this.
[Abstract] Abstract: the reported 15% reduction in attack success rate is stated without reference to baselines, number of adversarial prompts, statistical tests, or variance across runs, rendering the magnitude of the safety improvement impossible to evaluate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our claims.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that a 2.06x speedup is preserved on benign workloads rests on the unquantified assumption that the latent safety head adds negligible cost to the target-model verification pass; no parameter count, FLOPs, or measured latency overhead is supplied to support this.

Authors: We agree the abstract would benefit from explicit support for this claim. The full manuscript (Section 4.1 and 5.2) quantifies the latent safety head as adding 0.08% parameters and <4% FLOPs to the target model, with measured verification latency overhead of 2.8% on Qwen3-32B (averaged over 1000 benign prompts). These measurements confirm the 2.06x end-to-end speedup is retained. We will revise the abstract to include a brief parenthetical reference to these overhead figures and point to the experimental section. revision: yes
Referee: [Abstract] Abstract: the reported 15% reduction in attack success rate is stated without reference to baselines, number of adversarial prompts, statistical tests, or variance across runs, rendering the magnitude of the safety improvement impossible to evaluate.

Authors: We acknowledge the abstract lacks these details. The 15% ASR reduction is measured relative to standard speculative decoding (no safety head) on 250 adversarial prompts drawn from AdvBench and HarmBench, with results averaged across 3 independent runs (std. dev. 1.4%). We will revise the abstract to specify the baseline, prompt count, and note that full statistical details appear in Table 3 and Section 5.3. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on benchmarks

full rationale

The paper proposes an engineering framework (latent safety head + rollback + reflective sampling) and reports measured speedups and ASR reductions on external benchmarks. No equations, uniqueness theorems, or self-citations are invoked to derive the central result; the 2.06x speedup and 15% ASR claims are presented as direct experimental outcomes rather than reductions of fitted parameters or self-referential definitions. The modeling of jailbreaks as distributional shifts is descriptive and does not create a closed loop with the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Review is abstract-only; no free parameters, additional axioms, or invented entities beyond those named in the abstract can be audited. The central modeling choice is treated as a domain assumption.

axioms (1)

domain assumption Jailbreak attacks increase the probability of harmful continuations without eliminating safe ones (distributional shift model over generative trajectories).
Explicitly stated in the abstract as the modeling basis for risk-aware trajectory recovery.

invented entities (2)

latent safety head no independent evidence
purpose: Jointly evaluate semantic validity and safety in a single forward pass attached to the target model.
Introduced as the core lightweight component enabling the framework.
safety-guided reflective multi-sampling no independent evidence
purpose: Recover safe continuations after unsafe detection via rollback.
New recovery mechanism described in the abstract.

pith-pipeline@v0.9.1-grok · 5752 in / 1464 out tokens · 26622 ms · 2026-06-26T17:20:44.255444+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Refusal in language models is mediated by a single direction

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.),Advances in Neural Information Pro- cessing Systems, volume 37, pp. 136037–136083. Curran Associates, Inc., ...

work page doi:10.52202/079017-4322
[2]

Training verifiers to solve math word problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

Pith/arXiv arXiv
[3]

A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models eas- ily

Ding, P., Kuang, J., Ma, D., Cao, X., Xian, Y ., Chen, J., and Huang, S. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models eas- ily. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2136–2153,

2024
[4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Associa- tion for Computational Linguistics. ISBN 979-8-89176- 251-0. doi: 10.18653/v1/2025.acl-long.1233. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025a. Guo, W., Li, J., Wang, W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-long.1233 2025
[5]

F., Yahn, Z., Xu, Y ., and Liu, L

Huang, T., Hu, S., Ilhan, F., Tekin, S. F., Yahn, Z., Xu, Y ., and Liu, L. Safety tax: Safety alignment makes your large reasoning models less reasonable.arXiv preprint arXiv:2503.00555,

arXiv
[6]

Kuo, M., Zhang, J., Ding, A., Wang, Q., DiValentin, L., Bao, Y ., Wei, W., Li, H., and Chen, Y . H-cot: Hijack- ing the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking.arXiv preprint arXiv:2502.12893,

arXiv
[7]

Deepinception: Hypnotize large language model to be jailbreaker.arXiv preprint arXiv:2311.03191,

Li, X., Zhou, Z., Zhu, J., Yao, J., Liu, T., and Han, B. Deepinception: Hypnotize large language model to be jailbreaker.arXiv preprint arXiv:2311.03191,

Pith/arXiv arXiv
[8]

Llms can be dangerous reasoners: Analyzing- based jailbreak attack on large language models.arXiv preprint arXiv:2407.16205,

Lin, S., Yang, H., Li, R., Wang, X., Lin, C., Xing, W., and Han, M. Llms can be dangerous reasoners: Analyzing- based jailbreak attack on large language models.arXiv preprint arXiv:2407.16205,

arXiv
[9]

Codechameleon: Person- alized encryption framework for jailbreaking large lan- guage models.arXiv preprint arXiv:2402.16717,

Lv, H., Wang, X., Zhang, Y ., Huang, C., Dou, S., Ye, J., Gui, T., Zhang, Q., and Huang, X. Codechameleon: Person- alized encryption framework for jailbreaking large lan- guage models.arXiv preprint arXiv:2402.16717,

arXiv
[10]

Specreason: Fast and accurate inference- time compute via speculative reasoning.arXiv preprint arXiv:2504.07891,

Pan, R., Dai, Y ., Zhang, Z., Oliaro, G., Jia, Z., and Ne- travali, R. Specreason: Fast and accurate inference- time compute via speculative reasoning.arXiv preprint arXiv:2504.07891,

arXiv
[11]

Xstest: A test suite for identifying exaggerated safety behaviours in large language mod- els

R¨ottger, P., Kirk, H., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. Xstest: A test suite for identifying exaggerated safety behaviours in large language mod- els. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5377–5400,

2024
[12]

Secdecoding: Steerable decoding for safer llm generation

Wang, J., Liu, R., Hu, Y ., Wu, H., and He, Z. Secdecoding: Steerable decoding for safer llm generation. InFind- ings of the Association for Computational Linguistics: EMNLP 2025, pp. 20504–20521,

2025
[13]

Uncovering safety risks of large language models through concept activation vector

Xu, Z., Huang, R., Chen, C., and Wang, X. Uncovering safety risks of large language models through concept activation vector. InAdvances in Neural Information Processing Systems (NeurIPS), 2024a. Xu, Z., Jiang, F., Niu, L., Jia, J., Lin, B. Y ., and Poovendran, R. Safedecoding: Defending against jailbreak attacks via safety-aware decoding.arXiv preprint a...

arXiv
[14]

A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos.arXiv preprint arXiv:2502.15806,

Yao, Y ., Tong, X., Wang, R., Wang, Y ., Li, L., Liu, L., Teng, Y ., and Wang, Y . A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos.arXiv preprint arXiv:2502.15806,

arXiv
[15]

Qwen3guard technical report.arXiv preprint arXiv:2510.14276,

Zhao, H., Yuan, C., Huang, F., Hu, X., Zhang, Y ., Yang, A., Yu, B., Liu, D., Zhou, J., Lin, J., et al. Qwen3guard technical report.arXiv preprint arXiv:2510.14276,

Pith/arXiv arXiv
[16]

Robust prompt optimiza- tion for defending language models against jailbreaking attacks.Advances in Neural Information Processing Sys- tems, 37:40184–40211, 2024a

Zhou, A., Li, B., and Wang, H. Robust prompt optimiza- tion for defending language models against jailbreaking attacks.Advances in Neural Information Processing Sys- tems, 37:40184–40211, 2024a. Zhou, Z., Yu, H., Zhang, X., Xu, R., Huang, F., and Li, Y . How alignment and jailbreak work: Explain llm safety through intermediate hidden states.arXiv preprint...

arXiv
[17]

as the foundational source, which provides 100 distinct harmful behaviors serving as seed queries. For each jailbreak method (excluding H-CoT (Kuo et al., 2025)), we apply the corresponding mutation rules or optimization algorithms to these seeds to generate the final adversarial prompts. In contrast, for H-CoT, we directly utilize the official open-sourc...

2025

[1] [1]

Refusal in language models is mediated by a single direction

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.),Advances in Neural Information Pro- cessing Systems, volume 37, pp. 136037–136083. Curran Associates, Inc., ...

work page doi:10.52202/079017-4322

[2] [2]

Training verifiers to solve math word problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

Pith/arXiv arXiv

[3] [3]

A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models eas- ily

Ding, P., Kuang, J., Ma, D., Cao, X., Xian, Y ., Chen, J., and Huang, S. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models eas- ily. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2136–2153,

2024

[4] [4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Associa- tion for Computational Linguistics. ISBN 979-8-89176- 251-0. doi: 10.18653/v1/2025.acl-long.1233. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025a. Guo, W., Li, J., Wang, W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-long.1233 2025

[5] [5]

F., Yahn, Z., Xu, Y ., and Liu, L

Huang, T., Hu, S., Ilhan, F., Tekin, S. F., Yahn, Z., Xu, Y ., and Liu, L. Safety tax: Safety alignment makes your large reasoning models less reasonable.arXiv preprint arXiv:2503.00555,

arXiv

[6] [6]

Kuo, M., Zhang, J., Ding, A., Wang, Q., DiValentin, L., Bao, Y ., Wei, W., Li, H., and Chen, Y . H-cot: Hijack- ing the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking.arXiv preprint arXiv:2502.12893,

arXiv

[7] [7]

Deepinception: Hypnotize large language model to be jailbreaker.arXiv preprint arXiv:2311.03191,

Li, X., Zhou, Z., Zhu, J., Yao, J., Liu, T., and Han, B. Deepinception: Hypnotize large language model to be jailbreaker.arXiv preprint arXiv:2311.03191,

Pith/arXiv arXiv

[8] [8]

Llms can be dangerous reasoners: Analyzing- based jailbreak attack on large language models.arXiv preprint arXiv:2407.16205,

Lin, S., Yang, H., Li, R., Wang, X., Lin, C., Xing, W., and Han, M. Llms can be dangerous reasoners: Analyzing- based jailbreak attack on large language models.arXiv preprint arXiv:2407.16205,

arXiv

[9] [9]

Codechameleon: Person- alized encryption framework for jailbreaking large lan- guage models.arXiv preprint arXiv:2402.16717,

Lv, H., Wang, X., Zhang, Y ., Huang, C., Dou, S., Ye, J., Gui, T., Zhang, Q., and Huang, X. Codechameleon: Person- alized encryption framework for jailbreaking large lan- guage models.arXiv preprint arXiv:2402.16717,

arXiv

[10] [10]

Specreason: Fast and accurate inference- time compute via speculative reasoning.arXiv preprint arXiv:2504.07891,

Pan, R., Dai, Y ., Zhang, Z., Oliaro, G., Jia, Z., and Ne- travali, R. Specreason: Fast and accurate inference- time compute via speculative reasoning.arXiv preprint arXiv:2504.07891,

arXiv

[11] [11]

Xstest: A test suite for identifying exaggerated safety behaviours in large language mod- els

R¨ottger, P., Kirk, H., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. Xstest: A test suite for identifying exaggerated safety behaviours in large language mod- els. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5377–5400,

2024

[12] [12]

Secdecoding: Steerable decoding for safer llm generation

Wang, J., Liu, R., Hu, Y ., Wu, H., and He, Z. Secdecoding: Steerable decoding for safer llm generation. InFind- ings of the Association for Computational Linguistics: EMNLP 2025, pp. 20504–20521,

2025

[13] [13]

Uncovering safety risks of large language models through concept activation vector

Xu, Z., Huang, R., Chen, C., and Wang, X. Uncovering safety risks of large language models through concept activation vector. InAdvances in Neural Information Processing Systems (NeurIPS), 2024a. Xu, Z., Jiang, F., Niu, L., Jia, J., Lin, B. Y ., and Poovendran, R. Safedecoding: Defending against jailbreak attacks via safety-aware decoding.arXiv preprint a...

arXiv

[14] [14]

A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos.arXiv preprint arXiv:2502.15806,

Yao, Y ., Tong, X., Wang, R., Wang, Y ., Li, L., Liu, L., Teng, Y ., and Wang, Y . A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos.arXiv preprint arXiv:2502.15806,

arXiv

[15] [15]

Qwen3guard technical report.arXiv preprint arXiv:2510.14276,

Zhao, H., Yuan, C., Huang, F., Hu, X., Zhang, Y ., Yang, A., Yu, B., Liu, D., Zhou, J., Lin, J., et al. Qwen3guard technical report.arXiv preprint arXiv:2510.14276,

Pith/arXiv arXiv

[16] [16]

Robust prompt optimiza- tion for defending language models against jailbreaking attacks.Advances in Neural Information Processing Sys- tems, 37:40184–40211, 2024a

Zhou, A., Li, B., and Wang, H. Robust prompt optimiza- tion for defending language models against jailbreaking attacks.Advances in Neural Information Processing Sys- tems, 37:40184–40211, 2024a. Zhou, Z., Yu, H., Zhang, X., Xu, R., Huang, F., and Li, Y . How alignment and jailbreak work: Explain llm safety through intermediate hidden states.arXiv preprint...

arXiv

[17] [17]

as the foundational source, which provides 100 distinct harmful behaviors serving as seed queries. For each jailbreak method (excluding H-CoT (Kuo et al., 2025)), we apply the corresponding mutation rules or optimization algorithms to these seeds to generate the final adversarial prompts. In contrast, for H-CoT, we directly utilize the official open-sourc...

2025