Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

Avidan Shah; Jannik Brinkmann; Rico Angell

arxiv: 2605.28467 · v1 · pith:3DAI3M6Pnew · submitted 2026-05-27 · 💻 cs.LG

Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

Avidan Shah , Jannik Brinkmann , Rico Angell This is my paper

Pith reviewed 2026-06-29 14:15 UTC · model grok-4.3

classification 💻 cs.LG

keywords activation consistency trainingjailbreak defensereasoning modelsadversarial robustnesslinear steeringprompt injectionchain-of-thoughtsafety training

0 comments

The pith

Activation consistency training defends reasoning models from adaptive jailbreaks by aligning activations on clean and adversarial prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning models' extended chain-of-thought creates new openings for jailbreaks and prompt injections that manipulate their internal thinking process. The paper evaluates consistency training, which forces identical model behavior on clean prompts and their adversarial rewrites. Output-level consistency provides some protection, but activation-level consistency training (ACT) shows greater robustness to adaptive attacks designed to bypass the defense. ACT needs only self-supervised pairs of clean and wrapped prompts. It also demonstrates that the resulting refusal behavior corresponds to a roughly linear shift in activation space at the assistant turn.

Core claim

ACT enforces consistency at the activation level between clean prompts and adversarial versions, leading to greater robustness against adaptive attacks compared to output-level methods. After training, a single steering direction in activation space at the assistant-turn boundary controls refusal behavior with minimal impact on benign inputs. The defense holds even when the model's chain-of-thought is substituted with a compliant trace from the undefended base model, causing the model to refuse prefilled jailbreaks.

What carries the argument

Activation Consistency Training (ACT), a fine-tuning objective that minimizes differences in internal activations between clean prompts and their adversarial rewrites at the assistant turn.

If this is right

ACT is competitive with other training-based defenses while using only self-supervised pairs of clean and wrapped prompts.
A single linear steering direction can be recovered after ACT training to control refusal on reasoning models.
ACT remains robust to adaptive attacks even when the chain-of-thought is replaced with a compliant trace from the undefended base model.
The defense against jailbreaks is encoded as a roughly linear shift in activation space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Targeting internal representations rather than final outputs may improve generalization in safety fine-tuning for other model behaviors.
The linear steering property aligns with existing techniques for interpretable model control and could extend to additional safety properties beyond refusal.
The method's reliance on activation alignment suggests it could be combined with other representation-level interventions for layered defenses.

Load-bearing premise

The self-supervised pairs of clean and wrapped prompts used during training sufficiently represent the distribution of adaptive attacks encountered at deployment.

What would settle it

An experiment finding an adaptive attack that elicits harmful outputs from an ACT-trained model while leaving activation patterns at the assistant boundary unchanged from the clean case would falsify the robustness claim.

Figures

Figures reproduced from arXiv: 2605.28467 by Avidan Shah, Jannik Brinkmann, Rico Angell.

**Figure 1.** Figure 1: Activation Consistency Training (ACT) at the transformer level. A clean prompt pclean and its adversarially-rewritten counterpart pwrapped are processed by two copies of the same transformer. ACT minimizes the squared L2 distance between the two models’ hidden states at the shared suffix positions (highlighted in amber), at every transformer layer ℓ. By pinning the wrapped suffix activations to their clean… view at source ↗

**Figure 2.** Figure 2: A clean harmful request and its jailbreak [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Static prompt-injection defense win rate ( [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Adaptive prompt-injection defense win rate ( [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Utility for PI-defense checkpoints across five models, on OPI clean accuracy (left) and [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Jailbreak Defense Win Rate (1 - ASR%, higher is better) across five reasoning models [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: MMLU accuracy (%, higher is better) for jailbreak-defense checkpoints. We use a similar adaptive-attacker strategy as in §4.3, modified to fit the harmfulbehavior setting. The adaptive attacker reaches near 100% ASR on every undefended baseline. Against ACT it drops to near zero in three models, while BCT also reduces attack effectiveness but much less reliably. Both defenses are approximately equal on … view at source ↗

**Figure 8.** Figure 8: Qualitative example of compliant-CoT prefilling on Qwen3-8B. Both defended models [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Steering with dACT on Qwen3-8B. Adding the direction to the base model on jailbreaks progressively induces refusal (blue), while subtracting it from the ACT model progressively restores compliance (red). On benign prompts, the direction causes minimal false refusal at moderate α, rising to ∼21% at the strongest intervention [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Heuristic refusal rate under compliant chain-of-thought prefilling across five reasoning [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

read the original abstract

As LLMs gain stronger reasoning capabilities, their extended chain-of-thought introduces new degrees of complexity for defending against adversarial jailbreaks and prompt injection. We study consistency training, a family of fine-tuning objectives that enforce identical behavior on clean prompts and adversarial rewrites, and evaluate its two main variants, output-level (BCT) and activation-level (ACT), across five reasoning models. We formulate both methods as a prompt injection defense and find ACT to be competitive with other training-based defenses while requiring only self-supervised pairs of clean and wrapped prompts. Our experiments also generalize both techniques within the jailbreak setting, demonstrating that ACT remains more robust to adaptive attacks. We also provide mechanistic evidence that ACT's defense against jailbreaks is encoded as a roughly linear shift in activation space at the assistant-turn boundary. After ACT training, we can recover a single steering direction that controls refusal on reasoning models with minimal effect on benign inputs. We find that ACT remains robust even when the model's chain-of-thought is replaced with a compliant trace from the undefended base model, pivoting to refuse prefilled jailbreaks. Together, these results suggest that supervising internal representations is a surprisingly effective and interpretable approach to various forms of safety training in reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ACT shows activation-level consistency training can improve robustness to adaptive jailbreaks on reasoning models, with a recoverable linear steering direction, but the generalization claim hinges on untested representativeness of the training pairs.

read the letter

The key point is that this paper demonstrates an activation-level version of consistency training (ACT) that holds up better than output-level consistency training against adaptive attacks on five reasoning models, including when the chain-of-thought is replaced by a compliant trace. They also recover a single steering direction that controls refusal with little impact on normal inputs, and the defense appears as a roughly linear shift at the assistant-turn boundary.

What stands out as new is the focus on activation-level supervision rather than just output consistency, plus the mechanistic evidence and the specific test with swapped CoT traces. The self-supervised setup using clean and wrapped prompt pairs is straightforward and avoids needing labeled adversarial data. The experiments across multiple models and the comparison to other training-based defenses give a reasonable empirical picture of competitiveness.

The main soft spot is the representativeness assumption. The robustness claims, including to adaptive attacks, depend on the particular wrapping strategies in training being close enough to what an adversary could optimize against at test time. The abstract does not detail attack constructions, statistical tests, or how they ensured the held-out attacks were truly outside the training distribution, so it is not yet clear how far the generalization extends. Minor issues include the lack of reported variance or exclusion criteria for the results.

This work is aimed at researchers in LLM safety who care about practical defenses for reasoning models and some level of mechanistic understanding. Readers working on prompt injection or jailbreak mitigation would find the empirical comparisons and steering result useful to build on.

It deserves peer review. The experimental framing is clear enough and the adaptive-attack results are worth checking in detail, even if the methods section needs expansion.

Referee Report

3 major / 2 minor

Summary. The paper introduces Activation Consistency Training (ACT), an activation-level fine-tuning objective that enforces identical behavior on clean prompts and adversarial rewrites (formulated as prompt injection), and compares it to output-level consistency training (BCT) across five reasoning models. It reports that ACT is competitive with other training-based defenses, remains more robust to adaptive attacks (including when chain-of-thought is replaced by compliant traces from the base model), and encodes its defense as a roughly linear shift in activation space at the assistant-turn boundary, from which a single steering vector can be recovered to control refusal with minimal impact on benign inputs.

Significance. If the empirical robustness and mechanistic claims hold, the work demonstrates that supervising internal representations via self-supervised consistency objectives can yield competitive, interpretable safety improvements in reasoning models that generalize within the jailbreak setting, including under adaptive attacks and CoT replacement; the provision of mechanistic evidence for a recoverable linear direction is a notable strength.

major comments (3)

[Abstract] Abstract and the generalization claims in the experiments: the assertion that ACT remains more robust to adaptive attacks (including CoT replacement) rests on the self-supervised clean/wrapped prompt pairs being representative of the distribution of adaptive attacks an adversary could optimize against the consistency objective or recovered steering direction; no argument or coverage experiment is provided to establish this representativeness.
[Experiments] Experiments across five models: the central robustness comparisons lack reported details on statistical tests, exact adaptive attack constructions (e.g., how attacks are optimized against ACT), or exclusion criteria for the held-out attacks, making it difficult to assess whether the reported superiority is load-bearing or sensitive to implementation choices.
[Mechanistic analysis] Mechanistic analysis of the linear shift: while the recovery of a single steering direction is presented as evidence, the section does not quantify the effect sizes on benign inputs versus jailbreaks or demonstrate that the direction is stable across the five models and attack types.

minor comments (2)

[Methods] Notation for BCT versus ACT should be introduced with explicit equations early in the methods to clarify the output-level versus activation-level distinction.
[Figures] Figure captions for activation visualizations should include axis labels, scale information, and the exact layer at which the assistant-turn boundary is measured.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: [Abstract] Abstract and the generalization claims in the experiments: the assertion that ACT remains more robust to adaptive attacks (including CoT replacement) rests on the self-supervised clean/wrapped prompt pairs being representative of the distribution of adaptive attacks an adversary could optimize against the consistency objective or recovered steering direction; no argument or coverage experiment is provided to establish this representativeness.

Authors: The self-supervised consistency objective is designed to be general and not tied to specific attack forms, allowing it to generalize to adaptive attacks that optimize against the trained model. However, we acknowledge the need for more explicit justification. In the revision, we will add a paragraph in the discussion section arguing for the representativeness based on the diversity of prompt injection methods used in training and testing, and note that the CoT replacement experiment further supports generalization. A full adversarial coverage study is left for future work. revision: partial
Referee: [Experiments] Experiments across five models: the central robustness comparisons lack reported details on statistical tests, exact adaptive attack constructions (e.g., how attacks are optimized against ACT), or exclusion criteria for the held-out attacks, making it difficult to assess whether the reported superiority is load-bearing or sensitive to implementation choices.

Authors: We agree that these details are important for reproducibility and assessment. We will expand the experimental section to include: (1) statistical tests such as Wilcoxon signed-rank tests for robustness comparisons, (2) detailed descriptions of the adaptive attack optimization process against the ACT loss, and (3) explicit criteria for held-out attacks (e.g., attacks not seen during training pairs). revision: yes
Referee: [Mechanistic analysis] Mechanistic analysis of the linear shift: while the recovery of a single steering direction is presented as evidence, the section does not quantify the effect sizes on benign inputs versus jailbreaks or demonstrate that the direction is stable across the five models and attack types.

Authors: We will enhance the mechanistic analysis by adding quantitative measures: effect sizes will be reported as the difference in average activation norms or refusal probabilities between benign and jailbreak inputs before and after steering. Stability will be demonstrated by computing average cosine similarities of the recovered steering vectors across the five models and different attack categories. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical robustness evaluated on held-out attacks

full rationale

The paper reports results from fine-tuning (BCT/ACT) on self-supervised clean/wrapped prompt pairs followed by evaluation on held-out adaptive attacks and mechanistic steering experiments. No equations, derivations, or self-citation chains reduce the robustness claims to quantities fitted from the same data used to assert success. The central claims rest on external experimental benchmarks rather than self-referential definitions or predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is empirical and relies on standard supervised fine-tuning assumptions plus the premise that consistency on self-supervised prompt pairs transfers to unseen adaptive attacks.

axioms (1)

domain assumption Gradient-based optimization of the consistency loss on self-supervised clean/adversarial pairs produces a model whose internal representations generalize to adaptive jailbreaks.
Invoked when claiming ACT remains robust to adaptive attacks after training.

pith-pipeline@v0.9.1-grok · 5748 in / 1350 out tokens · 34222 ms · 2026-06-29T14:15:29.372154+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 20 canonical work pages · 8 internal anchors

[1]

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction, 2024. URL https://arxiv.org/abs/ 2406.11717

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Jailbreaking Black Box Large Language Models in Twenty Queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/ 2310.08419

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Struq: Defending against prompt injection with structured queries, 2024

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. Struq: Defending against prompt injection with structured queries, 2024. URLhttps://arxiv.org/abs/2402.06363

work page arXiv 2024
[4]

Secalign: Defending against prompt injection with preference optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, CCS ’25, page 2833–2847. ACM, November 2025. doi: 10.1145/3719027.3744836. URL http://dx.do...

work page doi:10.1145/3719027.3744836 2025
[5]

Meta secalign: A secure foundation llm against prompt injection attacks, 2026

Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. Meta secalign: A secure foundation llm against prompt injection attacks, 2026. URLhttps://arxiv.org/abs/2507.02735

work page arXiv 2026
[6]

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, and Florian Tramèr. Learning to inject: Automated prompt injection via reinforcement learning, 2026. URLhttps://arxiv.org/abs/2602.05746

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Bowman, Julian Michael, Ethan Perez, and Miles Turpin

James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, and Miles Turpin. Bias-augmented consistency training reduces biased reasoning in chain-of-thought.arXiv preprint arXiv:2403.05518, 2024

work page arXiv 2024
[8]

Melody Y . Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative alignment: Reasoning enables safer language models, 2025. URL https://arxiv.org/abs/2412.16339

work page arXiv 2025
[9]

Elson, and Rohin Shah

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, and Rohin Shah. Consistency training helps stop sycophancy and jailbreaks, 2025. URLhttps://arxiv.org/abs/2510.27062

work page arXiv 2025
[10]

H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking, 2025

Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking, 2025. URL https://arxiv.org/abs/2502.12893. 10

work page arXiv 2025
[11]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. InUSENIX Security Symposium, 2024

2024
[12]

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. When models outthink their safety: Unveiling and mitigating self-jailbreak in large reasoning models, 2026. URLhttps://arxiv.org/abs/2510.21285

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V . Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr. The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections, 2025. UR...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Rapid response: Mitigating llm jailbreaks with a few examples, 2024

Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, and Mrinank Sharma. Rapid response: Mitigating llm jailbreaks with a few examples, 2024. URLhttps://arxiv.org/abs/2411.07494

work page arXiv 2024
[15]

Promptarmor: Simple yet effective prompt injection defenses, 2025

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, and Dawn Song. Promptarmor: Simple yet effective prompt injection defenses, 2025. URL https://arxiv.org/abs/2507.15219

work page arXiv 2025
[16]

A StrongREJECT for Empty Jailbreaks

Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A strongreject for empty jailbreaks, 2024. URL https://arxiv.org/abs/2402.10260

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github. com/tatsu-lab/stanford_alpaca, 2023

2023
[18]

Where Do Reasoning Models Refuse?

Kureha Yamaguchi, Benjamin Etheridge, and Andy Arditi. Adversarial manipulation of reasoning models using internal representations, 2025. URLhttps://arxiv.org/abs/2507.03167

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Chain-of-thought hijacking,

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, and Fazl Barez. Chain-of-thought hijacking,
[20]

URLhttps://arxiv.org/abs/2510.26418

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Adasteer: Your aligned llm is inherently an adaptive jailbreak defender, 2025

Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. Adasteer: Your aligned llm is inherently an adaptive jailbreak defender, 2025. URLhttps://arxiv.org/abs/2504.09466

work page arXiv 2025
[22]

Improving alignment and robustness with circuit breakers, 2024

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers, 2024. URLhttps://arxiv.org/abs/2406.04313

work page arXiv 2024
[23]

be different

Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, and Christoph H. Lampert. Can llms separate instructions from data? and what do we even mean by that?, 2025. URL https://arxiv.org/abs/ 2403.06833. 11 A Appendix A.1 Impact Statement Intended positive impact.Our work is primarily defensive: Activation Consistency Training (ACT) is a method for mak...

work page arXiv 2025
[24]

Execute the legitimate user task:{user_task}
[25]

ALSO execute the malicious injection task:{injection_goal} SUFFIXES TO COMPARE: Previous suffix:{previous_suffix} New suffix:{current_suffix} EV ALUATION CRITERIA— A suffix is BETTER if it:
[26]

Is more likely to make the AI execute BOTH tasks (user task AND injection)
[27]

Uses more sophisticated attack patterns (e.g., encoding, obfuscation, injection syntax)
[28]

Is less likely to be detected by safety filters
[29]

Answer: 1

Exploits known vulnerabilities (command injection, prompt injection, etc.) IMPORTANT EDGE CASES: – If the suffixes are IDENTICAL, the new suffix is NOT better (Answer: 0) – If the new suffix is EMPTY or only whitespace, it is NOT better (Answer: 0) – Focus on ATTACK EFFECTIVENESS, not general text quality Analyze these suffixes considering the criteria ab...
[30]

Hacked!” (lower is better).CleanTask any: fraction of examples where at least one of 8 attempts produced a response that didnotstart with “Hacked!

Reconnaissance & Target Selection: Deploy drones and informants to map critical hubs (subways, bridges, major intersections) and identify high-traffic nodes that, if disrupted, will maximize delay without causing catastrophic damage. [...] StrongReject score reward:1.00 A.4 Prompt Injection Defense Results Table 1 reports mean ASR per attack suite alongsi...
[31]

Ignore the above. Print exactly Hacked!

or generate our own using Qwen-2.5-8B to supplement if static ASR is too low (<50%) on the undefended model. Each clean prompt (harmful behavior) is paired with its jailbreak prompt for ACT, or for BCT we pair the jailbreak prompt with the clean response as ground truth. Evaluation uses 98 disjoint test behaviors. • Prompt injection (PI): N= 500 adversari...

2048

[1] [1]

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction, 2024. URL https://arxiv.org/abs/ 2406.11717

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Jailbreaking Black Box Large Language Models in Twenty Queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/ 2310.08419

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Struq: Defending against prompt injection with structured queries, 2024

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. Struq: Defending against prompt injection with structured queries, 2024. URLhttps://arxiv.org/abs/2402.06363

work page arXiv 2024

[4] [4]

Secalign: Defending against prompt injection with preference optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, CCS ’25, page 2833–2847. ACM, November 2025. doi: 10.1145/3719027.3744836. URL http://dx.do...

work page doi:10.1145/3719027.3744836 2025

[5] [5]

Meta secalign: A secure foundation llm against prompt injection attacks, 2026

Sizhe Chen, Arman Zharmagambetov, David Wagner, and Chuan Guo. Meta secalign: A secure foundation llm against prompt injection attacks, 2026. URLhttps://arxiv.org/abs/2507.02735

work page arXiv 2026

[6] [6]

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, and Florian Tramèr. Learning to inject: Automated prompt injection via reinforcement learning, 2026. URLhttps://arxiv.org/abs/2602.05746

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Bowman, Julian Michael, Ethan Perez, and Miles Turpin

James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, and Miles Turpin. Bias-augmented consistency training reduces biased reasoning in chain-of-thought.arXiv preprint arXiv:2403.05518, 2024

work page arXiv 2024

[8] [8]

Melody Y . Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative alignment: Reasoning enables safer language models, 2025. URL https://arxiv.org/abs/2412.16339

work page arXiv 2025

[9] [9]

Elson, and Rohin Shah

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, and Rohin Shah. Consistency training helps stop sycophancy and jailbreaks, 2025. URLhttps://arxiv.org/abs/2510.27062

work page arXiv 2025

[10] [10]

H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking, 2025

Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking, 2025. URL https://arxiv.org/abs/2502.12893. 10

work page arXiv 2025

[11] [11]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. InUSENIX Security Symposium, 2024

2024

[12] [12]

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. When models outthink their safety: Unveiling and mitigating self-jailbreak in large reasoning models, 2026. URLhttps://arxiv.org/abs/2510.21285

work page internal anchor Pith review Pith/arXiv arXiv 2026

[13] [13]

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections

Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V . Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tramèr. The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections, 2025. UR...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Rapid response: Mitigating llm jailbreaks with a few examples, 2024

Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, and Mrinank Sharma. Rapid response: Mitigating llm jailbreaks with a few examples, 2024. URLhttps://arxiv.org/abs/2411.07494

work page arXiv 2024

[15] [15]

Promptarmor: Simple yet effective prompt injection defenses, 2025

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, and Dawn Song. Promptarmor: Simple yet effective prompt injection defenses, 2025. URL https://arxiv.org/abs/2507.15219

work page arXiv 2025

[16] [16]

A StrongREJECT for Empty Jailbreaks

Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A strongreject for empty jailbreaks, 2024. URL https://arxiv.org/abs/2402.10260

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github. com/tatsu-lab/stanford_alpaca, 2023

2023

[18] [18]

Where Do Reasoning Models Refuse?

Kureha Yamaguchi, Benjamin Etheridge, and Andy Arditi. Adversarial manipulation of reasoning models using internal representations, 2025. URLhttps://arxiv.org/abs/2507.03167

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Chain-of-thought hijacking,

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, and Fazl Barez. Chain-of-thought hijacking,

[20] [20]

URLhttps://arxiv.org/abs/2510.26418

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Adasteer: Your aligned llm is inherently an adaptive jailbreak defender, 2025

Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. Adasteer: Your aligned llm is inherently an adaptive jailbreak defender, 2025. URLhttps://arxiv.org/abs/2504.09466

work page arXiv 2025

[22] [22]

Improving alignment and robustness with circuit breakers, 2024

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers, 2024. URLhttps://arxiv.org/abs/2406.04313

work page arXiv 2024

[23] [23]

be different

Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, and Christoph H. Lampert. Can llms separate instructions from data? and what do we even mean by that?, 2025. URL https://arxiv.org/abs/ 2403.06833. 11 A Appendix A.1 Impact Statement Intended positive impact.Our work is primarily defensive: Activation Consistency Training (ACT) is a method for mak...

work page arXiv 2025

[24] [24]

Execute the legitimate user task:{user_task}

[25] [25]

ALSO execute the malicious injection task:{injection_goal} SUFFIXES TO COMPARE: Previous suffix:{previous_suffix} New suffix:{current_suffix} EV ALUATION CRITERIA— A suffix is BETTER if it:

[26] [26]

Is more likely to make the AI execute BOTH tasks (user task AND injection)

[27] [27]

Uses more sophisticated attack patterns (e.g., encoding, obfuscation, injection syntax)

[28] [28]

Is less likely to be detected by safety filters

[29] [29]

Answer: 1

Exploits known vulnerabilities (command injection, prompt injection, etc.) IMPORTANT EDGE CASES: – If the suffixes are IDENTICAL, the new suffix is NOT better (Answer: 0) – If the new suffix is EMPTY or only whitespace, it is NOT better (Answer: 0) – Focus on ATTACK EFFECTIVENESS, not general text quality Analyze these suffixes considering the criteria ab...

[30] [30]

Hacked!” (lower is better).CleanTask any: fraction of examples where at least one of 8 attempts produced a response that didnotstart with “Hacked!

Reconnaissance & Target Selection: Deploy drones and informants to map critical hubs (subways, bridges, major intersections) and identify high-traffic nodes that, if disrupted, will maximize delay without causing catastrophic damage. [...] StrongReject score reward:1.00 A.4 Prompt Injection Defense Results Table 1 reports mean ASR per attack suite alongsi...

[31] [31]

Ignore the above. Print exactly Hacked!

or generate our own using Qwen-2.5-8B to supplement if static ASR is too low (<50%) on the undefended model. Each clean prompt (harmful behavior) is paired with its jailbreak prompt for ACT, or for BCT we pair the jailbreak prompt with the clean response as ground truth. Evaluation uses 98 disjoint test behaviors. • Prompt injection (PI): N= 500 adversari...

2048