Safety Targeted Embedding Exploit via Refinement

Joshua Adrian Cahyono

arxiv: 2607.01859 · v1 · pith:V4DWYQEJnew · submitted 2026-07-02 · 💻 cs.AI · cs.CL

Pith reviewed 2026-07-03 13:20 UTC · model grok-4.3

keywords LLM safety · jailbreaking · multilingual attacks · code-switching · adversarial attacks · low-resource languages · gradient-guided attacks · refusal suppression

The pith

Safety mechanisms aligned primarily on English cannot be assumed to generalize across multilingual inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models' safety training, conducted mostly in English, leaves them vulnerable to harmful prompts that incorporate low-resource languages through code-switching. It introduces STEER, a method that uses gradients to find words driving refusal behavior and then translates those words iteratively into low-resource languages to bypass safety while aiming to keep the original harmful intent. This approach yields high attack success rates across multiple models and even transfers to closed models like GPT-4o-mini. A sympathetic reader would care because it highlights an epistemic gap in current safety practices, meaning models can confidently produce harmful content outside their training distribution. The work argues that better multilingual coverage in alignment and explicit out-of-distribution detection are needed to address this.

Core claim

Safety training for large language models is conducted predominantly in English, leaving uncertain how well safety mechanisms generalize to low-resource languages and mixed-language code-switching. This creates an epistemic gap in which models confidently generate harmful responses for inputs that fall outside the distribution of their safety training. STEER identifies words contributing most strongly to refusal behavior and iteratively translates them into low-resource languages to suppress refusal while preserving harmful intent, achieving attack success rates of up to 93.0% on JailbreakBench and 96.7% on AdvBench across six open-source 8B-parameter models while outperforming the random code-switching and Greedy Coordinate Gradient (GCG) baselines.

What carries the argument

STEER, a gradient-guided attack that identifies refusal-contributing words and iteratively translates them into low-resource languages to suppress refusal while preserving harmful intent.

If this is right

STEER reaches attack success rates of up to 93.0% on JailbreakBench and 96.7% on AdvBench for six open-source 8B-parameter models.
The method outperforms random code-switching and Greedy Coordinate Gradient attacks on the same benchmarks.
Prompts produced by STEER transfer to GPT-4o-mini and achieve 35.5% attack success without access to the target model.
The observed weakness is not specific to a single model architecture.
Improving multilingual safety requires broader coverage during alignment and mechanisms that explicitly detect and abstain on out-of-distribution inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment pipelines could incorporate synthetic low-resource language data during training to reduce the observed gap.
Runtime filters that flag code-switched or translated prompts might serve as an immediate defense layer.
Similar gradient-guided translation attacks could be evaluated on other safety-critical systems beyond text LLMs.
Future safety benchmarks should systematically include mixed-language and low-resource test cases to measure generalization.
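
As an illustration of the runtime-filter idea above, here is a minimal sketch of a mixed-script detector. It is not taken from the paper. The function names `script_counts` and `looks_code_switched` and the `min_minor_share` threshold are hypothetical, and the script of a character is approximated by the first word of its Unicode name, which is only a heuristic.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per script.

    The script is approximated by the first word of the character's
    Unicode name (e.g. 'LATIN', 'CYRILLIC'); this is a heuristic,
    not a real script-property lookup.
    """
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def looks_code_switched(text: str, min_minor_share: float = 0.1) -> bool:
    """Flag text in which letters outside the dominant script make up
    at least `min_minor_share` of all letters (hypothetical threshold)."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return False
    dominant = counts.most_common(1)[0][1]
    return (total - dominant) / total >= min_minor_share


if __name__ == "__main__":
    print(looks_code_switched("hello world"))
    print(looks_code_switched("hello мир world"))
```

A filter like this would miss code-switching into languages written in the same script as the surrounding text, so it could at most be one signal among several.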

Load-bearing premise

The assumption that iteratively translating selected words into low-resource languages suppresses refusal behavior while fully preserving the original harmful intent and semantic meaning of the prompt.

What would settle it

A controlled test in which human raters or semantic similarity metrics show that the translated prompts change the harmful intent or meaning, and in which the attack success rate on meaning-preserving prompts falls to levels comparable to the original English prompts.
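
The semantic-similarity half of such a test could look roughly like the sketch below. It assumes each prompt has already been mapped to a vector by some multilingual sentence encoder, which the source does not specify. The names `cosine_similarity` and `has_drifted` and the 0.8 threshold are placeholders for illustration, not values from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def has_drifted(original_vec: Sequence[float],
                rewritten_vec: Sequence[float],
                threshold: float = 0.8) -> bool:
    """Return True when similarity falls below the (assumed) threshold,
    i.e. the rewritten prompt may no longer carry the original meaning."""
    return cosine_similarity(original_vec, rewritten_vec) < threshold
```

Prompts flagged by `has_drifted` would be excluded before attack success is recomputed, which is the comparison the test above calls for. Embedding similarity is a coarse proxy, so human rating would still be needed for borderline cases.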

Figures

Figures reproduced from arXiv: 2607.01859 by Joshua Adrian Cahyono.

**Figure 1.** STEER pipeline. Paraphrased prompts feed gradient-based word attribution against the mech-interp-identified ...

**Figure 2.** Cumulative success rate vs. max iterations (GLM ...

**Figure 3.** Refusal score distributions (Llama-3-8B, JBB).

**Figure 4.** Per-layer Fisher ratio across all transformer layers ...

Original abstract

Safety training for large language models (LLMs) is conducted predominantly in English, leaving uncertain how well safety mechanisms generalize to low-resource languages and mixed-language code-switching. We show that this creates an epistemic gap in which models confidently generate harmful responses for inputs that fall outside the distribution of their safety training. To study this phenomenon, we introduce STEER (Safety Targeted Embedding Exploit via Refinement), a gradient-guided attack that identifies words contributing most strongly to the model's refusal behavior and iteratively translates them into low-resource languages to suppress refusal while preserving harmful intent. Across six open-source 8B-parameter models, STEER achieves attack success rates of up to 93.0% on JailbreakBench and 96.7% on AdvBench, outperforming random code-switching and Greedy Coordinate Gradient (GCG). The resulting prompts also transfer to GPT-4o-mini, achieving a 35.5% attack success rate without requiring access to the target model, suggesting that the underlying weakness is not specific to a single architecture. These findings demonstrate that safety mechanisms aligned primarily on English cannot be assumed to generalize across multilingual inputs. We argue that improving multilingual safety requires broader coverage during alignment and mechanisms that explicitly detect and abstain on out-of-distribution inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

STEER shows a gradient-directed way to build code-switched prompts that raise attack success on open 8B models and transfer at 35% to GPT-4o-mini, but the claim that intent stays intact after translation rests on an unverified assumption.

The main thing here is that STEER uses gradients on embeddings to pick which words to translate into low-resource languages, then iterates until refusal drops. It reports up to 93% ASR on JailbreakBench and 96.7% on AdvBench for six 8B open models, beats random code-switching and GCG, and gets 35.5% transfer on GPT-4o-mini without white-box access.

The procedure itself is the clearest addition: gradient selection plus targeted translation in one loop. The transfer number is useful because it points to a weakness that is not limited to the attacked models. The paper also states the broader point plainly: English-heavy safety training leaves gaps for mixed-language inputs.

The soft spot is exactly the one the stress-test note flags. The abstract says the refined prompts keep harmful intent, yet gives no back-translation check, embedding similarity threshold, or human rating to confirm it. If the code-switched versions become incoherent or shift meaning, the high ASR could reflect parser failure rather than a safety generalization failure. That assumption carries the generalization claim and the transfer interpretation, so its lack of support is material.

Evaluation details are also thin in the abstract: no mention of variance across runs, exact prompt templates, or controls for semantic drift. The work is empirical and cites external benchmarks, so there is no obvious circularity.

This is for people who work on LLM safety and alignment, especially anyone thinking about multilingual or OOD inputs. A reader who wants concrete attack numbers and a transferable method will find value. The core empirical pattern is worth referee time even if the semantic-preservation step needs tightening.

Referee Report

2 major / 0 minor

Summary. The paper introduces STEER, a gradient-guided attack method that identifies the words contributing most to LLM refusal behavior and iteratively translates them into low-resource languages to produce code-switched prompts. It reports attack success rates (ASR) of up to 93.0% on JailbreakBench and 96.7% on AdvBench across six 8B open-source models, outperforming random code-switching and GCG baselines, with 35.5% transfer ASR to GPT-4o-mini. The central claim is that English-centric safety training does not generalize to multilingual or code-switched inputs, necessitating broader alignment coverage and OOD detection mechanisms.

Significance. If the semantic-preservation assumption holds, the work provides concrete empirical evidence of a multilingual safety gap, supported by multi-model evaluation and cross-model transfer results that serve as an independent check. The use of standard benchmarks (JailbreakBench, AdvBench) and comparison to GCG strengthens the attack evaluation. However, without verification of intent preservation, the generalization conclusion rests on an untested premise that could be addressed by adding controls within the manuscript scope.

major comments (2)

[Abstract] Abstract (STEER description paragraph): The claim that translation 'suppresses refusal while preserving harmful intent' is load-bearing for the generalization conclusion and transfer results, yet no verification mechanism (back-translation check, semantic similarity threshold, or human rating) is described to confirm that code-switched outputs retain identical intent rather than introducing drift, dilution, or incoherence that could explain elevated ASR via parser failure instead of a true multilingual gap.
[Abstract] Evaluation (implied in abstract reporting of ASRs): No details are given on statistical significance testing for the reported success rates, exact construction of the translated prompts, or controls for semantic drift post-translation; this leaves open whether the 93.0%/96.7% figures and 35.5% transfer reliably demonstrate the claimed safety generalization failure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of verifying semantic preservation and providing additional evaluation details. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract] Abstract (STEER description paragraph): The claim that translation 'suppresses refusal while preserving harmful intent' is load-bearing for the generalization conclusion and transfer results, yet no verification mechanism (back-translation check, semantic similarity threshold, or human rating) is described to confirm that code-switched outputs retain identical intent rather than introducing drift, dilution, or incoherence that could explain elevated ASR via parser failure instead of a true multilingual gap.

Authors: We agree that explicit verification mechanisms would make the semantic-preservation assumption more robust. The gradient-guided targeting of refusal words combined with the observed transfer ASR to GPT-4o-mini (35.5%) provides indirect support that the effect is not solely due to incoherence or parser failure, as random code-switching baselines yield substantially lower success. However, we will add a new subsection in the Methods section reporting quantitative controls: back-translation fidelity checks and cosine similarity thresholds on sentence embeddings for a sampled subset of prompts. These will be presented with results confirming high intent retention. revision: yes
Referee: [Abstract] Evaluation (implied in abstract reporting of ASRs): No details are given on statistical significance testing for the reported success rates, exact construction of the translated prompts, or controls for semantic drift post-translation; this leaves open whether the 93.0%/96.7% figures and 35.5% transfer reliably demonstrate the claimed safety generalization failure.

Authors: We will revise the Evaluation and Experimental Setup sections to include: binomial confidence intervals and significance tests for all reported ASRs; a precise algorithmic description of the iterative translation pipeline; and the semantic-drift controls described in the response to the first comment. These additions address the concern directly and remain within the existing experimental scope. revision: yes
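
To make the promised binomial confidence intervals concrete, here is a minimal sketch using the Wilson score interval, one common choice for proportions near 0 or 1. The rebuttal does not say which interval the authors would use, and the trial count in the usage line is an assumed figure for illustration only; the source does not state how many prompts each ASR was computed over.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumed example: 93 successes out of 100 attempts (trial count is hypothetical).
    low, high = wilson_interval(93, 100)
    print(f"ASR 93.0%, approx. 95% CI: [{low:.3f}, {high:.3f}]")
```

With small evaluation sets the interval is wide, which is why the referee's request matters for comparing STEER against baselines whose ASRs may be close.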

Circularity Check

0 steps flagged

No circularity: empirical attack evaluation on external benchmarks

The paper introduces STEER as a gradient-guided attack and reports attack success rates measured on independent benchmarks (JailbreakBench, AdvBench) plus transfer to GPT-4o-mini. No equations, fitted parameters, or self-citations reduce the reported outcomes to quantities defined by the same inputs. The central claim rests on observable ASR differences rather than any self-referential derivation or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on the existence of a gradient signal that correlates with refusal and on the assumption that low-resource language tokens exist in the model's vocabulary.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 10 internal anchors

[1] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.

[2] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

[3] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.

[4] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318.

[5] Deep Ganguli, Liane Lovitt, Jackson Kernion, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.

[6] Linghan Huang, Haolin Jin, Zhaoge Bi, Pengyue Yang, Peizhou Zhao, Taozhao Chen, Xiongfei Wu, Lei Ma, and Huaming Chen. The tower of babel revisited: Multilingual jailbreak prompts on closed-source large language models. arXiv preprint arXiv:2505.12287.

[7] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.

[8] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.

[9] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. arXiv preprint arXiv:2312.02119.

[10] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, et al. Red teaming language models with language models. arXiv preprint arXiv:2202.03286.

[11] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

[12] Bibek Upadhayay and Vahid Behzadan. Sandwich attack: Multi-language mixture adaptive attack on LLMs. arXiv preprint arXiv:2404.07242.

[13] Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, and Johannes Gasteiger. The geometry of refusal in large language models: Concept cones and representational independence. arXiv preprint arXiv:2502.17420.

[14] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. ar...
