pith. machine review for the scientific record.

arxiv: 2604.24983 · v1 · submitted 2026-04-27 · 💻 cs.AI

Recognition: unknown

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Benjamin C. M. Fung, Boyang Li, Ebrahim Bagheri, Miles Q. Li, Radin Hamidi Rad

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM jailbreaking · prompt embedding optimization · white-box attacks · adversarial embeddings · continuous optimization · harmful behavior benchmarks · semantic preservation

The pith

Directly optimizing prompt token embeddings enables stronger jailbreaks against aligned LLMs without changing the visible prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a multi-round process can adjust the continuous embeddings of every token already present in a user prompt to elicit harmful outputs from LLMs. Because the perturbations stay small, projecting each embedding back to its nearest vocabulary token recovers the original prompt string exactly. The method adds structured continuation targets and an adaptive schedule that prioritizes previously failed prompts, producing higher success rates than prior white-box techniques on two standard harmful-behavior benchmarks. Responses remain on the intended topic in the large majority of cases, indicating that semantic content is largely preserved even though the internal representations have been shifted. If correct, this shows that alignment can be circumvented by changes that leave no trace in the discrete token sequence.

Core claim

Prompt Embedding Optimization (PEO) performs gradient-based optimization in the continuous embedding space of the original prompt tokens, using structured continuation targets and an adaptive failure-focused schedule across multiple rounds. The resulting embeddings lie close enough to their starting points that nearest-token projection restores the exact original prompt string, yet the approach yields higher attack success rates than competing white-box methods that rely on appended discrete suffixes or search-based generation.
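The adaptive, failure-focused scheduling can be sketched as follows; this is an editorial illustration, not the paper's code, and the round count and the `attack` callback are placeholders:

```python
def adaptive_schedule(prompts, attack, rounds=3):
    """Failure-focused multi-round scheduling, sketched: every prompt is tried
    in round 1; later rounds re-attempt only the prompts that have not yet
    succeeded (where stronger settings, e.g. composite response scaffolds,
    could be switched on)."""
    pending = list(prompts)
    succeeded = []
    for round_no in range(1, rounds + 1):
        still_failing = []
        for prompt in pending:
            if attack(prompt, round_no):  # True if the jailbreak succeeded
                succeeded.append(prompt)
            else:
                still_failing.append(prompt)
        pending = still_failing  # later rounds focus only on failures
        if not pending:
            break
    return succeeded, pending

# Hypothetical attack that cracks "a" immediately, "b" from round 2, never "c".
difficulty = {"a": 1, "b": 2, "c": 99}
done, failed = adaptive_schedule(["a", "b", "c"], lambda p, r: r >= difficulty[p])
assert done == ["a", "b"] and failed == ["c"]
```

The point of the schedule is budget allocation: optimization effort in later rounds is spent only where earlier rounds failed.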

What carries the argument

Prompt Embedding Optimization (PEO), a gradient-driven process that directly perturbs the embeddings of existing prompt tokens in continuous space rather than appending new adversarial tokens.
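As a toy illustration of this mechanism (not the paper's implementation: the quadratic stand-in loss, step size, and shift bound below are invented for the sketch), gradient steps pull each existing token embedding toward a target continuation while a projection keeps it inside a small ball around its original value:

```python
import math

def peo_round(orig_embs, embs, target_embs, lr=0.1, max_shift=0.3, steps=25):
    """One optimization round, sketched: take gradient steps on the embeddings
    of the tokens already in the prompt, then clamp each embedding back into a
    small ball around its original value so that nearest-token projection can
    still recover the visible prompt string."""
    for _ in range(steps):
        for i, (orig, emb, tgt) in enumerate(zip(orig_embs, embs, target_embs)):
            # Gradient of the stand-in loss 0.5 * ||emb - tgt||^2 is (emb - tgt).
            stepped = [e - lr * (e - t) for e, t in zip(emb, tgt)]
            # Project back into the max_shift ball centred on the original embedding.
            shift = math.dist(orig, stepped)
            if shift > max_shift:
                scale = max_shift / shift
                stepped = [o + scale * (s - o) for o, s in zip(orig, stepped)]
            embs[i] = stepped
    return embs

orig = [[0.0, 0.0]]
out = peo_round(orig, [list(e) for e in orig], target_embs=[[1.0, 1.0]])
# The embedding moved toward the target but stayed within max_shift of its origin.
assert math.dist(orig[0], out[0]) <= 0.3 + 1e-9
assert math.dist(out[0], [1.0, 1.0]) < math.dist([0.0, 0.0], [1.0, 1.0])
```

In the real attack the loss would be the model's negative log-likelihood of the structured continuation target, differentiated through the frozen LLM with respect to the input embeddings.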

If this is right

  • PEO records higher attack success rates than discrete suffix search, appended adversarial embeddings, and search-based adversarial generation on two standard harmful-behavior benchmarks.
  • The optimized embeddings remain sufficiently close to the originals that nearest-token projection recovers the exact original prompt string in every case tested.
  • Quantitative checks show that model responses stay on the original topic for the large majority of prompts despite the embedding shifts.
  • Later optimization rounds can incorporate heuristic composite response scaffolds that improve performance without producing outputs that are merely scaffold artifacts, according to ASR-Judge evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Jailbreaks could become harder to detect automatically if they leave the token sequence unchanged and only alter internal embeddings.
  • Alignment training may need additional regularization in embedding space or monitoring of activation patterns rather than relying solely on token-level filters.
  • Extending the same continuous optimization idea to other safety-critical tasks, such as preventing leakage of private information, could be tested by swapping the harmful targets for privacy targets.
  • The approach might interact differently with models trained with explicit embedding-space safety constraints, providing a natural next experiment.

Load-bearing premise

The ASR-Judge scores and on-topic quantitative checks reflect genuine semantic preservation and real attack gains rather than artifacts created by the structured continuation targets or composite response scaffolds used in later rounds.

What would settle it

Apply PEO to a held-out set of prompts, project every final embedding to its nearest vocabulary token, and verify whether the recovered text string matches the input prompt exactly while the model still generates the targeted harmful content.
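That check can be sketched end-to-end with a toy vocabulary; the embedding table, the Euclidean distance metric, and the perturbation sizes here are illustrative assumptions, not details taken from the paper:

```python
import math

# Toy embedding table: token string -> 2-D embedding (illustrative only).
vocab = {"how": [0.0, 1.0], "to": [1.0, 0.0], "make": [0.0, -1.0], "tea": [-1.0, 0.0]}

def nearest_token(emb):
    # Euclidean nearest-neighbour projection back to the discrete vocabulary.
    return min(vocab, key=lambda t: math.dist(vocab[t], emb))

def recovers_exact_prompt(prompt_tokens, optimized_embs):
    # The settling test: project every optimized embedding and compare strings.
    projected = [nearest_token(e) for e in optimized_embs]
    return projected == prompt_tokens

prompt = ["how", "to", "make", "tea"]
# Small perturbations (inside each token's nearest-neighbour margin) project back.
perturbed = [[x + 0.1 for x in vocab[t]] for t in prompt]
assert recovers_exact_prompt(prompt, perturbed)
# A large perturbation crosses into another token's cell and breaks recovery.
assert not recovers_exact_prompt(prompt, [[1.0, 0.9]] + perturbed[1:])
```

The same loop, run over a held-out prompt set with the model's real embedding table, would directly test the exact-recovery claim alongside the harmful-output condition.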

Figures

Figures reproduced from arXiv: 2604.24983 by Benjamin C. M. Fung, Boyang Li, Ebrahim Bagheri, Miles Q. Li, Radin Hamidi Rad.

Figure 1
Figure 1: Overview of PEO against token-appending attacks. Top: a representative token-appending attack (nanoGCG) appends visible adversarial tokens. Bottom: PEO perturbs the embeddings of existing prompt tokens (green glow), preserving the visible text exactly, with adaptive multi-round scheduling. view at source ↗
read the original abstract

Existing white-box jailbreak attacks against aligned LLMs typically append discrete adversarial suffixes to the user prompt, which visibly alters the prompt and operates in a combinatorial token space. Prior work has avoided directly optimizing the embeddings of the original prompt tokens, presumably because perturbing them risks destroying the prompt's semantic content. We propose Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens without appending any adversarial tokens, and show that the concern is unfounded: the optimized embeddings remain close enough to their originals that the visible prompt string is preserved exactly after nearest-token projection, and quantitative analysis shows the model's responses stay on topic for the large majority of prompts. PEO combines continuous embedding-space optimization with structured continuation targets and an adaptive failure-focused schedule. Counterintuitively, later PEO rounds can benefit from heuristic composite response scaffolds that are not natural standalone templates, yet ASR-Judge shows that the resulting gains are not merely empty formatting or scaffold-only outputs. Across two standard harmful-behavior benchmarks and competing white-box attacks spanning discrete suffix search, appended adversarial embeddings, and search-based adversarial generation, PEO outperforms all of them in our experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the continuous embeddings of the original prompt tokens (without appending adversarial suffixes or tokens). It combines this with structured continuation targets and an adaptive failure-focused schedule, claiming that the optimized embeddings remain sufficiently close to originals for exact nearest-token projection (preserving the visible prompt string) while achieving higher attack success rates than prior white-box methods (discrete suffix search, appended adversarial embeddings, search-based generation) on two standard harmful-behavior benchmarks. Quantitative on-topic analysis and ASR-Judge evaluations are reported to support semantic preservation and that gains are not scaffold-only artifacts.

Significance. If the central empirical claims hold after addressing attribution concerns, the work would be significant for demonstrating that direct embedding-space optimization can yield effective, low-visibility jailbreaks without destroying prompt semantics, contrary to prior assumptions in the field. This provides a new attack vector and could inform alignment research, though the current evidence for isolating the contribution of embedding optimization is limited.

major comments (2)
  1. [§4 and §3.2] §4 (Experiments) and §3.2 (Adaptive Schedule): The reported outperformance of PEO over baselines is measured under the same structured continuation targets and heuristic composite response scaffolds used during optimization. No ablation is presented that holds targets/scaffolds fixed while applying only the embedding perturbation (or that applies identical scaffolds to the discrete-suffix and search-based baselines). This makes it impossible to attribute the ASR gains specifically to the embedding optimization rather than the multi-round schedule or scaffolds, which is load-bearing for the claim that 'the concern [about semantic destruction] is unfounded' and that PEO is a distinct embedding-space attack.
  2. [§4.3] §4.3 (On-topic Analysis) and ASR-Judge description: The quantitative on-topic metric and ASR-Judge results are evaluated on responses generated under the same composite scaffolds and continuation targets. Without a control condition that applies the scaffolds to non-PEO prompts or measures on-topic rates for scaffold-only baselines, it remains unclear whether the reported semantic preservation and non-trivial gains are artifacts of the evaluation setup rather than properties of the optimized embeddings.
minor comments (2)
  1. [§3] The abstract and §3 mention 'nearest-token projection' but do not specify the exact distance metric or projection procedure used to confirm exact string preservation; a short algorithmic description or pseudocode would improve reproducibility.
  2. [Table 1] Table 1 (or equivalent benchmark results table) lacks error bars, number of runs, or statistical significance tests for the reported ASR improvements; adding these would strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify limitations in how we currently isolate the contribution of embedding optimization from the other components of PEO. We address each point below and will revise the manuscript with additional ablations and controls to strengthen attribution.

read point-by-point responses
  1. Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Adaptive Schedule): The reported outperformance of PEO over baselines is measured under the same structured continuation targets and heuristic composite response scaffolds used during optimization. No ablation is presented that holds targets/scaffolds fixed while applying only the embedding perturbation (or that applies identical scaffolds to the discrete-suffix and search-based baselines). This makes it impossible to attribute the ASR gains specifically to the embedding optimization rather than the multi-round schedule or scaffolds, which is load-bearing for the claim that 'the concern [about semantic destruction] is unfounded' and that PEO is a distinct embedding-space attack.

    Authors: We agree that the current experimental design does not fully isolate the embedding optimization. The targets and scaffolds are integral to PEO's multi-round process, and the baselines follow their original formulations without them. In the revision we will add two new sets of experiments: (1) applying the identical structured continuation targets and composite scaffolds to the discrete-suffix and search-based baselines, and (2) an ablation of PEO that disables the adaptive schedule while retaining only the embedding optimization. These results will be reported alongside the existing comparisons to clarify the specific contribution of continuous embedding perturbation. revision: yes

  2. Referee: [§4.3] §4.3 (On-topic Analysis) and ASR-Judge description: The quantitative on-topic metric and ASR-Judge results are evaluated on responses generated under the same composite scaffolds and continuation targets. Without a control condition that applies the scaffolds to non-PEO prompts or measures on-topic rates for scaffold-only baselines, it remains unclear whether the reported semantic preservation and non-trivial gains are artifacts of the evaluation setup rather than properties of the optimized embeddings.

    Authors: We accept this critique. The on-topic and ASR-Judge metrics currently lack explicit scaffold-only controls. We will add, in the revised §4.3, evaluations that apply the same composite scaffolds to the original (non-optimized) prompts and to the baseline methods, reporting both on-topic rates and ASR-Judge scores for these conditions. This will demonstrate that the high on-topic rates arise from the optimized embeddings remaining sufficiently close to the originals for exact nearest-token projection, rather than from the scaffolds alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmarks

full rationale

The paper proposes an empirical jailbreak technique (PEO) that optimizes prompt embeddings while using continuation targets and an adaptive schedule, then validates it via direct comparisons to prior white-box attacks on standard benchmarks. No equations, derivations, or self-referential predictions appear in the abstract or description. Performance claims rest on experimental results rather than any reduction of outputs to fitted inputs or self-citations by construction. The work is validated against external baselines and is self-contained, consistent with a circularity score of 2.0 (minor or absent circularity).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical contribution relying on standard gradient-based optimization and LLM evaluation practices; no free parameters, axioms, or invented entities are introduced or required beyond those in the referenced prior attacks.

pith-pipeline@v0.9.0 · 5516 in / 1246 out tokens · 31772 ms · 2026-05-08T03:24:29.751657+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 15 canonical work pages · 9 internal anchors

  1. [1]

    Alzantot, M., Sharma, Y., Elgohary, A., Ho, B.J., Srivastava, M., Chang, K.W., 2018. Generating natural language adversarial examples, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 2890–2896

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al., 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073

  3. [3]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models, in: Advances in Neural Information Processing Systems

    Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramèr, F., Hassani, H., Wong, E., 2024. Jailbreakbench: An open robustness benchmark for jailbreaking large language models, in: Advances in Neural Information Processing Systems

  4. [4]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E., 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419

  5. [5]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/

    Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P., 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/

  6. [6]

    Ebrahimi, J., Rao, A., Lowd, D., Dou, D., 2018. Hotflip: White-box adversarial examples for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics. pp. 31–36

  7. [7]

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.,

  8. [8]

    The Llama 3 Herd of Models

    The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  9. [9]

    nanoGCG. https://github.com/GraySwanAI/nanoGCG

    Gray Swan AI, 2024. nanoGCG. https://github.com/GraySwanAI/nanoGCG

  10. [10]

    Iyyer, M., Wieting, J., Gimpel, K., Zettlemoyer, L., 2018. Adversarial example generation with syntactically controlled paraphrase networks, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics. pp. 1875–1885

  11. [11]

    arXiv preprint arXiv:2501.18280

    Liang, H., Sun, Y., Cai, Y., Zhu, J., Zhang, B., 2025. Jailbreaking LLMs' safeguard with universal magic words for text embedding models. arXiv preprint arXiv:2501.18280

  12. [12]

    Jailjudge: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework. arXiv preprint arXiv:2410.12855

    Liu, F., Feng, Y., Xu, Z., Su, L., Ma, X., Yin, D., Liu, H., 2024a. Jailjudge: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework. arXiv preprint arXiv:2410.12855

  13. [13]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models, in: The Twelfth International Conference on Learning Representations

    Liu, X., Xu, N., Chen, M., Xiao, C., 2024b. Autodan: Generating stealthy jailbreak prompts on aligned large language models, in: The Twelfth International Conference on Learning Representations

  14. [14]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., Hendrycks, D., 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249

  15. [15]

    Tree of attacks: Jailbreaking black-box llms automatically, in: Advances in Neural Information Processing Systems

    Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A., 2024. Tree of attacks: Jailbreaking black-box llms automatically, in: Advances in Neural Information Processing Systems

  16. [16]

    Training language models to follow instructions with human feedback, in: Advances in Neural Information Processing Systems, Curran Associates

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al., 2022. Training language models to follow instructions with human feedback, in: Advances in Neural Information Processing Systems, Curran Associates. pp. 27730–27744

  17. [17]

    arXiv preprint arXiv:2412.03876

    Peng, J., Tang, Z., Liu, G., Fleming, C., Hong, M., 2024. Safeguarding text-to-image generation via inference-time prompt-noise optimization. arXiv preprint arXiv:2412.03876

  18. [18]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D., Finn, C., 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290

  19. [19]

    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    Robey, A., Wong, E., Hassani, H., Pappas, G.J., 2024. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684

  20. [20]

    Fast adversarial attacks on language models in one gpu minute, in: Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR

    Sadasivan, V.S., Saha, S., Sriramanan, G., Kattakinda, P., Chegini, A., Feizi, S., 2024. Fast adversarial attacks on language models in one gpu minute, in: Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR

  21. [21]

    Rainbow teaming: Open-ended generation of diverse adversarial prompts

    Samvelyan, M., Raparthy, S.C., Lupu, A., Hambro, E., Markosyan, A.H., Bhatt, M., Tian, Y., Jiang, E., Raileanu, R., Rocktäschel, T., Whiteson, S., 2024. Rainbow teaming: Open-ended generation of diverse adversarial prompts. arXiv preprint arXiv:2402.16822

  22. [22]

    Schwinn, L., Dobre, D., Xhonneux, S., Gidel, G., Günnemann, S.,

  23. [23]

    Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space, in: Advances in Neural Information Processing Systems

  24. [24]

    Shin, T., Razeghi, Y., Logan IV, R.L., Wallace, E., Singh, S., 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics. pp. 4222–4235

  25. [25]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al., 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  26. [26]

    Adversarial preference learning for robust llm alignment, in: Findings of the Association for Computational Linguistics: ACL 2025, Association for Computational Linguistics

    Wang, Y., Wang, P., Xi, C., Tang, B., Zhu, J., Wei, W., Chen, C., Yang, C., Zhang, J., Lu, C., Niu, Y., Mao, K., Li, Z., Xiong, F., Hu, J., Yang, M., 2025. Adversarial preference learning for robust llm alignment, in: Findings of the Association for Computational Linguistics: ACL 2025, Association for Computational Linguistics

  27. [27]

    Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery

    Wen, Y., Jain, N., Kirchenbauer, J., Goldblum, M., Geiping, J., Goldstein, T., 2023. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668

  28. [28]

    Continuous embedding attacks via clipped inputs in jailbreaking large language models

    Xu, Z., Liu, Y., Deng, G., Wang, K., Li, Y., Shi, L., Picek, S., 2024. Continuous embedding attacks via clipped inputs in jailbreaking large language models. arXiv preprint arXiv:2407.13796

  29. [29]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al., 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  30. [30]

    Advprefix: An objective for nuanced llm jailbreaks, in: Advances in Neural Information Processing Systems, Curran Associates

    Zhu, S., Amos, B., Tian, Y., Guo, C., Evtimov, I., 2025. Advprefix: An objective for nuanced llm jailbreaks, in: Advances in Neural Information Processing Systems, Curran Associates

  31. [31]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M., 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043

  32. [32]

    JAILJUDGE [11] goes further and proposes a multi-agent evaluation framework with belief-fusion rather than a single binary judge

    uses a single fine-tuned Llama-2-13B classifier, and JailbreakBench [3] likewise chooses a single default judge, Llama-3-70B, specifically because of its strong agreement with experts and relatively low false-positive rate. JAILJUDGE [11] goes further and proposes a multi-agent evaluation framework with belief-fusion rather than a single binary judge. None of these t...