pith. machine review for the scientific record.

arxiv: 2604.24983 · v1 · submitted 2026-04-27 · 💻 cs.AI

Recognition: unknown

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

Benjamin C. M. Fung, Boyang Li, Ebrahim Bagheri, Miles Q. Li, Radin Hamidi Rad

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM jailbreaking · prompt embedding optimization · white-box attacks · adversarial embeddings · continuous optimization · harmful behavior benchmarks · semantic preservation

The pith

Directly optimizing prompt token embeddings enables stronger jailbreaks against aligned LLMs without changing the visible prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a multi-round process can adjust the continuous embeddings of every token already present in a user prompt to elicit harmful outputs from LLMs. Because the perturbations stay small, projecting each embedding back to its nearest vocabulary token recovers the original prompt string exactly. The method adds structured continuation targets and an adaptive schedule that prioritizes previously failed prompts, producing higher success rates than prior white-box techniques on two standard harmful-behavior benchmarks. Responses remain on the intended topic in the large majority of cases, indicating that semantic content is largely preserved even though the internal representations have been shifted. If correct, this shows that alignment can be circumvented by changes that leave no trace in the discrete token sequence.

Core claim

Prompt Embedding Optimization (PEO) performs gradient-based optimization in the continuous embedding space of the original prompt tokens, using structured continuation targets and an adaptive failure-focused schedule across multiple rounds. The resulting embeddings lie close enough to their starting points that nearest-token projection restores the exact original prompt string, yet the approach yields higher attack success rates than competing white-box methods that rely on appended discrete suffixes or search-based generation.
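The adaptive, failure-focused scheduling can be sketched as follows; this is an editorial illustration, not the paper's code, and the round count and the `attack` callback are placeholders:

```python
def adaptive_schedule(prompts, attack, rounds=3):
    """Failure-focused multi-round scheduling, sketched: every prompt is tried
    in round 1; later rounds re-attempt only the prompts that have not yet
    succeeded (where stronger settings, e.g. composite response scaffolds,
    could be switched on)."""
    pending = list(prompts)
    succeeded = []
    for round_no in range(1, rounds + 1):
        still_failing = []
        for prompt in pending:
            if attack(prompt, round_no):  # True if the jailbreak succeeded
                succeeded.append(prompt)
            else:
                still_failing.append(prompt)
        pending = still_failing  # later rounds focus only on failures
        if not pending:
            break
    return succeeded, pending

# Hypothetical attack that cracks "a" immediately, "b" from round 2, never "c".
difficulty = {"a": 1, "b": 2, "c": 99}
done, failed = adaptive_schedule(["a", "b", "c"], lambda p, r: r >= difficulty[p])
assert done == ["a", "b"] and failed == ["c"]
```

The point of the schedule is budget allocation: optimization effort in later rounds is spent only where earlier rounds failed.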

What carries the argument

Prompt Embedding Optimization (PEO), a gradient-driven process that directly perturbs the embeddings of existing prompt tokens in continuous space rather than appending new adversarial tokens.
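As a toy illustration of this mechanism (not the paper's implementation: the quadratic stand-in loss, step size, and shift bound below are invented for the sketch), gradient steps pull each existing token embedding toward a target continuation while a projection keeps it inside a small ball around its original value:

```python
import math

def peo_round(orig_embs, embs, target_embs, lr=0.1, max_shift=0.3, steps=25):
    """One optimization round, sketched: take gradient steps on the embeddings
    of the tokens already in the prompt, then clamp each embedding back into a
    small ball around its original value so that nearest-token projection can
    still recover the visible prompt string."""
    for _ in range(steps):
        for i, (orig, emb, tgt) in enumerate(zip(orig_embs, embs, target_embs)):
            # Gradient of the stand-in loss 0.5 * ||emb - tgt||^2 is (emb - tgt).
            stepped = [e - lr * (e - t) for e, t in zip(emb, tgt)]
            # Project back into the max_shift ball centred on the original embedding.
            shift = math.dist(orig, stepped)
            if shift > max_shift:
                scale = max_shift / shift
                stepped = [o + scale * (s - o) for o, s in zip(orig, stepped)]
            embs[i] = stepped
    return embs

orig = [[0.0, 0.0]]
out = peo_round(orig, [list(e) for e in orig], target_embs=[[1.0, 1.0]])
# The embedding moved toward the target but stayed within max_shift of its origin.
assert math.dist(orig[0], out[0]) <= 0.3 + 1e-9
assert math.dist(out[0], [1.0, 1.0]) < math.dist([0.0, 0.0], [1.0, 1.0])
```

In the real attack the loss would be the model's negative log-likelihood of the structured continuation target, differentiated through the frozen LLM with respect to the input embeddings.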

If this is right

  • PEO records higher attack success rates than discrete suffix search, appended adversarial embeddings, and search-based adversarial generation on two standard harmful-behavior benchmarks.
  • The optimized embeddings remain sufficiently close to the originals that nearest-token projection recovers the exact original prompt string in every case tested.
  • Quantitative checks show that model responses stay on the original topic for the large majority of prompts despite the embedding shifts.
  • Later optimization rounds can incorporate heuristic composite response scaffolds that improve performance without producing outputs that are merely scaffold artifacts, according to ASR-Judge evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Jailbreaks could become harder to detect automatically if they leave the token sequence unchanged and only alter internal embeddings.
  • Alignment training may need additional regularization in embedding space or monitoring of activation patterns rather than relying solely on token-level filters.
  • Extending the same continuous optimization idea to other safety-critical tasks, such as preventing leakage of private information, could be tested by swapping the harmful targets for privacy targets.
  • The approach might interact differently with models trained with explicit embedding-space safety constraints, providing a natural next experiment.

Load-bearing premise

The ASR-Judge scores and on-topic quantitative checks reflect genuine semantic preservation and real attack gains rather than artifacts created by the structured continuation targets or composite response scaffolds used in later rounds.

What would settle it

Apply PEO to a held-out set of prompts, project every final embedding to its nearest vocabulary token, and verify whether the recovered text string matches the input prompt exactly while the model still generates the targeted harmful content.
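That check can be sketched end-to-end with a toy vocabulary; the embedding table, the Euclidean distance metric, and the perturbation sizes here are illustrative assumptions, not details taken from the paper:

```python
import math

# Toy embedding table: token string -> 2-D embedding (illustrative only).
vocab = {"how": [0.0, 1.0], "to": [1.0, 0.0], "make": [0.0, -1.0], "tea": [-1.0, 0.0]}

def nearest_token(emb):
    # Euclidean nearest-neighbour projection back to the discrete vocabulary.
    return min(vocab, key=lambda t: math.dist(vocab[t], emb))

def recovers_exact_prompt(prompt_tokens, optimized_embs):
    # The settling test: project every optimized embedding and compare strings.
    projected = [nearest_token(e) for e in optimized_embs]
    return projected == prompt_tokens

prompt = ["how", "to", "make", "tea"]
# Small perturbations (inside each token's nearest-neighbour margin) project back.
perturbed = [[x + 0.1 for x in vocab[t]] for t in prompt]
assert recovers_exact_prompt(prompt, perturbed)
# A large perturbation crosses into another token's cell and breaks recovery.
assert not recovers_exact_prompt(prompt, [[1.0, 0.9]] + perturbed[1:])
```

The same loop, run over a held-out prompt set with the model's real embedding table, would directly test the exact-recovery claim alongside the harmful-output condition.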

Figures

Figures reproduced from arXiv: 2604.24983 by Benjamin C. M. Fung, Boyang Li, Ebrahim Bagheri, Miles Q. Li, Radin Hamidi Rad.

Figure 1
Figure 1: Overview of PEO against token-appending attacks. Top: a representative token-appending attack (nanoGCG) appends visible adversarial tokens. Bottom: PEO perturbs the embeddings of existing prompt tokens (green glow), preserving the visible text exactly, with adaptive multi-round scheduling. view at source ↗
read the original abstract

Existing white-box jailbreak attacks against aligned LLMs typically append discrete adversarial suffixes to the user prompt, which visibly alters the prompt and operates in a combinatorial token space. Prior work has avoided directly optimizing the embeddings of the original prompt tokens, presumably because perturbing them risks destroying the prompt's semantic content. We propose Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the embeddings of the original prompt tokens without appending any adversarial tokens, and show that the concern is unfounded: the optimized embeddings remain close enough to their originals that the visible prompt string is preserved exactly after nearest-token projection, and quantitative analysis shows the model's responses stay on topic for the large majority of prompts. PEO combines continuous embedding-space optimization with structured continuation targets and an adaptive failure-focused schedule. Counterintuitively, later PEO rounds can benefit from heuristic composite response scaffolds that are not natural standalone templates, yet ASR-Judge shows that the resulting gains are not merely empty formatting or scaffold-only outputs. Across two standard harmful-behavior benchmarks and competing white-box attacks spanning discrete suffix search, appended adversarial embeddings, and search-based adversarial generation, PEO outperforms all of them in our experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Prompt Embedding Optimization (PEO), a multi-round white-box jailbreak that directly optimizes the continuous embeddings of the original prompt tokens (without appending adversarial suffixes or tokens). It combines this with structured continuation targets and an adaptive failure-focused schedule, claiming that the optimized embeddings remain sufficiently close to originals for exact nearest-token projection (preserving the visible prompt string) while achieving higher attack success rates than prior white-box methods (discrete suffix search, appended adversarial embeddings, search-based generation) on two standard harmful-behavior benchmarks. Quantitative on-topic analysis and ASR-Judge evaluations are reported to support semantic preservation and that gains are not scaffold-only artifacts.

Significance. If the central empirical claims hold after addressing attribution concerns, the work would be significant for demonstrating that direct embedding-space optimization can yield effective, low-visibility jailbreaks without destroying prompt semantics, contrary to prior assumptions in the field. This provides a new attack vector and could inform alignment research, though the current evidence for isolating the contribution of embedding optimization is limited.

major comments (2)
  1. [§4 and §3.2] §4 (Experiments) and §3.2 (Adaptive Schedule): The reported outperformance of PEO over baselines is measured under the same structured continuation targets and heuristic composite response scaffolds used during optimization. No ablation is presented that holds targets/scaffolds fixed while applying only the embedding perturbation (or that applies identical scaffolds to the discrete-suffix and search-based baselines). This makes it impossible to attribute the ASR gains specifically to the embedding optimization rather than the multi-round schedule or scaffolds, which is load-bearing for the claim that 'the concern [about semantic destruction] is unfounded' and that PEO is a distinct embedding-space attack.
  2. [§4.3] §4.3 (On-topic Analysis) and ASR-Judge description: The quantitative on-topic metric and ASR-Judge results are evaluated on responses generated under the same composite scaffolds and continuation targets. Without a control condition that applies the scaffolds to non-PEO prompts or measures on-topic rates for scaffold-only baselines, it remains unclear whether the reported semantic preservation and non-trivial gains are artifacts of the evaluation setup rather than properties of the optimized embeddings.
minor comments (2)
  1. [§3] The abstract and §3 mention 'nearest-token projection' but do not specify the exact distance metric or projection procedure used to confirm exact string preservation; a short algorithmic description or pseudocode would improve reproducibility.
  2. [Table 1] Table 1 (or equivalent benchmark results table) lacks error bars, number of runs, or statistical significance tests for the reported ASR improvements; adding these would strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify limitations in how we currently isolate the contribution of embedding optimization from the other components of PEO. We address each point below and will revise the manuscript with additional ablations and controls to strengthen attribution.

read point-by-point responses
  1. Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Adaptive Schedule): The reported outperformance of PEO over baselines is measured under the same structured continuation targets and heuristic composite response scaffolds used during optimization. No ablation is presented that holds targets/scaffolds fixed while applying only the embedding perturbation (or that applies identical scaffolds to the discrete-suffix and search-based baselines). This makes it impossible to attribute the ASR gains specifically to the embedding optimization rather than the multi-round schedule or scaffolds, which is load-bearing for the claim that 'the concern [about semantic destruction] is unfounded' and that PEO is a distinct embedding-space attack.

    Authors: We agree that the current experimental design does not fully isolate the embedding optimization. The targets and scaffolds are integral to PEO's multi-round process, and the baselines follow their original formulations without them. In the revision we will add two new sets of experiments: (1) applying the identical structured continuation targets and composite scaffolds to the discrete-suffix and search-based baselines, and (2) an ablation of PEO that disables the adaptive schedule while retaining only the embedding optimization. These results will be reported alongside the existing comparisons to clarify the specific contribution of continuous embedding perturbation. revision: yes

  2. Referee: [§4.3] §4.3 (On-topic Analysis) and ASR-Judge description: The quantitative on-topic metric and ASR-Judge results are evaluated on responses generated under the same composite scaffolds and continuation targets. Without a control condition that applies the scaffolds to non-PEO prompts or measures on-topic rates for scaffold-only baselines, it remains unclear whether the reported semantic preservation and non-trivial gains are artifacts of the evaluation setup rather than properties of the optimized embeddings.

    Authors: We accept this critique. The on-topic and ASR-Judge metrics currently lack explicit scaffold-only controls. We will add, in the revised §4.3, evaluations that apply the same composite scaffolds to the original (non-optimized) prompts and to the baseline methods, reporting both on-topic rates and ASR-Judge scores for these conditions. This will demonstrate that the high on-topic rates arise from the optimized embeddings remaining sufficiently close to the originals for exact nearest-token projection, rather than from the scaffolds alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmarks

full rationale

The paper proposes an empirical jailbreak technique (PEO) that optimizes prompt embeddings while using continuation targets and an adaptive schedule, then validates it via direct comparisons to prior white-box attacks on standard benchmarks. No equations, derivations, or self-referential predictions appear in the abstract or description. Performance claims rest on experimental results rather than any reduction of outputs to fitted inputs or self-citations by construction. The work is validated against external baselines and is self-contained, consistent with a circularity score of 2.0 (minor or absent circularity).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical contribution relying on standard gradient-based optimization and LLM evaluation practices; no free parameters, axioms, or invented entities are introduced or required beyond those in the referenced prior attacks.

pith-pipeline@v0.9.0 · 5516 in / 1246 out tokens · 31772 ms · 2026-05-08T03:24:29.751657+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 15 canonical work pages · 9 internal anchors

  1. [1]

    Alzantot, M., Sharma, Y., Elgohary, A., Ho, B.J., Srivastava, M., Chang, K.W., 2018. Generating natural language adversarial examples, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 2890–2896

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al., 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073

  3. [3]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models, in: Advances in Neural Information Processing Systems

    Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramèr, F., Hassani, H., Wong, E., 2024. Jailbreakbench: An open robustness benchmark for jailbreaking large language models, in: Advances in Neural Information Processing Systems

  4. [4]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., Wong, E., 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419

  5. [5]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/

    Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P., 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/

  6. [6]

    Ebrahimi, J., Rao, A., Lowd, D., Dou, D., 2018. Hotflip: White-box adversarial examples for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics. pp. 31–36

  7. [7]

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.,

  8. [8]

    The Llama 3 Herd of Models

    The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  9. [9]

    nanoGCG. https://github.com/GraySwanAI/nanoGCG

    Gray Swan AI, 2024. nanoGCG. https://github.com/GraySwanAI/nanoGCG

  10. [10]

    Iyyer, M., Wieting, J., Gimpel, K., Zettlemoyer, L., 2018. Adversarial example generation with syntactically controlled paraphrase networks, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics. pp. 1875–1885

  11. [11]

    arXiv preprint arXiv:2501.18280

    Liang, H., Sun, Y., Cai, Y., Zhu, J., Zhang, B., 2025. Jailbreaking LLMs' safeguard with universal magic words for text embedding models. arXiv preprint arXiv:2501.18280

  12. [12]

    Jailjudge: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework. arXiv preprint arXiv:2410.12855

    Liu, F., Feng, Y., Xu, Z., Su, L., Ma, X., Yin, D., Liu, H., 2024a. Jailjudge: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework. arXiv preprint arXiv:2410.12855

  13. [13]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models, in: The Twelfth International Conference on Learning Representations

    Liu, X., Xu, N., Chen, M., Xiao, C., 2024b. Autodan: Generating stealthy jailbreak prompts on aligned large language models, in: The Twelfth International Conference on Learning Representations

  14. [14]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., Hendrycks, D., 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249

  15. [15]

    Tree of attacks: Jailbreaking black-box llms automatically, in: Advances in Neural Information Processing Systems

    Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A., 2024. Tree of attacks: Jailbreaking black-box llms automatically, in: Advances in Neural Information Processing Systems

  16. [16]

    Training language models to follow instructions with human feedback, in: Advances in Neural Information Processing Systems, Curran Associates

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al., 2022. Training language models to follow instructions with human feedback, in: Advances in Neural Information Processing Systems, Curran Associates. pp. 27730–27744

  17. [17]

    arXiv preprint arXiv:2412.03876

    Peng, J., Tang, Z., Liu, G., Fleming, C., Hong, M., 2024. Safeguarding text-to-image generation via inference-time prompt-noise optimization. arXiv preprint arXiv:2412.03876

  18. [18]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D., Finn, C., 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290

  19. [19]

    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    Robey, A., Wong, E., Hassani, H., Pappas, G.J., 2024. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684

  20. [20]

    Fast adversarial attacks on language models in one gpu minute, in: Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR

    Sadasivan, V.S., Saha, S., Sriramanan, G., Kattakinda, P., Chegini, A., Feizi, S., 2024. Fast adversarial attacks on language models in one gpu minute, in: Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR

  21. [21]

    Rainbow teaming: Open-ended generation of diverse adversarial prompts

    Samvelyan, M., Raparthy, S.C., Lupu, A., Hambro, E., Markosyan, A.H., Bhatt, M., Tian, Y., Jiang, E., Raileanu, R., Rocktäschel, T., Whiteson, S., 2024. Rainbow teaming: Open-ended generation of diverse adversarial prompts. arXiv preprint arXiv:2402.16822

  22. [22]

    Schwinn, L., Dobre, D., Xhonneux, S., Gidel, G., Günnemann, S.,

  23. [23]

    Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space, in: Advances in Neural Information Processing Systems

  24. [24]

    Shin, T., Razeghi, Y., Logan IV, R.L., Wallace, E., Singh, S., 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics. pp. 4222–4235

  25. [25]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al., 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  26. [26]

    Adversarial preference learning for robust llm alignment, in: Findings of the Association for Computational Linguistics: ACL 2025, Association for Computational Linguistics

    Wang, Y., Wang, P., Xi, C., Tang, B., Zhu, J., Wei, W., Chen, C., Yang, C., Zhang, J., Lu, C., Niu, Y., Mao, K., Li, Z., Xiong, F., Hu, J., Yang, M., 2025. Adversarial preference learning for robust llm alignment, in: Findings of the Association for Computational Linguistics: ACL 2025, Association for Computational Linguistics

  27. [27]

    Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery

    Wen, Y., Jain, N., Kirchenbauer, J., Goldblum, M., Geiping, J., Goldstein, T., 2023. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668

  28. [28]

    Continuous embedding attacks via clipped inputs in jailbreaking large language models

    Xu, Z., Liu, Y., Deng, G., Wang, K., Li, Y., Shi, L., Picek, S., 2024. Continuous embedding attacks via clipped inputs in jailbreaking large language models. arXiv preprint arXiv:2407.13796

  29. [29]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al., 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  30. [30]

    Advprefix: An objective for nuanced llm jailbreaks, in: Advances in Neural Information Processing Systems, Curran Associates

    Zhu, S., Amos, B., Tian, Y., Guo, C., Evtimov, I., 2025. Advprefix: An objective for nuanced llm jailbreaks, in: Advances in Neural Information Processing Systems, Curran Associates

  31. [31]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M., 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043

  32. [32]

    JAILJUDGE [11] goes further and proposes a multi-agent evaluation framework with belief-fusion rather than a single binary judge

    uses a single fine-tuned Llama-2-13B classifier, and JailbreakBench [3] likewise chooses a single default judge, Llama-3-70B, specifically because of its strong agreement with experts and relatively low false-positive rate. JAILJUDGE [11] goes further and proposes a multi-agent evaluation framework with belief-fusion rather than a single binary judge. None of these t...