pith. machine review for the scientific record.

arxiv: 2605.08898 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: no theorem link

LLM-Agnostic Semantic Representation Attack

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:09 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords adversarial attacks · large language models · semantic representations · prompt optimization · LLM alignment · black-box transfer · coherence relationship

The pith

Optimizing adversarial prompts for malicious semantic meaning rather than exact affirmative phrases guarantees convergence and cross-model transfer in LLM attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shifts the goal of adversarial prompt crafting from forcing specific response templates to reaching a target malicious semantic representation. It proves a coherence-convergence relationship showing that preserving semantic coherence during optimization produces reliable white-box success and black-box transfer across models. This approach avoids the convergence failures and unnatural outputs common in token-level methods that chase exact strings. The result is demonstrated by a search algorithm that reaches a 99.71% average success rate on 26 open-source LLMs while keeping prompts interpretable.

Core claim

The method targets malicious semantic representations instead of exact textual templates and enforces semantic coherence during discrete token optimization. On that basis it establishes that coherence guarantees both white-box semantic convergence and black-box transferability, operationalized through incremental chunk expansion to reach a 99.71% average attack success rate across 26 LLMs.
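To make the reconceptualization concrete, here is a minimal sketch (not the paper's implementation) contrasting the two objectives: a token-level loss that rewards an exact affirmative prefix versus a semantic loss that rewards proximity to a target representation. The embedding model and target text are illustrative assumptions.

```python
# Sketch: exact-template targeting vs. semantic-representation targeting.
# The sentence-transformers model choice is an assumption, not the paper's.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def template_loss(response: str, template: str = "Sure, here is") -> float:
    # Token-level paradigms chase an exact affirmative string: all or nothing.
    return 0.0 if response.startswith(template) else 1.0

def semantic_loss(response: str, target: str) -> float:
    # Semantic targeting scores distance to a target meaning, so any phrasing
    # carrying the same semantics counts, which widens the search space.
    r, t = embedder.encode([response, target])
    cos = float(np.dot(r, t) / (np.linalg.norm(r) * np.linalg.norm(t)))
    return 1.0 - cos  # lower means semantically closer
```

Under the second loss, many distinct surface strings reach the same target region, which is what the paper argues rescues convergence.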

What carries the argument

The Semantic Representation Heuristic Search (SRHS) algorithm, which expands adversarial prompts by adding coherent token chunks while preserving overall semantic meaning and structural interpretability.
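For orientation, a minimal sketch of what incremental chunk expansion under a coherence constraint could look like; `propose_chunks`, `coherence`, and `semantic_gain` are hypothetical stand-ins, not the paper's SRHS internals.

```python
def srhs_style_search(prompt: str, budget: int = 50, tau: float = 0.8) -> str:
    """Greedy coherence-constrained expansion (illustrative only).

    propose_chunks(prompt) -> candidate token chunks to append (hypothetical)
    coherence(text)        -> coherence score in [0, 1] (hypothetical)
    semantic_gain(text)    -> closeness to the target representation (hypothetical)
    """
    for _ in range(budget):
        candidates = [prompt + " " + c for c in propose_chunks(prompt)]
        # Keep only expansions that stay inside the coherence neighborhood;
        # incoherent jumps are rejected rather than silently accepted.
        admissible = [c for c in candidates if coherence(c) >= tau]
        if not admissible:
            break  # no coherent expansion available
        best = max(admissible, key=semantic_gain)
        if semantic_gain(best) <= semantic_gain(prompt):
            break  # converged: no chunk improves the semantic objective
        prompt = best
    return prompt
```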

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety techniques would need to move from surface-phrase detection to semantic intent monitoring to remain effective.
  • The same coherence principle could be tested for improving non-adversarial prompt engineering tasks such as instruction following.
  • Closed-source models might be evaluated by measuring how well open-model attack prompts transfer when semantic targets are used.

Load-bearing premise

Semantic coherence can be reliably detected and maintained while optimizing prompts in discrete token space, and doing so produces both convergence and transfer.
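The premise is testable in isolation. One standard proxy for coherence in discrete token space is perplexity under a reference language model; this is an assumption about how the check might be operationalized, not necessarily the paper's metric.

```python
# Perplexity-based coherence check (an illustrative proxy, not the paper's).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss  # mean per-token negative log-likelihood
    return float(torch.exp(loss))

def is_coherent(text: str, tau: float = 50.0) -> bool:
    # tau is an illustrative threshold and would need calibration per model.
    return perplexity(text) < tau
```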

What would settle it

Applying the coherence-preserving search to a held-out collection of LLMs and finding that high measured semantic coherence does not produce high attack success or transfer would falsify the claimed relationship.
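That falsification test amounts to a correlation study on held-out models; a minimal harness might look like the following, where `run_attack`, `judge_success`, and `coherence` are hypothetical stand-ins.

```python
# Illustrative falsification harness: does measured coherence predict
# transfer to held-out models? All three helpers are hypothetical.
from statistics import correlation  # Python 3.10+

def falsification_study(prompts, held_out_models):
    coherences, success_rates = [], []
    for p in prompts:
        coherences.append(coherence(p))  # e.g., a perplexity-based proxy
        hits = [judge_success(m, run_attack(m, p)) for m in held_out_models]
        success_rates.append(sum(hits) / len(hits))
    # The claimed relationship predicts a strong positive correlation;
    # high coherence paired with near-zero transfer would falsify it.
    return correlation(coherences, success_rates)
```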

Figures

Figures reproduced from arXiv: 2605.08898 by Jianhong Pan, Jiawei Lian, Lap-Pui Chau, Lefan Wang, Shaohui Mei, Tairan Huang, Yi Wang.

Figure 1. An illustrative example of a jailbreak attack against aligned LLMs. A direct malicious request (top) is securely rejected, …
Figure 2. Illustration of the search space in existing token-level attacks. Rigidly optimizing toward a singular, predefined affirmative …
Figure 3. Illustration of vanilla attacks that target textual patterns and our Semantic Representation Attack. Vanilla methods …
Figure 4. Systematic overview of the proposed LLM-Agnostic Semantic Representation Attack framework. The pipeline …
Figure 5. Probability distributions in the Semantic Representation Attack framework. Example query …
Figure 6. Comparison of the adversarial prompt search space distribution trees across different models under a malicious query …
read the original abstract

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting adversarial prompts. Predominant token-level optimization methods primarily rely on optimizing for exact affirmative templates (e.g., "Sure, here is..."). However, these paradigms frequently encounter bottlenecks such as suboptimal convergence, compromised prompt naturalness, and poor cross-model generalization. To address these limitations, we propose Semantic Representation Attack (SRA), a novel LLM-agnostic paradigm that fundamentally reconceptualizes adversarial objectives from exact textual targeting to malicious semantic representations. Theoretically, we establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound, proving that maintaining semantic coherence guarantees both white-box semantic convergence and black-box transferability. Technically, we operationalize this framework via the Semantic Representation Heuristic Search (SRHS) algorithm, which preserves interpretability and structural coherence of the adversarial prompts during incremental discrete token chunk expansion. Extensive evaluations demonstrate that our framework achieves a 99.71% average attack success rate across 26 open-source LLMs, with strong transferability and stealth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Semantic Representation Attack (SRA), a new LLM-agnostic adversarial prompting paradigm that targets malicious semantic representations instead of exact affirmative text templates. It claims to establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound proving that semantic coherence guarantees white-box convergence and black-box transferability. The SRHS algorithm implements this via incremental discrete token chunk expansion to preserve interpretability, and experiments report a 99.71% average attack success rate across 26 open-source LLMs with strong transferability and stealth.

Significance. If the theoretical relationships and bound are rigorously derived without circularity and the empirical results prove robust, the work could meaningfully advance adversarial attack research on aligned LLMs by moving beyond token-level optimization to semantic-level methods that improve naturalness and cross-model generalization.

major comments (3)
  1. [Theoretical claims] Theoretical section on Coherence-Convergence Relationship and Cross-Model Semantic Generalization bound: the manuscript asserts these relationships are established and proven to guarantee convergence and transferability, yet supplies no equations, proof sketches, or derivation steps, making it impossible to determine whether the results are independent of the SRHS heuristics or rest on unstated assumptions.
  2. [SRHS description and bound] SRHS algorithm and generalization bound: the bound derivation likely relies on continuous or Lipschitz-bounded semantic neighborhood properties (standard for such proofs), but SRHS performs discrete token chunk expansions that can produce semantic jumps outside any coherence neighborhood without detection; this gap directly undermines the claimed guarantee that coherence maintenance ensures transferability. A candidate formalization of the neighborhood condition is sketched after these comments.
  3. [Evaluation] Experimental results: the 99.71% average ASR across 26 models is presented without baselines, error bars, statistical tests, or protocol details, so it cannot be assessed whether it validates the theoretical claims or merely reflects heuristic performance.
minor comments (1)
  1. [Abstract] The abstract is dense with new terminology (SRA, SRHS, Coherence-Convergence Relationship) introduced without forward references to where they are formally defined.
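
For concreteness, the neighborhood condition major comment 2 has in mind might be stated as follows; this is a reconstruction of a standard Lipschitz-style assumption, not a statement taken from the paper.

```latex
% Reconstructed assumption (hypothetical): if the semantic map \phi is
% L-Lipschitz with respect to an edit metric d on token sequences, then
% small edits stay inside a coherence neighborhood of radius \varepsilon:
\[
  \|\phi(x') - \phi(x)\| \le L \, d(x, x')
  \quad\Longrightarrow\quad
  d(x, x') \le \varepsilon / L \;\Rightarrow\; \phi(x') \in B_\varepsilon(\phi(x)).
\]
% The referee's objection: a discrete chunk insertion need not keep
% d(x, x') small, so \phi(x') can exit B_\varepsilon(\phi(x)) undetected.
```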

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications where possible and committing to revisions that strengthen the presentation of our theoretical and empirical contributions without altering the core claims.

read point-by-point responses
  1. Referee: Theoretical section on Coherence-Convergence Relationship and Cross-Model Semantic Generalization bound: the manuscript asserts these relationships are established and proven to guarantee convergence and transferability, yet supplies no equations, proof sketches, or derivation steps, making it impossible to determine whether the results are independent of the SRHS heuristics or rest on unstated assumptions.

    Authors: We acknowledge that the current manuscript presents the Coherence-Convergence Relationship and Cross-Model Semantic Generalization bound primarily at a conceptual level without the full set of equations or proof sketches. This omission was intended to maintain readability but has hindered verification. The relationships are derived from properties of semantic embedding spaces and coherence metrics, and are designed to be independent of the specific heuristics in SRHS. In the revised version, we will include the key equations defining semantic coherence, the statement of the bound, the main assumptions (e.g., bounded distances in embedding space), and an outline of the derivation steps to allow readers to assess independence from implementation details. revision: yes

  2. Referee: SRHS algorithm and generalization bound: the bound derivation likely relies on continuous or Lipschitz-bounded semantic neighborhood properties (standard for such proofs), but SRHS performs discrete token chunk expansions that can produce semantic jumps outside any coherence neighborhood without detection; this gap directly undermines the claimed guarantee that coherence maintenance ensures transferability.

    Authors: The referee correctly identifies a potential tension between the continuous assumptions typical of generalization bounds and the discrete token expansions performed by SRHS. We will revise the manuscript to clarify how the incremental chunk expansion is constrained to remain within coherence neighborhoods, for example by introducing an explicit coherence threshold and a mechanism to detect and revert jumps. If the bound requires Lipschitz continuity, we will either extend the theoretical analysis to discrete settings or provide supporting analysis showing that observed expansions satisfy the neighborhood condition in practice. This addresses the gap while preserving the claim that maintained coherence supports transferability. revision: partial

  3. Referee: Experimental results: the 99.71% average ASR across 26 models is presented without baselines, error bars, statistical tests, or protocol details, so it cannot be assessed whether it validates the theoretical claims or merely reflects heuristic performance.

    Authors: We agree that the experimental reporting requires additional rigor to substantiate the theoretical claims. The reported average ASR will be supplemented in the revision with comparisons against established baselines (including token-level methods such as GCG), error bars derived from multiple independent runs, appropriate statistical tests for significance, and a complete experimental protocol detailing model versions, evaluation criteria, and hyperparameter settings. These additions will enable clearer assessment of whether the results align with the predicted benefits of semantic coherence for convergence and transfer. revision: yes
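
As a concrete instance of the statistical reporting committed to in the third response, each per-model attack success rate could carry a Wilson confidence interval; this illustrates standard practice rather than the paper's actual protocol.

```python
# Wilson 95% interval for a per-model attack success rate (illustrative).
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# Example with hypothetical counts: 519/520 successes gives roughly
# (0.989, 1.000), so a figure like "99.71% average" is only interpretable
# alongside per-model trial counts.
```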

Circularity Check

0 steps flagged

No significant circularity in theoretical derivation

full rationale

The paper first states the establishment of the semantic Coherence-Convergence Relationship and derivation of the Cross-Model Semantic Generalization bound as independent theoretical contributions that prove guarantees for convergence and transferability. These precede the description of the SRHS algorithm as an operationalization in discrete token space. No load-bearing step reduces the bound or relationship to a fitted parameter, self-citation chain, or definitional tautology; the reported ASR is presented as empirical outcome rather than a constructed prediction. The derivation chain is therefore self-contained against the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on newly introduced theoretical relationships and a search algorithm whose correctness is asserted rather than derived from prior external results.

axioms (1)
  • ad hoc to paper Semantic Coherence-Convergence Relationship
    Paper establishes this to connect prompt semantic coherence to white-box convergence and black-box transferability.
invented entities (2)
  • Semantic Representation Attack (SRA) no independent evidence
    purpose: Reconceptualize adversarial objectives from exact text to malicious semantic representations.
    Core new framework proposed in the work.
  • Semantic Representation Heuristic Search (SRHS) no independent evidence
    purpose: Preserve interpretability and structural coherence during incremental discrete token chunk expansion.
    New algorithm introduced to operationalize SRA.

pith-pipeline@v0.9.0 · 5511 in / 1471 out tokens · 73807 ms · 2026-05-12T01:09:39.613498+00:00 · methodology

