pith. machine review for the scientific record.

arxiv: 2605.08898 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: no theorem link

LLM-Agnostic Semantic Representation Attack

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:09 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords adversarial attacks · large language models · semantic representations · prompt optimization · LLM alignment · black-box transfer · coherence relationship

The pith

Optimizing adversarial prompts for malicious semantic meaning rather than exact affirmative phrases guarantees convergence and cross-model transfer in LLM attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shifts the goal of adversarial prompt crafting from forcing specific response templates to reaching a target malicious semantic representation. It proves a coherence-convergence relationship showing that preserving semantic coherence during optimization produces reliable white-box success and black-box transfer across models. This approach avoids the convergence failures and unnatural outputs common in token-level methods that chase exact strings. The result is demonstrated by a search algorithm that reaches a 99.71% average success rate on 26 open-source LLMs while keeping prompts interpretable.

Core claim

The method targets malicious semantic representations instead of exact textual templates and enforces semantic coherence during discrete token optimization. On that basis it establishes that coherence guarantees both white-box semantic convergence and black-box transferability, operationalized through incremental chunk expansion to reach a 99.71% average attack success rate across 26 LLMs.
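To make the reconceptualization concrete, here is a minimal sketch (not the paper's implementation) contrasting the two objectives: a token-level loss that rewards an exact affirmative prefix versus a semantic loss that rewards proximity to a target representation. The embedding model and target text are illustrative assumptions.

```python
# Sketch: exact-template targeting vs. semantic-representation targeting.
# The sentence-transformers model choice is an assumption, not the paper's.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def template_loss(response: str, template: str = "Sure, here is") -> float:
    # Token-level paradigms chase an exact affirmative string: all or nothing.
    return 0.0 if response.startswith(template) else 1.0

def semantic_loss(response: str, target: str) -> float:
    # Semantic targeting scores distance to a target meaning, so any phrasing
    # carrying the same semantics counts, which widens the search space.
    r, t = embedder.encode([response, target])
    cos = float(np.dot(r, t) / (np.linalg.norm(r) * np.linalg.norm(t)))
    return 1.0 - cos  # lower means semantically closer
```

Under the second loss, many distinct surface strings reach the same target region, which is what the paper argues rescues convergence.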

What carries the argument

The Semantic Representation Heuristic Search (SRHS) algorithm, which expands adversarial prompts by adding coherent token chunks while preserving overall semantic meaning and structural interpretability.
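For orientation, a minimal sketch of what incremental chunk expansion under a coherence constraint could look like; `propose_chunks`, `coherence`, and `semantic_gain` are hypothetical stand-ins, not the paper's SRHS internals.

```python
def srhs_style_search(prompt: str, budget: int = 50, tau: float = 0.8) -> str:
    """Greedy coherence-constrained expansion (illustrative only).

    propose_chunks(prompt) -> candidate token chunks to append (hypothetical)
    coherence(text)        -> coherence score in [0, 1] (hypothetical)
    semantic_gain(text)    -> closeness to the target representation (hypothetical)
    """
    for _ in range(budget):
        candidates = [prompt + " " + c for c in propose_chunks(prompt)]
        # Keep only expansions that stay inside the coherence neighborhood;
        # incoherent jumps are rejected rather than silently accepted.
        admissible = [c for c in candidates if coherence(c) >= tau]
        if not admissible:
            break  # no coherent expansion available
        best = max(admissible, key=semantic_gain)
        if semantic_gain(best) <= semantic_gain(prompt):
            break  # converged: no chunk improves the semantic objective
        prompt = best
    return prompt
```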

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety techniques would need to move from surface-phrase detection to semantic intent monitoring to remain effective.
  • The same coherence principle could be tested for improving non-adversarial prompt engineering tasks such as instruction following.
  • Closed-source models might be evaluated by measuring how well open-model attack prompts transfer when semantic targets are used.

Load-bearing premise

Semantic coherence can be reliably detected and maintained while optimizing prompts in discrete token space, and doing so produces both convergence and transfer.
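The premise is testable in isolation. One standard proxy for coherence in discrete token space is perplexity under a reference language model; this is an assumption about how the check might be operationalized, not necessarily the paper's metric.

```python
# Perplexity-based coherence check (an illustrative proxy, not the paper's).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss  # mean per-token negative log-likelihood
    return float(torch.exp(loss))

def is_coherent(text: str, tau: float = 50.0) -> bool:
    # tau is an illustrative threshold and would need calibration per model.
    return perplexity(text) < tau
```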

What would settle it

Applying the coherence-preserving search to a held-out collection of LLMs and finding that high measured semantic coherence does not produce high attack success or transfer would falsify the claimed relationship.
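That falsification test amounts to a correlation study on held-out models; a minimal harness might look like the following, where `run_attack`, `judge_success`, and `coherence` are hypothetical stand-ins.

```python
# Illustrative falsification harness: does measured coherence predict
# transfer to held-out models? All three helpers are hypothetical.
from statistics import correlation  # Python 3.10+

def falsification_study(prompts, held_out_models):
    coherences, success_rates = [], []
    for p in prompts:
        coherences.append(coherence(p))  # e.g., a perplexity-based proxy
        hits = [judge_success(m, run_attack(m, p)) for m in held_out_models]
        success_rates.append(sum(hits) / len(hits))
    # The claimed relationship predicts a strong positive correlation;
    # high coherence paired with near-zero transfer would falsify it.
    return correlation(coherences, success_rates)
```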

Figures

Figures reproduced from arXiv: 2605.08898 by Jianhong Pan, Jiawei Lian, Lap-Pui Chau, Lefan Wang, Shaohui Mei, Tairan Huang, Yi Wang.

Figure 1. An illustrative example of a jailbreak attack against aligned LLMs. A direct malicious request (top) is securely rejected, …
Figure 2. Illustration of the search space in existing token-level attacks. Rigidly optimizing toward a singular, predefined affirmative …
Figure 3. Illustration of vanilla attacks that target textual patterns and our Semantic Representation Attack. Vanilla methods …
Figure 4. Systematic overview of the proposed LLM-Agnostic Semantic Representation Attack framework. The pipeline …
Figure 5. Probability distributions in the Semantic Representation Attack framework. Example query …
Figure 6. Comparison of the adversarial prompt search space distribution trees across different models under a malicious query …
read the original abstract

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting adversarial prompts. Predominant token-level optimization methods primarily rely on optimizing for exact affirmative templates (e.g., "Sure, here is..."). However, these paradigms frequently encounter bottlenecks such as suboptimal convergence, compromised prompt naturalness, and poor cross-model generalization. To address these limitations, we propose Semantic Representation Attack (SRA), a novel LLM-agnostic paradigm that fundamentally reconceptualizes adversarial objectives from exact textual targeting to malicious semantic representations. Theoretically, we establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound, proving that maintaining semantic coherence guarantees both white-box semantic convergence and black-box transferability. Technically, we operationalize this framework via the Semantic Representation Heuristic Search (SRHS) algorithm, which preserves interpretability and structural coherence of the adversarial prompts during incremental discrete token chunk expansion. Extensive evaluations demonstrate that our framework achieves a 99.71% average attack success rate across 26 open-source LLMs, with strong transferability and stealth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Semantic Representation Attack (SRA), a new LLM-agnostic adversarial prompting paradigm that targets malicious semantic representations instead of exact affirmative text templates. It claims to establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound proving that semantic coherence guarantees white-box convergence and black-box transferability. The SRHS algorithm implements this via incremental discrete token chunk expansion to preserve interpretability, and experiments report a 99.71% average attack success rate across 26 open-source LLMs with strong transferability and stealth.

Significance. If the theoretical relationships and bound are rigorously derived without circularity and the empirical results prove robust, the work could meaningfully advance adversarial attack research on aligned LLMs by moving beyond token-level optimization to semantic-level methods that improve naturalness and cross-model generalization.

major comments (3)
  1. [Theoretical claims] Theoretical section on Coherence-Convergence Relationship and Cross-Model Semantic Generalization bound: the manuscript asserts these relationships are established and proven to guarantee convergence and transferability, yet supplies no equations, proof sketches, or derivation steps, making it impossible to determine whether the results are independent of the SRHS heuristics or rest on unstated assumptions.
  2. [SRHS description and bound] SRHS algorithm and generalization bound: the bound derivation likely relies on continuous or Lipschitz-bounded semantic neighborhood properties (standard for such proofs), but SRHS performs discrete token chunk expansions that can produce semantic jumps outside any coherence neighborhood without detection; this gap directly undermines the claimed guarantee that coherence maintenance ensures transferability. A candidate formalization of the neighborhood condition is sketched after these comments.
  3. [Evaluation] Experimental results: the 99.71% average ASR across 26 models is presented without baselines, error bars, statistical tests, or protocol details, so it cannot be assessed whether it validates the theoretical claims or merely reflects heuristic performance.
minor comments (1)
  1. [Abstract] The abstract is dense with new terminology (SRA, SRHS, Coherence-Convergence Relationship) introduced without forward references to where they are formally defined.
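
For concreteness, the neighborhood condition major comment 2 has in mind might be stated as follows; this is a reconstruction of a standard Lipschitz-style assumption, not a statement taken from the paper.

```latex
% Reconstructed assumption (hypothetical): if the semantic map \phi is
% L-Lipschitz with respect to an edit metric d on token sequences, then
% small edits stay inside a coherence neighborhood of radius \varepsilon:
\[
  \|\phi(x') - \phi(x)\| \le L \, d(x, x')
  \quad\Longrightarrow\quad
  d(x, x') \le \varepsilon / L \;\Rightarrow\; \phi(x') \in B_\varepsilon(\phi(x)).
\]
% The referee's objection: a discrete chunk insertion need not keep
% d(x, x') small, so \phi(x') can exit B_\varepsilon(\phi(x)) undetected.
```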

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications where possible and committing to revisions that strengthen the presentation of our theoretical and empirical contributions without altering the core claims.

read point-by-point responses
  1. Referee: Theoretical section on Coherence-Convergence Relationship and Cross-Model Semantic Generalization bound: the manuscript asserts these relationships are established and proven to guarantee convergence and transferability, yet supplies no equations, proof sketches, or derivation steps, making it impossible to determine whether the results are independent of the SRHS heuristics or rest on unstated assumptions.

    Authors: We acknowledge that the current manuscript presents the Coherence-Convergence Relationship and Cross-Model Semantic Generalization bound primarily at a conceptual level without the full set of equations or proof sketches. This omission was intended to maintain readability but has hindered verification. The relationships are derived from properties of semantic embedding spaces and coherence metrics, and are designed to be independent of the specific heuristics in SRHS. In the revised version, we will include the key equations defining semantic coherence, the statement of the bound, the main assumptions (e.g., bounded distances in embedding space), and an outline of the derivation steps to allow readers to assess independence from implementation details. revision: yes

  2. Referee: SRHS algorithm and generalization bound: the bound derivation likely relies on continuous or Lipschitz-bounded semantic neighborhood properties (standard for such proofs), but SRHS performs discrete token chunk expansions that can produce semantic jumps outside any coherence neighborhood without detection; this gap directly undermines the claimed guarantee that coherence maintenance ensures transferability.

    Authors: The referee correctly identifies a potential tension between the continuous assumptions typical of generalization bounds and the discrete token expansions performed by SRHS. We will revise the manuscript to clarify how the incremental chunk expansion is constrained to remain within coherence neighborhoods, for example by introducing an explicit coherence threshold and a mechanism to detect and revert jumps. If the bound requires Lipschitz continuity, we will either extend the theoretical analysis to discrete settings or provide supporting analysis showing that observed expansions satisfy the neighborhood condition in practice. This addresses the gap while preserving the claim that maintained coherence supports transferability. revision: partial

  3. Referee: Experimental results: the 99.71% average ASR across 26 models is presented without baselines, error bars, statistical tests, or protocol details, so it cannot be assessed whether it validates the theoretical claims or merely reflects heuristic performance.

    Authors: We agree that the experimental reporting requires additional rigor to substantiate the theoretical claims. The reported average ASR will be supplemented in the revision with comparisons against established baselines (including token-level methods such as GCG), error bars derived from multiple independent runs, appropriate statistical tests for significance, and a complete experimental protocol detailing model versions, evaluation criteria, and hyperparameter settings. These additions will enable clearer assessment of whether the results align with the predicted benefits of semantic coherence for convergence and transfer. revision: yes
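
As a concrete instance of the statistical reporting committed to in the third response, each per-model attack success rate could carry a Wilson confidence interval; this illustrates standard practice rather than the paper's actual protocol.

```python
# Wilson 95% interval for a per-model attack success rate (illustrative).
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# Example with hypothetical counts: 519/520 successes gives roughly
# (0.989, 1.000), so a figure like "99.71% average" is only interpretable
# alongside per-model trial counts.
```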

Circularity Check

0 steps flagged

No significant circularity in theoretical derivation

full rationale

The paper first states the establishment of the semantic Coherence-Convergence Relationship and derivation of the Cross-Model Semantic Generalization bound as independent theoretical contributions that prove guarantees for convergence and transferability. These precede the description of the SRHS algorithm as an operationalization in discrete token space. No load-bearing step reduces the bound or relationship to a fitted parameter, self-citation chain, or definitional tautology; the reported ASR is presented as empirical outcome rather than a constructed prediction. The derivation chain is therefore self-contained against the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on newly introduced theoretical relationships and a search algorithm whose correctness is asserted rather than derived from prior external results.

axioms (1)
  • ad hoc to paper Semantic Coherence-Convergence Relationship
    Paper establishes this to connect prompt semantic coherence to white-box convergence and black-box transferability.
invented entities (2)
  • Semantic Representation Attack (SRA) no independent evidence
    purpose: Reconceptualize adversarial objectives from exact text to malicious semantic representations.
    Core new framework proposed in the work.
  • Semantic Representation Heuristic Search (SRHS) no independent evidence
    purpose: Preserve interpretability and structural coherence during incremental discrete token chunk expansion.
    New algorithm introduced to operationalize SRA.

pith-pipeline@v0.9.0 · 5511 in / 1471 out tokens · 73807 ms · 2026-05-12T01:09:39.613498+00:00 · methodology

