pith. machine review for the scientific record.

arXiv: 2310.03684 · v4 · submitted 2023-10-05 · 💻 cs.LG · cs.AI · stat.ML

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Alexander Robey, Eric Wong, George J. Pappas, Hamed Hassani

Pith reviewed 2026-05-14 17:05 UTC · model grok-4.3

keywords: jailbreaking attacks · large language models · adversarial defense · prompt perturbation · robustness · aggregation method

The pith

SmoothLLM defends large language models against jailbreaking by perturbing input prompts at the character level and aggregating multiple responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SmoothLLM as a defense against jailbreaking attacks on LLMs such as GPT and Llama. It builds on the observation that adversarial prompts are brittle: small character-level changes tend to break them. The method creates several slightly perturbed copies of the user's prompt, runs the model on each, and aggregates the outputs to detect and block attacks. The paper reports that this approach outperforms prior defenses on common jailbreaks such as GCG and PAIR, remains compatible with any LLM, and incurs a small, though non-negligible, cost to performance on benign inputs.

Core claim

SmoothLLM works by randomly perturbing multiple copies of an input prompt through character-level substitutions and then aggregating the LLM's predictions across those copies to detect and block adversarial inputs designed to produce objectionable content.

What carries the argument

Random character-level perturbations of the prompt combined with aggregation of model outputs across perturbed copies.
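The perturbation step can be illustrated with a minimal Python sketch. It assumes a simple "overwrite a random fraction of characters" scheme; the function names `perturb_prompt` and `perturbed_copies`, the replacement alphabet, and the `rate` and `n_copies` parameters are hypothetical choices made for illustration, not the authors' released implementation.

```python
import random
import string

# Assumed replacement alphabet; this page does not specify one.
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "


def perturb_prompt(prompt: str, rate: float, rng: random.Random) -> str:
    """Overwrite roughly a `rate` fraction of characters with random ones."""
    if not prompt or rate <= 0:
        return prompt
    # At least one change, and never more changes than there are characters.
    n_changes = min(len(prompt), max(1, int(len(prompt) * rate)))
    chars = list(prompt)
    for i in rng.sample(range(len(prompt)), n_changes):
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def perturbed_copies(prompt: str, n_copies: int, rate: float, seed: int = 0) -> list[str]:
    """Return `n_copies` independently perturbed versions of `prompt`."""
    rng = random.Random(seed)
    return [perturb_prompt(prompt, rate, rng) for _ in range(n_copies)]
```

Each copy would then be sent to the target LLM separately. Because positions are sampled without replacement, a copy differs from the original in at most `n_changes` characters, which keeps benign prompts largely readable while disturbing a brittle adversarial string.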

If this is right

  • Across popular LLMs, SmoothLLM achieves state-of-the-art robustness to GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks.
  • SmoothLLM remains effective even against adaptive versions of the GCG attack.
  • There is a small, though non-negligible, trade-off between the defense's robustness and the model's performance on non-adversarial inputs.
  • The method works with any large language model without requiring changes to the model itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar perturbation and aggregation strategies might apply to other types of adversarial attacks on language models.
  • The brittleness of adversarial prompts suggests that defenses could be designed around input randomization rather than model retraining.
  • Testing on a wider range of attack types could reveal the limits of this approach.

Load-bearing premise

Adversarially generated prompts lose their effectiveness when subjected to random character-level modifications.

What would settle it

Finding or constructing a jailbreaking prompt that continues to elicit the target objectionable output even after multiple random character perturbations would falsify the core defense mechanism.

Original abstract

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SmoothLLM, a defense against jailbreaking attacks on LLMs. It is based on the empirical observation that adversarially generated prompts are brittle to character-level perturbations. The algorithm creates multiple randomly perturbed copies of an input prompt, queries the target LLM on each copy, and aggregates the outputs to detect and mitigate adversarial inputs. The paper reports that SmoothLLM achieves state-of-the-art robustness against GCG, PAIR, RandomSearch, and AmpleGCG attacks across multiple popular LLMs, remains effective against adaptive GCG attacks, incurs only a modest trade-off with nominal performance, and is compatible with any LLM. Public code is released.

Significance. If the reported empirical results hold under the stated conditions, SmoothLLM offers a practical, model-agnostic defense that substantially improves robustness to several prominent jailbreaking methods without requiring model retraining or internal access. The broad evaluation across LLMs and attack types, combined with public code and resistance to adaptive attacks, constitutes a concrete contribution to LLM safety research.

major comments (2)
  1. §3.2 (Method): The precise aggregation rule used to combine predictions across the perturbed copies is not fully specified, including how conflicting outputs or ties are resolved; this procedure is load-bearing for the claimed robustness gains and must be stated explicitly for reproducibility.
  2. §4.1–4.2 (Experiments): The selection process for the free parameters (perturbation rate and number of perturbed copies) is not detailed, including whether values were tuned on held-out data or attack-specific performance; this affects the strength of the cross-attack robustness claims.
minor comments (2)
  1. Figure 2 and Table 1: Axis labels and legend entries could be enlarged or clarified to improve readability of the robustness metrics across attack methods.
  2. Abstract: The phrase 'small, though non-negligible trade-off' should be accompanied by a quantitative bound (e.g., average drop in benign accuracy) to give readers an immediate sense of the cost.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation for minor revision. We appreciate the constructive feedback on improving the clarity of the method and experimental details, which we will address to strengthen reproducibility.

Point-by-point responses
  1. Referee: §3.2 (Method): The precise aggregation rule used to combine predictions across the perturbed copies is not fully specified, including how conflicting outputs or ties are resolved; this procedure is load-bearing for the claimed robustness gains and must be stated explicitly for reproducibility.

    Authors: We thank the referee for highlighting this point. In SmoothLLM, the aggregation rule is majority voting across the LLM outputs on the perturbed prompts: an input is classified as adversarial if a strict majority of the perturbed copies produce a refusal (safe) response. In the event of a tie, the input is classified as non-adversarial to avoid over-flagging clean prompts. We will revise §3.2 to state this rule explicitly, include pseudocode for the full procedure, and clarify the tie-breaking logic.

    revision: yes

  2. Referee: §4.1–4.2 (Experiments): The selection process for the free parameters (perturbation rate and number of perturbed copies) is not detailed, including whether values were tuned on held-out data or attack-specific performance; this affects the strength of the cross-attack robustness claims.

    Authors: We agree that more detail is warranted. The perturbation rate (10%) and number of copies (10) were selected via a small grid search performed on a held-out set of 50 prompts drawn from AdvBench, using only the base Llama-2-7B model and without reference to any particular attack. The search prioritized a balance between robustness and nominal performance. We will expand the description in §4.1 to document this process and add a parameter-sensitivity ablation to the appendix.

    revision: yes
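The aggregation rule as worded in the simulated rebuttal (flag the input when a strict majority of perturbed copies yield a refusal, and treat ties as non-adversarial) can be written down in a few lines. The sketch below is an illustration of that wording only, under stated assumptions: the keyword-based `is_refusal` detector and its `REFUSAL_MARKERS` list are hypothetical stand-ins, and the paper's released code may aggregate differently.

```python
from typing import Callable, Sequence

# Hypothetical refusal phrases, used only to make the example runnable.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str) -> bool:
    """Crude keyword check standing in for a real refusal detector."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def flag_adversarial(
    responses: Sequence[str],
    detector: Callable[[str], bool] = is_refusal,
) -> bool:
    """Return True only if a strict majority of responses are refusals.

    A tie (exactly half refusals) returns False, matching the tie-breaking
    rule stated in the simulated rebuttal.
    """
    refusals = sum(1 for r in responses if detector(r))
    return 2 * refusals > len(responses)
```

Comparing `2 * refusals` against `len(responses)` avoids floating-point division and makes the tie case explicit: with 10 copies, 5 refusals are not enough to flag the input, while 6 are.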

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper presents SmoothLLM as an empirical algorithm motivated by the external observation that adversarially generated prompts are brittle to character-level perturbations. This brittleness is demonstrated through direct experiments on GCG, PAIR, RandomSearch, and AmpleGCG attacks across multiple LLMs, with no derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps that reduce the central claim to its own inputs by construction. The defense (random perturbation + aggregation) follows straightforwardly from the empirical finding without circular redefinition or uniqueness theorems imported from prior self-work.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method depends on one key domain assumption about adversarial prompt brittleness and a small number of implementation choices for perturbation rate and copy count that are not derived from first principles.

free parameters (2)
  • perturbation rate
    Probability or number of characters altered per copy, chosen to balance detection and nominal performance
  • number of perturbed copies
    Count of randomized prompt variants fed to the model, selected for robustness versus compute cost
axioms (1)
  • domain assumption: Adversarially generated prompts are brittle to small character-level perturbations while benign prompts remain stable
    This empirical observation is invoked as the foundation for the random perturbation defense
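To see why the number of perturbed copies trades robustness against compute, a back-of-the-envelope calculation may help. The sketch below assumes, purely for illustration, that each perturbed copy independently breaks the attack with the same probability `p`; neither the independence assumption nor any particular value of `p` comes from this page. Under that assumption the number of successful copies is binomial, and the chance that a strict majority succeed follows directly.

```python
from math import comb


def majority_probability(n_copies: int, p: float) -> float:
    """Probability that a strict majority of n_copies independent trials succeed.

    Assumes each trial succeeds independently with probability p
    (an illustrative simplification, not a claim from the paper).
    """
    threshold = n_copies // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n_copies, k) * p**k * (1 - p) ** (n_copies - k)
        for k in range(threshold, n_copies + 1)
    )
```

When `p` exceeds one half, adding copies (for odd counts) pushes the majority probability toward 1, at the price of one extra model query per copy; when `p` is below one half, more copies make the vote worse, which is why the brittleness assumption listed above carries the argument.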

pith-pipeline@v0.9.0 · 5488 in / 1416 out tokens · 70401 ms · 2026-05-14T17:05:31.423908+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

    cs.MA 2026-05 unverdicted novelty 7.0

    OrchJail uses orchestration-guided fuzzing to jailbreak tool-calling T2I agents by targeting high-risk tool patterns, yielding higher attack success rates, better image quality, and lower costs than prior prompt-only methods.

  2. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  3. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...

  4. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  5. Attention Is Where You Attack

    cs.CR 2026-04 unverdicted novelty 7.0

    ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

  6. Adaptive Prompt Embedding Optimization for LLM Jailbreaking

    cs.AI 2026-04 unverdicted novelty 7.0

    PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...

  7. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  8. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  9. Behavioral Integrity Verification for AI Agent Skills

    cs.CR 2026-05 unverdicted novelty 6.0

    BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.

  10. Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

    cs.CR 2026-05 accept novelty 6.0

    JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...

  11. Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

    cs.AI 2026-05 unverdicted novelty 6.0

    PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

  12. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...

  13. SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.

  14. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  15. The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

  16. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    cs.CL 2024-03 conditional novelty 6.0

    InjecAgent benchmark demonstrates that tool-integrated LLM agents are vulnerable to indirect prompt injection attacks, with ReAct-prompted GPT-4 succeeding on 24% of attacks and nearly twice that rate when attacker in...

  17. Jailbreaking Black Box Large Language Models in Twenty Queries

    cs.LG 2023-10 conditional novelty 6.0

    PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.

  18. Re-Triggering Safeguards within LLMs for Jailbreak Detection

    cs.CR 2026-05 unverdicted novelty 5.0

    Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.

  19. SoK: Robustness in Large Language Models against Jailbreak Attacks

    cs.CR 2026-05 accept novelty 5.0

    The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

  20. SALLIE: Safeguarding Against Latent Language & Image Exploits

    cs.CR 2026-04 unverdicted novelty 5.0

    SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

  21. MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning

    cs.CL 2026-05 unverdicted novelty 4.0

    MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 20 Pith papers · 18 internal anchors

  1. [1]

    Realtoxicityprompts: Evaluating neural toxic degeneration in language models

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020. 1

  2. [2]

    The ai alignment problem: why it is hard, and where to start

    Eliezer Yudkowsky. The ai alignment problem: why it is hard, and where to start. Symbolic Systems Distinguished Speaker, 4, 2016. 1

  3. [3]

    Artificial intelligence, values, and alignment

    Iason Gabriel. Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437, 2020

  4. [4]

    The alignment problem: Machine learning and human values

    Brian Christian. The alignment problem: Machine learning and human values. WW Norton & Company,

  5. [5]

    Regulating chatgpt and other large generative ai models

    Philipp Hacker, Andreas Engel, and Marco Mauer. Regulating chatgpt and other large generative ai models. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1112–1123, 2023. 1

  6. [6]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback, 2022. URL https://arxiv. org/abs/2203.02155, 13, 2022

  7. [7]

    Improving alignment of dialogue agents via targeted human judgements

    Amelia Glaese, Nat McAleese, Maja Tr˛ ebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022. 1

  8. [8]

    Toxicity in chatgpt: Analyzing persona-assigned language models

    Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023. 1

  9. [9]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023. 1, 3

  10. [10]

    Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023

    Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023

  11. [11]

    A safe harbor for ai evaluation and red teaming

    Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, et al. A safe harbor for ai evaluation and red teaming. arXiv preprint arXiv:2403.04893, 2024. 1

  12. [12]

    Adversarial demonstra- tion attacks on large language models

    Jiongxiao Wang, Zichen Liu, Keun Hee Park, Muhao Chen, and Chaowei Xiao. Adversarial demonstra- tion attacks on large language models. arXiv preprint arXiv:2305.14950, 2023. 1

  13. [13]

    Risks of ai foundation models in education

    Su Lin Blodgett and Michael Madaio. Risks of ai foundation models in education. arXiv preprint arXiv:2110.10024, 2021. 1

  14. [14]

    Chatgpt utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns

    Malik Sallam. Chatgpt utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. In Healthcare, volume 11, page 887. MDPI, 2023. 1

  15. [15]

    BloombergGPT: A Large Language Model for Finance

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023. 1

  16. [16]

    Adversarial prompting for black box foundation models

    Natalie Maus, Patrick Chao, Eric Wong, and Jacob Gardner. Adversarial prompting for black box foundation models. arXiv preprint arXiv:2302.04237, 2023. 1 14

  17. [17]

    Autoprompt: Eliciting knowledge from language models with automatically generated prompts

    Taylor Shin, Yasaman Razeghi, Robert L Logan IV , Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020

  18. [18]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023. 2, 3, 9, 12, 26

  19. [19]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023. 1

  20. [20]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. 1, 2, 3, 5, 6, 9, 10, 12, 25, 27, 28, 29, 30, 31, 33, 34, 37, 40

  21. [21]

    Jailbreaking leading safety- aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety- aligned llms with simple adaptive attacks. arXiv preprint arXiv:2404.02151, 2024. 2, 3, 9

  22. [22]

    Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms,

    Zeyi Liao and Huan Sun. Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms. arXiv preprint arXiv:2404.07921, 2024. 2, 3, 9

  23. [23]

    Attack- ing large language models with projected gradient descent

    Simon Geisler, Tom Wollschläger, MHI Abdalla, Johannes Gasteiger, and Stephan Günnemann. Attack- ing large language models with projected gradient descent. arXiv preprint arXiv:2402.09154, 2024. 1, 3

  24. [24]

    Certified robustness to adversarial examples with differential privacy

    Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE symposium on security and privacy (SP), pages 656–672. IEEE, 2019. 2, 5, 38

  25. [25]

    Certified adversarial robustness via randomized smoothing

    Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pages 1310–1320. PMLR, 2019. 11, 38, 39

  26. [26]

    Provably robust deep learning via adversarially trained smoothed classifiers

    Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019. 2, 5, 11, 38

  27. [27]

    Jail- breakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jail- breakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318, 2024. 3, 32

  28. [28]

    Low-resource languages jailbreak gpt-4

    Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023. 3

  29. [29]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023. 3, 26

  30. [30]

    Catastrophic jailbreak of open-source llms via exploiting generation

    Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. arXiv preprint arXiv:2310.06987, 2023. 3

  31. [31]

    A survey of adversarial defenses and robustness in nlp

    Shreya Goyal, Sumanth Doddapaneni, Mitesh M Khapra, and Balaraman Ravindran. A survey of adversarial defenses and robustness in nlp. ACM Computing Surveys, 55(14s):1–39, 2023. 4

  32. [32]

    Adversarial training for large neural language models

    Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020. 4 15

  33. [33]

    Adversarial training methods for semi-supervised text classification

    Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016. 4

  34. [34]

    TextBugger: Generating Adversarial Text Against Real-world Applications

    Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018. 4, 39

  35. [35]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023. 4, 10, 32

  36. [36]

    Certifying llm safety against adversarial prompting

    Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023. 4

  37. [37]

    On adaptive attacks to adversarial example defenses

    Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. Advances in neural information processing systems, 33:1633–1645, 2020. 9

  38. [38]

    Detecting language model attacks with perplexity

    Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023. 10, 32

  39. [39]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911,

  40. [40]

    Piqa: Reasoning about physical common- sense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical common- sense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020. 11, 29

  41. [41]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018. 11, 29

  42. [42]

    Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022. 11, 29

  43. [43]

    Robustbench: a standardized adversarial robustness benchmark

    Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020. 11

  44. [44]

    Robustness and accuracy tradeoffs for recommender systems under attack

    Carlos E Seminario and David C Wilson. Robustness and accuracy tradeoffs for recommender systems under attack. In Twenty-Fifth International FLAIRS Conference, 2012. 11

  45. [45]

    Fast is better than free: Revisiting adversarial training

    Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994, 2020. 11

  46. [46]

    Query complexity of adversarial attacks

    Grzegorz Gluch and Rüdiger Urbanke. Query complexity of adversarial attacks. In International Conference on Machine Learning, pages 3723–3733. PMLR, 2021

  47. [47]

    Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019

    Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019. 11

  48. [48]

    Towards Deep Learning Models Resistant to Adversarial Attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. To- wards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. 12, 38 16

  49. [49]

    Generalizing to unseen domains via adversarial data augmentation

    Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. Advances in neural information processing systems, 31, 2018. 12

  50. [50]

    Denoised smoothing: A provable defense for pretrained classifiers

    Hadi Salman, Mingjie Sun, Greg Yang, Ashish Kapoor, and J Zico Kolter. Denoised smoothing: A provable defense for pretrained classifiers. Advances in Neural Information Processing Systems, 33:21945– 21957, 2020. 13

  51. [51]

    (certified!!) adversarial robustness for free! arXiv preprint arXiv:2206.10550, 2022

    Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, Mingjie Sun, and J Zico Kolter. (certified!!) adversarial robustness for free! arXiv preprint arXiv:2206.10550, 2022. 13

  52. [52]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 25

  53. [53]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 25

  54. [54]

    Robustness may be at odds with accuracy

    Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018. 29, 40

  55. [55]

    Provable tradeoffs in adversarially robust classification

    Edgar Dobriban, Hamed Hassani, David Hong, and Alexander Robey. Provable tradeoffs in adversarially robust classification. IEEE Transactions on Information Theory, 2023. 40

  56. [56]

    Precise tradeoffs in adversarial training for linear regression

    Adel Javanmard, Mahdi Soltanolkotabi, and Hamed Hassani. Precise tradeoffs in adversarial training for linear regression. In Conference on Learning Theory, pages 2034–2078. PMLR, 2020. 29

  57. [57]

    Perceptual adversarial robustness: Defense against unseen threat models

    Cassidy Laidlaw, Sahil Singla, and Soheil Feizi. Perceptual adversarial robustness: Defense against unseen threat models. arXiv preprint arXiv:2006.12655, 2020. 38

  58. [58]

    Model-based robust deep learning: Generalizing to natural, out-of-distribution data

    Alexander Robey, Hamed Hassani, and George J Pappas. Model-based robust deep learning: Generalizing to natural, out-of-distribution data. arXiv preprint arXiv:2005.10247, 2020

  59. [59]

    Learning perturbation sets for robust machine learning

    Eric Wong and J Zico Kolter. Learning perturbation sets for robust machine learning. arXiv preprint arXiv:2007.08450, 2020. 38

  60. [60]

    Breeds: Benchmarks for subpopulation shift

    Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. Breeds: Benchmarks for subpopulation shift. arXiv preprint arXiv:2008.04859, 2020. 38

  61. [61]

    Wilds: A benchmark of in-the-wild distribution shifts

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637–5664. PMLR,

  62. [62]

    Invariant Risk Minimization

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019. 38

  63. [63]

    Probable domain generalization via quantile risk minimization

    Cian Eastwood, Alexander Robey, Shashank Singh, Julius Von Kügelgen, Hamed Hassani, George J Pappas, and Bernhard Schölkopf. Probable domain generalization via quantile risk minimization. Advances in Neural Information Processing Systems, 35:17340–17358, 2022

  64. [64]

    Model-based domain generalization

    Alexander Robey, George J Pappas, and Hamed Hassani. Model-based domain generalization. Advances in Neural Information Processing Systems, 34:20210–20229, 2021. 38

  65. [65]

    Evasion attacks against machine learning at test time

    Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pages 38...

  66. [66]

    Intriguing properties of neural networks

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. 38

  67. [67]

    Explaining and Harnessing Adversarial Examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. 38

  68. [68]

    Theoretically principled trade-off between robustness and accuracy

    Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pages 7472–7482. PMLR, 2019. 38

  69. [69]

    Randomized smoothing of all shapes and sizes

    Greg Yang, Tony Duan, J Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pages 10693–10705. PMLR, 2020. 38, 39

  70. [70]

    (de) randomized smoothing for certifiable defense against patch attacks

    Alexander Levine and Soheil Feizi. (de) randomized smoothing for certifiable defense against patch attacks. Advances in Neural Information Processing Systems, 33:6465–6475, 2020. 38, 39

  71. [71]

    Certified defences against adversarial patch attacks on semantic segmentation

    Maksym Yatsura, Kaspar Sakmann, N Grace Hua, Matthias Hein, and Jan Hendrik Metzen. Certified defences against adversarial patch attacks on semantic segmentation. arXiv preprint arXiv:2209.05980, 2022

  72. [72]

    Stability guarantees for feature attributions with multiplicative smoothing

    Anton Xue, Rajeev Alur, and Eric Wong. Stability guarantees for feature attributions with multiplicative smoothing. arXiv preprint arXiv:2307.05902, 2023. 38, 39

  73. [73]

    ℓ1 adversarial robustness certificates: a randomized smoothing approach

    Jiaye Teng, Guang-He Lee, and Yang Yuan. ℓ1 adversarial robustness certificates: a randomized smoothing approach. 2019. 39

  74. [74]

    Certified defense to image transformations via randomized smoothing

    Marc Fischer, Maximilian Baader, and Martin Vechev. Certified defense to image transformations via randomized smoothing. Advances in Neural information processing systems, 33:8404–8417, 2020. 39

  75. [75]

    Certified robustness to label-flipping attacks via randomized smoothing

    Elan Rosenfeld, Ezra Winston, Pradeep Ravikumar, and Zico Kolter. Certified robustness to label-flipping attacks via randomized smoothing. In International Conference on Machine Learning, pages 8230–8241. PMLR, 2020. 39

  76. [76]

    Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp

    John X Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp. arXiv preprint arXiv:2005.05909,

  77. [77]

    Adversarial attacks on deep-learning models in natural language processing: A survey

    Wei Emma Zhang, Quan Z Sheng, Ahoud Alhazmi, and Chenliang Li. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41, 2020. 39

  78. [78]

    Generating natural language adversarial examples through probability weighted word saliency

    Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 1085–1097, 2019. 39

  79. [79]

    Natural language adversarial attack and defense in word level

    Xiaosen Wang, Hao Jin, and Kun He. Natural language adversarial attack and defense in word level. arXiv preprint arXiv:1909.06723, 2019

  80. [80]

    Generating Natural Language Adversarial Examples

    Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018. 39

Showing first 80 references.