Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Pith reviewed 2026-05-13 23:20 UTC · model grok-4.3
The pith
Weak discrete text optimizers and high optimization costs blunt adaptive attacks, making baseline defenses effective against jailbreaking of aligned language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Baseline defenses including perplexity detection, paraphrase and retokenization preprocessing, and adversarial training offer varying degrees of protection depending on the access level and trade-offs considered.
What carries the argument
Baseline defenses consisting of perplexity-based detection, input preprocessing via paraphrasing and retokenization, and adversarial training, evaluated in white-box and gray-box settings.
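To make the detection arm concrete, here is a minimal sketch of a perplexity filter of the kind the paper evaluates; the choice of GPT-2 as the scoring model, the threshold, and the window size are illustrative assumptions rather than the paper's settings.

```python
# Sketch of a perplexity-based jailbreak filter: score the prompt with a small
# causal LM and flag it if the whole prompt, or any token window, is too unlikely.
# Assumptions: GPT-2 as scorer; threshold and window size are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_nll(text: str) -> float:
    """Average negative log-likelihood per token (log-perplexity) under the scorer."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def flag_prompt(prompt: str, threshold: float = 6.0, window: int = 10) -> bool:
    """True if the prompt looks like discrete-optimizer output rather than natural text."""
    if avg_nll(prompt) > threshold:
        return True
    ids = tokenizer(prompt).input_ids
    # Windowed check: a short adversarial suffix appended to otherwise fluent
    # text can hide inside the full-prompt average.
    if len(ids) >= window:
        for start in range(len(ids) - window + 1):
            if avg_nll(tokenizer.decode(ids[start:start + window])) > threshold:
                return True
    return False
```

Attacks like GCG produce gibberish-looking suffixes, which is what makes such a filter viable; the trade-off is false positives on unusual but benign prompts.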
If this is right
- Perplexity-based detection can flag many jailbreaking attempts effectively.
- Preprocessing steps like paraphrasing reduce the success rate of attacks (composed with detection in the pipeline sketch after this list).
- Adversarial training enhances model robustness at some cost to normal performance.
- Gray-box and white-box adaptive attacks are limited by optimization difficulties in text.
- Stronger defenses may be more viable in LLMs than in computer vision due to domain differences.
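Read together, the points above describe a layered pipeline; the following sketch composes the detection and preprocessing defenses in front of a target model. The generate callable and the paraphrase instruction are placeholders for whatever model and prompt one actually uses, not the paper's implementation.

```python
# Sketch of composing baseline defenses in front of a chat model.
# `generate` is any text-generation callable (API call or local model);
# `is_high_perplexity` is a detector such as flag_prompt above.
from typing import Callable

def defend_and_answer(
    prompt: str,
    generate: Callable[[str], str],
    is_high_perplexity: Callable[[str], bool],
) -> str:
    # 1. Detection: refuse if the prompt looks like discrete-optimizer output.
    if is_high_perplexity(prompt):
        return "Request declined by the perplexity filter."
    # 2. Preprocessing: paraphrase so that brittle adversarial suffixes are
    #    unlikely to survive the rewording.
    paraphrased = generate("Paraphrase the following text:\n" + prompt)
    # 3. Answer the paraphrased prompt with the (possibly adversarially
    #    trained) aligned model.
    return generate(paraphrased)
```

In the gray-box setting the attacker cannot optimize directly against the filter or paraphraser; in the white-box setting these components themselves become part of the attack surface, which is where the cost of adaptive optimization bites.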
Where Pith is reading between the lines
- Simple input filtering could become a standard first line of defense for deployed LLMs.
- Research into better discrete optimizers might close the current gap and require stronger defenses.
- These findings suggest future security evaluations should weigh attack cost and efficiency, not just success rate.
- Testing on a wider range of models could reveal if the defense effectiveness generalizes.
Load-bearing premise
The specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.
What would settle it
Demonstrating a low-cost, high-success discrete optimizer that consistently bypasses the tested defenses on current aligned models would falsify the central claim about attack difficulty.
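Any such falsification attempt would have to fold the defenses into the attack objective. Below is a minimal sketch of what that objective could look like for a perplexity-filter-aware attack; the fluency penalty and its weight are assumptions for illustration, not a method from the paper.

```python
# Sketch of an adaptive attack objective for a HuggingFace-style causal LM:
# the usual target-forcing loss plus a penalty that keeps the adversarial
# suffix fluent enough to pass a perplexity filter. A discrete optimizer
# (e.g. greedy coordinate search over suffix tokens) would minimize this.
import torch

def adaptive_loss(model, prompt_ids, suffix_ids, target_ids, lam: float = 0.1):
    """prompt_ids, suffix_ids, target_ids: 1-D LongTensors of token ids."""
    input_ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    logprobs = torch.log_softmax(model(input_ids).logits[0], dim=-1)

    # Negative log-likelihood of the harmful target continuation
    # (the logit at position i predicts the token at position i + 1).
    t0 = prompt_ids.numel() + suffix_ids.numel()
    tgt_lp = logprobs[t0 - 1 : t0 - 1 + target_ids.numel()]
    target_nll = -tgt_lp.gather(1, target_ids.unsqueeze(1)).mean()

    # Fluency penalty: average NLL of the suffix under the same model,
    # standing in for the perplexity score a filter would compute.
    s0 = prompt_ids.numel()
    sfx_lp = logprobs[s0 - 1 : s0 - 1 + suffix_ids.numel()]
    suffix_nll = -sfx_lp.gather(1, suffix_ids.unsqueeze(1)).mean()

    return target_nll + lam * suffix_nll
```

The abstract's open question is whether any discrete optimizer can drive both terms down at acceptable cost; its pessimistic reading is that existing optimizers struggle with exactly this kind of constrained search.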
read the original abstract
As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLMs domain than it has been in computer vision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates baseline defenses (perplexity-based detection, paraphrase and retokenization preprocessing, and adversarial training) against adversarial jailbreaking attacks on aligned LLMs. It considers white-box and gray-box threat models, discusses robustness-performance trade-offs, and concludes that the weakness of existing discrete text optimizers combined with high optimization costs renders standard adaptive attacks more challenging for LLMs than in computer vision.
Significance. If the empirical results hold under properly adaptive attacks, the work provides a useful initial map of feasible defenses and highlights domain-specific differences from vision-based adversarial ML. The focus on practical threat models and the call for stronger optimizers or more robust filtering could usefully guide follow-up research.
major comments (2)
- [§4] §4 (Experimental Evaluation): The central claim that 'the weakness of existing discrete optimizers for text... makes standard adaptive attacks more challenging' requires that the reported attacks were adapted to each defense (e.g., by placing the perplexity filter, paraphrase step, or retokenization inside the discrete search loop or via a surrogate). The methodology description does not specify this; if attacks were optimized only against the undefended model and then transferred, the observed robustness is consistent with non-adaptation rather than inherent optimizer limits.
- [§4.1] §4.1 and associated tables: No error bars, number of random seeds, or statistical significance tests are reported for attack success rates. Given the stochastic nature of both the optimizer and the LLM sampling, single-run numbers are insufficient to support claims about relative defense strength or the robustness-performance trade-off.
minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., attack success rates under the strongest defense) rather than remaining purely qualitative.
- [§3] Notation for threat models (white-box vs. gray-box) is introduced but not consistently used when presenting results; a table mapping each defense to its feasible threat model would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on threat models and statistical reporting.
read point-by-point responses
-
Referee: [§4] §4 (Experimental Evaluation): The central claim that 'the weakness of existing discrete optimizers for text... makes standard adaptive attacks more challenging' requires that the reported attacks were adapted to each defense (e.g., by placing the perplexity filter, paraphrase step, or retokenization inside the discrete search loop or via a surrogate). The methodology description does not specify this; if attacks were optimized only against the undefended model and then transferred, the observed robustness is consistent with non-adaptation rather than inherent optimizer limits.
Authors: We acknowledge that our attack optimizations were performed against the undefended base model, with defenses applied only at evaluation time for detection and preprocessing methods (adversarial training was incorporated during optimization). This setup demonstrates transferability of attacks rather than full adaptivity. We agree this means the results do not conclusively prove inherent optimizer limits against adaptive attacks. We will revise §4 to explicitly describe the methodology and threat models, adjust the central claim to focus on the observed challenges with transferred attacks and high optimization costs, and add discussion of why full adaptive attacks (with defenses in the loop) remain computationally difficult. revision: partial
-
Referee: [§4.1] §4.1 and associated tables: No error bars, number of random seeds, or statistical significance tests are reported for attack success rates. Given the stochastic nature of both the optimizer and the LLM sampling, single-run numbers are insufficient to support claims about relative defense strength or the robustness-performance trade-off.
Authors: We agree that variability reporting is needed given the stochastic elements. Experiments used a single fixed seed per configuration for reproducibility, without multiple independent runs due to high computational costs of discrete optimization. In the revision we will explicitly state the number of seeds (one), add a limitations discussion, include error bars or variability notes where preliminary multi-run data exists, and perform basic statistical comparisons on the reported success rates. revision: yes
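As a pointer to what minimal variability reporting could look like, a Wilson score interval over n attack prompts is cheap to add even with a single seed; this is an illustrative sketch, not something the paper commits to.

```python
# 95% Wilson score confidence interval for an attack success rate measured
# as k jailbreaks out of n prompts. Illustrative only.
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n == 0:
        raise ValueError("need at least one trial")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# e.g. 12 jailbreaks out of 100 attack prompts -> roughly (0.07, 0.20)
low, high = wilson_interval(12, 100)
```

This captures binomial uncertainty across prompts but not the seed-to-seed variance of the optimizer, which is the part that genuinely requires extra compute.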
Circularity Check
No circularity: empirical evaluation with independent experimental comparisons
full rationale
The paper presents an empirical study of baseline defenses (perplexity detection, paraphrase/retokenization preprocessing, adversarial training) against text-based adversarial attacks on LLMs. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. Central claims rest on reported attack success rates and robustness-performance trade-offs from experiments in white-box/gray-box settings. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes are present. The finding on discrete optimizer weakness follows directly from the observed experimental outcomes rather than reducing to any input by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Threat models and attack methods from computer vision transfer meaningfully to discrete text optimization in LLMs.
Forward citations
Cited by 24 Pith papers
-
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
-
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2....
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
-
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
-
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
-
A Sentence Relation-Based Approach to Sanitizing Malicious Instructions
SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.
-
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems
ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...
-
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models...
-
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
-
An AI Agent Execution Environment to Safeguard User Data
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
-
How Adversarial Environments Mislead Agentic AI?
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
-
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
-
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
-
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Reference graph
Works this paper leans on
-
[1]
Obfuscated Gradients Give a False Sense of Security : Circumventing Defenses to Adversarial Examples
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated Gradients Give a False Sense of Security : Circumventing Defenses to Adversarial Examples . In Proceedings of the 35th International Conference on Machine Learning , pp.\ 274--283. PMLR , July 2018. URL https://proceedings.mlr.press/v80/athalye18a.html
work page 2018
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022 a
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022 b
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Enhancing robustness of machine learning systems via data transformations
Arjun Nitin Bhagoji, Daniel Cullina, Chawin Sitawarin, and Prateek Mittal. Enhancing robustness of machine learning systems via data transformations. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pp.\ 1--5. IEEE, 2018
work page 2018
-
[5]
Adversarial Examples Are Not Easily Detected : Bypassing Ten Detection Methods
Nicholas Carlini and David Wagner. Adversarial Examples Are Not Easily Detected : Bypassing Ten Detection Methods . In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security , AISec '17, pp.\ 3--14, New York, NY, USA , November 2017. Association for Computing Machinery . ISBN 978-1-4503-5202-4. doi:10.1145/3128572.3140444. URL https:...
-
[6]
On Evaluating Adversarial Robustness
Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On Evaluating Adversarial Robustness . arxiv:1902.06705[cs, stat], February 2019. doi:10.48550/arXiv.1902.06705. URL http://arxiv.org/abs/1902.06705
-
[7]
Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo , Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? arxiv:2306.15447[cs], June 2023. doi:10.48550/arXiv.2306.15447. URL http://arxiv.org/abs/2306.15447
-
[8]
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, and Dylan Hadfield-Menell . Explore, Establish , Exploit : Red Teaming Language Models from Scratch . arxiv:2306.09442[cs], June 2023. doi:10.48550/arXiv.2306.09442. URL http://arxiv.org/abs/2306.09442
-
[9]
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/
work page 2023
-
[10]
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Alpacafarm: A simulation framework for methods that learn from human feedback
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023
-
[12]
HotFlip: White-box Adversarial Examples for Text Classification
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 31--36, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-2006. UR...
-
[13]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
Black-box generation of adversarial text sequences to evade deep learning classifiers
Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp.\ 50--56. IEEE, 2018
work page 2018
-
[15]
On the Limitations of Stochastic Pre-processing Defenses
Yue Gao, I. Shumailov, Kassem Fawaz, and Nicolas Papernot. On the Limitations of Stochastic Pre-processing Defenses . In Advances in Neural Information Processing Systems , volume 35, pp.\ 24280--24294, December 2022. URL https://proceedings.neurips.cc/paper\_files/paper/2022/hash/997089469acbeb410405e43f0011be1f-Abstract-Conference.html
-
[16]
Breaking certified defenses: Semantic adversarial examples with spoofed robustness certificates
Amin Ghiasi, Ali Shafahi, and Tom Goldstein. Breaking certified defenses: Semantic adversarial examples with spoofed robustness certificates. In International Conference on Learning Representations, 2019
work page 2019
-
[17]
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022
work page internal anchor Pith review arXiv 2022
-
[18]
Adversarial attacks on machine learning systems for high-frequency trading
Micah Goldblum, Avi Schwarzschild, Ankit Patel, and Tom Goldstein. Adversarial attacks on machine learning systems for high-frequency trading. In Proceedings of the Second ACM International Conference on AI in Finance, pp.\ 1--9, 2021
work page 2021
-
[19]
Explaining and Harnessing Adversarial Examples
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[20]
Not What You've Signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arxiv:2302.12173[cs], May 2023. doi:10.48550/arXiv.2302.12173. URL http://arxiv.org/abs/2302.12173
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.12173 2023
-
[21]
On the ( Statistical ) Detection of Adversarial Examples
Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the ( Statistical ) Detection of Adversarial Examples . arxiv:1702.06280[cs, stat], October 2017. doi:10.48550/arXiv.1702.06280. URL http://arxiv.org/abs/1702.06280
-
[22]
Towards deep neural network architectures robust to adversarial examples
Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014
-
[23]
Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. Gradient-based Adversarial Attacks against Text Transformers. arxiv:2104.13733[cs], April 2021. doi:10.48550/arXiv.2104.13733. URL http://arxiv.org/abs/2104.13733
-
[24]
Unsolved Problems in ML Safety
Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved Problems in ML Safety. arxiv:2109.13916[cs], June 2022. doi:10.48550/arXiv.2109.13916. URL http://arxiv.org/abs/2109.13916
-
[25]
Semantic Adversarial Examples
Hossein Hosseini and Radha Poovendran. Semantic adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.\ 1614--1619, 2018
work page 2018
-
[26]
Bring your own data! self-supervised evaluation for large language models
Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Bring your own data! self-supervised evaluation for large language models. arXiv preprint arXiv:2306.13651, 2023
-
[27]
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs : Dual-Use Through Standard Security Attacks . arxiv:2302.05733[cs], February 2023. doi:10.48550/arXiv.2302.05733. URL http://arxiv.org/abs/2302.05733
-
[28]
On the Reliability of Watermarks for Large Language Models
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634, 2023
-
[29]
Pretraining language models with human preferences
Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In International Conference on Machine Learning, pp.\ 17506--17533. PMLR, 2023
work page 2023
-
[30]
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 66--75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1007. URL https://a...
-
[31]
Multi-step jailbreaking privacy attacks on chatgpt
Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT . arxiv:2304.05197[cs], May 2023. doi:10.48550/arXiv.2304.05197. URL http://arxiv.org/abs/2304.05197
-
[32]
TextBugger: Generating Adversarial Text Against Real-world Applications
Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018
work page Pith review arXiv 2018
-
[33]
BERT - ATTACK : Adversarial attack against BERT using BERT
Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT - ATTACK : Adversarial attack against BERT using BERT . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 6193--6202, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.500. URL https:...
-
[34]
Towards Deep Learning Models Resistant to Adversarial Attacks
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[35]
FLIRT : Feedback Loop In-context Red Teaming
Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, and Rahul Gupta. FLIRT : Feedback Loop In-context Red Teaming . arxiv:2308.04265[cs], August 2023. doi:10.48550/arXiv.2308.04265. URL http://arxiv.org/abs/2308.04265
-
[36]
Magnet: a two-pronged defense against adversarial examples
Dongyu Meng and Hao Chen. Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp.\ 135--147, 2017 a
work page 2017
-
[37]
MagNet : A Two-Pronged Defense against Adversarial Examples
Dongyu Meng and Hao Chen. MagNet : A Two-Pronged Defense against Adversarial Examples . In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security , CCS '17, pp.\ 135--147, New York, NY, USA , October 2017 b . Association for Computing Machinery . ISBN 978-1-4503-4946-8. doi:10.1145/3133956.3134057. URL https://dl.acm.org/doi...
-
[38]
On detecting adversarial perturbations
Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017
-
[39]
Randomized smoothing with masked inference for adversarially robust text classifications
Han Cheol Moon, Shafiq Joty, Ruochen Zhao, Megh Thakkar, and Xu Chi. Randomized smoothing with masked inference for adversarially robust text classifications. arXiv preprint arXiv:2305.06522, 2023
-
[40]
Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp
John X. Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. TextAttack : A Framework for Adversarial Attacks , Data Augmentation , and Adversarial Training in NLP . arXiv:2005.05909 [cs], October 2020. URL http://arxiv.org/abs/2005.05909. arXiv: 2005.05909
-
[41]
Diffusion Models for Adversarial Purification
Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Animashree Anandkumar. Diffusion Models for Adversarial Purification . In Proceedings of the 39th International Conference on Machine Learning , pp.\ 16805--16827. PMLR , June 2022. URL https://proceedings.mlr.press/v162/nie22a.html
work page 2022
-
[42]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730--27744, 2022
work page 2022
-
[43]
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023. URL https://arxiv.org/abs/2306.01116
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. arxiv:2211.09527[cs], November 2022. doi:10.48550/arXiv.2211.09527. URL http://arxiv.org/abs/2211.09527
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.09527 2022
-
[45]
Bpe-dropout: Simple and effective subword regularization
Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. Bpe-dropout: Simple and effective subword regularization. arXiv preprint arXiv:1910.13267, 2019
-
[46]
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual Adversarial Examples Jailbreak Aligned Large Language Models . In The Second Workshop on New Frontiers in Adversarial Machine Learning , August 2023. URL https://openreview.net/forum?id=cZ4j7L6oui
work page 2023
-
[47]
Data Augmentation Can Improve Robustness
Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, and Timothy A Mann. Data Augmentation Can Improve Robustness . In Advances in Neural Information Processing Systems , volume 34, pp.\ 29935--29948. Curran Associates, Inc. , 2021. URL https://proceedings.neurips.cc/paper/2021/hash/fb4c48608ce8825b558ccf07169a3421-Abst...
work page 2021
-
[48]
Defense-gan: Protecting classifiers against adversarial attacks using generative models
Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018
-
[49]
Adversarial Training for Free!
Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! In Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc. , 2019. URL https://proceedings.neurips.cc/paper\_files/paper/2019/hash/7503cfacd12053d309b6be...
work page 2019
-
[50]
"Do Anything Now": Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023
-
[51]
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020
-
[52]
Intriguing properties of neural networks
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
- [53]
-
[54]
Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs
MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. URL www.mosaicml.com/blog/mpt-7b
work page 2023
-
[55]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
Detecting Adversarial Examples Is ( Nearly ) As Hard As Classifying Them
Florian Tramer. Detecting Adversarial Examples Is ( Nearly ) As Hard As Classifying Them . In Proceedings of the 39th International Conference on Machine Learning , pp.\ 21692--21702. PMLR , June 2022. URL https://proceedings.mlr.press/v162/tramer22a.html
work page 2022
-
[57]
On Adaptive Attacks to Adversarial Example Defenses
Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On Adaptive Attacks to Adversarial Example Defenses . In Advances in Neural Information Processing Systems , volume 33, pp.\ 1633--1645. Curran Associates, Inc. , 2020. URL https://proceedings.neurips.cc/paper/2020/hash/11f38f8ecd71867b42433548d1078e38-Abstract.html
work page 2020
-
[58]
Universal adversarial triggers for attacking and analyzing NLP
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.\ 2153--2162, Hong Kong, China, Novem...
-
[59]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[60]
Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery, 2023
Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard Prompts Made Easy : Gradient-Based Discrete Optimization for Prompt Tuning and Discovery . arXiv preprint arXiv:2302.03668, February 2023. URL https://arxiv.org/abs/2302.03668v1
-
[61]
Making an invisibility cloak: Real world adversarial attacks on object detectors
Zuxuan Wu, Ser-Nam Lim, Larry S Davis, and Tom Goldstein. Making an invisibility cloak: Real world adversarial attacks on object detectors. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part IV 16, pp.\ 1--17. Springer, 2020
work page 2020
-
[62]
Adversarial examples: Attacks and defenses for deep learning
Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems, 30(9): 2805--2824, 2019
work page 2019
-
[63]
Certified robustness for large language models with self-denoising
Zhen Zhang, Guanhua Zhang, Bairu Hou, Wenqi Fan, Qing Li, Sijia Liu, Yang Zhang, and Shiyu Chang. Certified robustness for large language models with self-denoising. arXiv preprint arXiv:2307.07171, 2023
-
[64]
Freelb: Enhanced adversarial training for natural language understanding
Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. arXiv preprint arXiv:1909.11764, 2019
-
[65]
PromptBench : Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Neil Zhenqiang Gong, Yue Zhang, and Xing Xie. PromptBench : Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts . arxiv:2306.04528[cs], June 2023. doi:10.48550/arXiv.2306.04528. URL http://arxiv.org/abs/2306.04528
-
[66]
Increasing Confidence in Adversarial Robustness Evaluations
Roland S. Zimmermann, Wieland Brendel, Florian Tramer, and Nicholas Carlini. Increasing Confidence in Adversarial Robustness Evaluations . In Advances in Neural Information Processing Systems , volume 35, pp.\ 13174--13189, December 2022. URL https://proceedings.neurips.cc/paper\_files/paper/2022/hash/5545d9bcefb7d03d5ad39a905d14fbe3-Abstract-Conference.html
work page 2022
-
[67]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models . arxiv:2307.15043[cs], July 2023. doi:10.48550/arXiv.2307.15043. URL http://arxiv.org/abs/2307.15043
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.15043 2023