Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Pith reviewed 2026-05-13 23:20 UTC · model grok-4.3
The pith
Weak discrete text optimizers and high optimization costs blunt adaptive attacks, making baseline defenses effective against jailbreaking of aligned language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Baseline defenses including perplexity detection, paraphrase and retokenization preprocessing, and adversarial training offer varying degrees of protection depending on the access level and trade-offs considered.
What carries the argument
Baseline defenses consisting of perplexity-based detection, input preprocessing via paraphrasing and retokenization, and adversarial training, evaluated in white-box and gray-box settings.
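To make the detection arm concrete, here is a minimal sketch of a perplexity filter of the kind the paper evaluates; the choice of GPT-2 as the scoring model, the threshold, and the window size are illustrative assumptions rather than the paper's settings.

```python
# Sketch of a perplexity-based jailbreak filter: score the prompt with a small
# causal LM and flag it if the whole prompt, or any token window, is too unlikely.
# Assumptions: GPT-2 as scorer; threshold and window size are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_nll(text: str) -> float:
    """Average negative log-likelihood per token (log-perplexity) under the scorer."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def flag_prompt(prompt: str, threshold: float = 6.0, window: int = 10) -> bool:
    """True if the prompt looks like discrete-optimizer output rather than natural text."""
    if avg_nll(prompt) > threshold:
        return True
    ids = tokenizer(prompt).input_ids
    # Windowed check: a short adversarial suffix appended to otherwise fluent
    # text can hide inside the full-prompt average.
    if len(ids) >= window:
        for start in range(len(ids) - window + 1):
            if avg_nll(tokenizer.decode(ids[start:start + window])) > threshold:
                return True
    return False
```

Attacks like GCG produce gibberish-looking suffixes, which is what makes such a filter viable; the trade-off is false positives on unusual but benign prompts.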
If this is right
- Perplexity-based detection can flag many jailbreaking attempts effectively.
- Preprocessing steps like paraphrasing reduce the success rate of attacks (composed with detection in the pipeline sketch after this list).
- Adversarial training enhances model robustness at some cost to normal performance.
- Gray-box and white-box adaptive attacks are limited by optimization difficulties in text.
- Stronger defenses may be more viable in LLMs than in computer vision due to domain differences.
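Read together, the points above describe a layered pipeline; the following sketch composes the detection and preprocessing defenses in front of a target model. The generate callable and the paraphrase instruction are placeholders for whatever model and prompt one actually uses, not the paper's implementation.

```python
# Sketch of composing baseline defenses in front of a chat model.
# `generate` is any text-generation callable (API call or local model);
# `is_high_perplexity` is a detector such as flag_prompt above.
from typing import Callable

def defend_and_answer(
    prompt: str,
    generate: Callable[[str], str],
    is_high_perplexity: Callable[[str], bool],
) -> str:
    # 1. Detection: refuse if the prompt looks like discrete-optimizer output.
    if is_high_perplexity(prompt):
        return "Request declined by the perplexity filter."
    # 2. Preprocessing: paraphrase so that brittle adversarial suffixes are
    #    unlikely to survive the rewording.
    paraphrased = generate("Paraphrase the following text:\n" + prompt)
    # 3. Answer the paraphrased prompt with the (possibly adversarially
    #    trained) aligned model.
    return generate(paraphrased)
```

In the gray-box setting the attacker cannot optimize directly against the filter or paraphraser; in the white-box setting these components themselves become part of the attack surface, which is where the cost of adaptive optimization bites.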
Where Pith is reading between the lines
- Simple input filtering could become a standard first line of defense for deployed LLMs.
- Research into better discrete optimizers might close the current gap and require stronger defenses.
- These findings suggest future security evaluations should weigh attack cost and efficiency, not just success rate.
- Testing on a wider range of models could reveal if the defense effectiveness generalizes.
Load-bearing premise
The specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.
What would settle it
Demonstrating a low-cost, high-success discrete optimizer that consistently bypasses the tested defenses on current aligned models would falsify the central claim about attack difficulty.
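Any such falsification attempt would have to fold the defenses into the attack objective. Below is a minimal sketch of what that objective could look like for a perplexity-filter-aware attack; the fluency penalty and its weight are assumptions for illustration, not a method from the paper.

```python
# Sketch of an adaptive attack objective for a HuggingFace-style causal LM:
# the usual target-forcing loss plus a penalty that keeps the adversarial
# suffix fluent enough to pass a perplexity filter. A discrete optimizer
# (e.g. greedy coordinate search over suffix tokens) would minimize this.
import torch

def adaptive_loss(model, prompt_ids, suffix_ids, target_ids, lam: float = 0.1):
    """prompt_ids, suffix_ids, target_ids: 1-D LongTensors of token ids."""
    input_ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    logprobs = torch.log_softmax(model(input_ids).logits[0], dim=-1)

    # Negative log-likelihood of the harmful target continuation
    # (the logit at position i predicts the token at position i + 1).
    t0 = prompt_ids.numel() + suffix_ids.numel()
    tgt_lp = logprobs[t0 - 1 : t0 - 1 + target_ids.numel()]
    target_nll = -tgt_lp.gather(1, target_ids.unsqueeze(1)).mean()

    # Fluency penalty: average NLL of the suffix under the same model,
    # standing in for the perplexity score a filter would compute.
    s0 = prompt_ids.numel()
    sfx_lp = logprobs[s0 - 1 : s0 - 1 + suffix_ids.numel()]
    suffix_nll = -sfx_lp.gather(1, suffix_ids.unsqueeze(1)).mean()

    return target_nll + lam * suffix_nll
```

The abstract's open question is whether any discrete optimizer can drive both terms down at acceptable cost; its pessimistic reading is that existing optimizers struggle with exactly this kind of constrained search.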
read the original abstract
As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLMs domain than it has been in computer vision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates baseline defenses (perplexity-based detection, paraphrase and retokenization preprocessing, and adversarial training) against adversarial jailbreaking attacks on aligned LLMs. It considers white-box and gray-box threat models, discusses robustness-performance trade-offs, and concludes that the weakness of existing discrete text optimizers combined with high optimization costs renders standard adaptive attacks more challenging for LLMs than in computer vision.
Significance. If the empirical results hold under properly adaptive attacks, the work provides a useful initial map of feasible defenses and highlights domain-specific differences from vision-based adversarial ML. The focus on practical threat models and the call for stronger optimizers or more robust filtering could usefully guide follow-up research.
major comments (2)
- [§4] §4 (Experimental Evaluation): The central claim that 'the weakness of existing discrete optimizers for text... makes standard adaptive attacks more challenging' requires that the reported attacks were adapted to each defense (e.g., by placing the perplexity filter, paraphrase step, or retokenization inside the discrete search loop or via a surrogate). The methodology description does not specify this; if attacks were optimized only against the undefended model and then transferred, the observed robustness is consistent with non-adaptation rather than inherent optimizer limits.
- [§4.1] §4.1 and associated tables: No error bars, number of random seeds, or statistical significance tests are reported for attack success rates. Given the stochastic nature of both the optimizer and the LLM sampling, single-run numbers are insufficient to support claims about relative defense strength or the robustness-performance trade-off.
minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., attack success rates under the strongest defense) rather than remaining purely qualitative.
- [§3] Notation for threat models (white-box vs. gray-box) is introduced but not consistently used when presenting results; a table mapping each defense to its feasible threat model would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on threat models and statistical reporting.
read point-by-point responses
-
Referee: [§4] §4 (Experimental Evaluation): The central claim that 'the weakness of existing discrete optimizers for text... makes standard adaptive attacks more challenging' requires that the reported attacks were adapted to each defense (e.g., by placing the perplexity filter, paraphrase step, or retokenization inside the discrete search loop or via a surrogate). The methodology description does not specify this; if attacks were optimized only against the undefended model and then transferred, the observed robustness is consistent with non-adaptation rather than inherent optimizer limits.
Authors: We acknowledge that our attack optimizations were performed against the undefended base model, with defenses applied only at evaluation time for detection and preprocessing methods (adversarial training was incorporated during optimization). This setup demonstrates transferability of attacks rather than full adaptivity. We agree this means the results do not conclusively prove inherent optimizer limits against adaptive attacks. We will revise §4 to explicitly describe the methodology and threat models, adjust the central claim to focus on the observed challenges with transferred attacks and high optimization costs, and add discussion of why full adaptive attacks (with defenses in the loop) remain computationally difficult. revision: partial
-
Referee: [§4.1] §4.1 and associated tables: No error bars, number of random seeds, or statistical significance tests are reported for attack success rates. Given the stochastic nature of both the optimizer and the LLM sampling, single-run numbers are insufficient to support claims about relative defense strength or the robustness-performance trade-off.
Authors: We agree that variability reporting is needed given the stochastic elements. Experiments used a single fixed seed per configuration for reproducibility, without multiple independent runs due to high computational costs of discrete optimization. In the revision we will explicitly state the number of seeds (one), add a limitations discussion, include error bars or variability notes where preliminary multi-run data exists, and perform basic statistical comparisons on the reported success rates. revision: yes
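As a pointer to what minimal variability reporting could look like, a Wilson score interval over n attack prompts is cheap to add even with a single seed; this is an illustrative sketch, not something the paper commits to.

```python
# 95% Wilson score confidence interval for an attack success rate measured
# as k jailbreaks out of n prompts. Illustrative only.
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n == 0:
        raise ValueError("need at least one trial")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# e.g. 12 jailbreaks out of 100 attack prompts -> roughly (0.07, 0.20)
low, high = wilson_interval(12, 100)
```

This captures binomial uncertainty across prompts but not the seed-to-seed variance of the optimizer, which is the part that genuinely requires extra compute.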
Circularity Check
No circularity: empirical evaluation with independent experimental comparisons
full rationale
The paper presents an empirical study of baseline defenses (perplexity detection, paraphrase/retokenization preprocessing, adversarial training) against text-based adversarial attacks on LLMs. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. Central claims rest on reported attack success rates and robustness-performance trade-offs from experiments in white-box/gray-box settings. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes are present. The finding on discrete optimizer weakness follows directly from the observed experimental outcomes rather than reducing to any input by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Threat models and attack methods from computer vision transfer meaningfully to discrete text optimization in LLMs.
Forward citations
Cited by 24 Pith papers
-
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
-
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2....
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
-
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
-
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
-
A Sentence Relation-Based Approach to Sanitizing Malicious Instructions
SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.
-
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems
ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...
-
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models...
-
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
-
An AI Agent Execution Environment to Safeguard User Data
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
-
How Adversarial Environments Mislead Agentic AI?
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
-
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
-
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
-
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Reference graph
Works this paper leans on
-
[1]
Obfuscated Gradients Give a False Sense of Security : Circumventing Defenses to Adversarial Examples
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated Gradients Give a False Sense of Security : Circumventing Defenses to Adversarial Examples . In Proceedings of the 35th International Conference on Machine Learning , pp.\ 274--283. PMLR , July 2018. URL https://proceedings.mlr.press/v80/athalye18a.html
work page 2018
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022 a
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022 b
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Enhancing robustness of machine learning systems via data transformations
Arjun Nitin Bhagoji, Daniel Cullina, Chawin Sitawarin, and Prateek Mittal. Enhancing robustness of machine learning systems via data transformations. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pp.\ 1--5. IEEE, 2018
work page 2018
-
[5]
Adversarial Examples Are Not Easily Detected : Bypassing Ten Detection Methods
Nicholas Carlini and David Wagner. Adversarial Examples Are Not Easily Detected : Bypassing Ten Detection Methods . In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security , AISec '17, pp.\ 3--14, New York, NY, USA , November 2017. Association for Computing Machinery . ISBN 978-1-4503-5202-4. doi:10.1145/3128572.3140444. URL https:...
-
[6]
On Evaluating Adversarial Robustness
Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On Evaluating Adversarial Robustness . arxiv:1902.06705[cs, stat], February 2019. doi:10.48550/arXiv.1902.06705. URL http://arxiv.org/abs/1902.06705
-
[7]
Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo , Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? arxiv:2306.15447[cs], June 2023. doi:10.48550/arXiv.2306.15447. URL http://arxiv.org/abs/2306.15447
-
[8]
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, and Dylan Hadfield-Menell . Explore, Establish , Exploit : Red Teaming Language Models from Scratch . arxiv:2306.09442[cs], June 2023. doi:10.48550/arXiv.2306.09442. URL http://arxiv.org/abs/2306.09442
-
[9]
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/
work page 2023
-
[10]
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Alpacafarm: A simulation framework for methods that learn from human feedback
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023
-
[12]
HotFlip: White-box Adversarial Examples for Text Classification
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.\ 31--36, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-2006. UR...
-
[13]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
Black-box generation of adversarial text sequences to evade deep learning classifiers
Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp.\ 50--56. IEEE, 2018
work page 2018
-
[15]
On the Limitations of Stochastic Pre-processing Defenses
Yue Gao, I. Shumailov, Kassem Fawaz, and Nicolas Papernot. On the Limitations of Stochastic Pre-processing Defenses . In Advances in Neural Information Processing Systems , volume 35, pp.\ 24280--24294, December 2022. URL https://proceedings.neurips.cc/paper\_files/paper/2022/hash/997089469acbeb410405e43f0011be1f-Abstract-Conference.html
-
[16]
Breaking certified defenses: Semantic adversarial examples with spoofed robustness certificates
Amin Ghiasi, Ali Shafahi, and Tom Goldstein. Breaking certified defenses: Semantic adversarial examples with spoofed robustness certificates. In International Conference on Learning Representations, 2019
work page 2019
-
[17]
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022
work page internal anchor Pith review arXiv 2022
-
[18]
Adversarial attacks on machine learning systems for high-frequency trading
Micah Goldblum, Avi Schwarzschild, Ankit Patel, and Tom Goldstein. Adversarial attacks on machine learning systems for high-frequency trading. In Proceedings of the Second ACM International Conference on AI in Finance, pp.\ 1--9, 2021
work page 2021
-
[19]
Explaining and Harnessing Adversarial Examples
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[20]
Not What You've Signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arxiv:2302.12173[cs], May 2023. doi:10.48550/arXiv.2302.12173. URL http://arxiv.org/abs/2302.12173
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.12173 2023
-
[21]
On the ( Statistical ) Detection of Adversarial Examples
Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the ( Statistical ) Detection of Adversarial Examples . arxiv:1702.06280[cs, stat], October 2017. doi:10.48550/arXiv.1702.06280. URL http://arxiv.org/abs/1702.06280
-
[22]
Towards deep neural network architectures robust to adversarial examples
Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014
-
[23]
Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. Gradient-based Adversarial Attacks against Text Transformers. arxiv:2104.13733[cs], April 2021. doi:10.48550/arXiv.2104.13733. URL http://arxiv.org/abs/2104.13733
-
[24]
Unsolved Problems in ML Safety
Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved Problems in ML Safety. arxiv:2109.13916[cs], June 2022. doi:10.48550/arXiv.2109.13916. URL http://arxiv.org/abs/2109.13916
-
[25]
Semantic Adversarial Examples
Hossein Hosseini and Radha Poovendran. Semantic adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.\ 1614--1619, 2018
work page 2018
-
[26]
Bring your own data! self-supervised evaluation for large language models
Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Bring your own data! self-supervised evaluation for large language models. arXiv preprint arXiv:2306.13651, 2023
-
[27]
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs : Dual-Use Through Standard Security Attacks . arxiv:2302.05733[cs], February 2023. doi:10.48550/arXiv.2302.05733. URL http://arxiv.org/abs/2302.05733
-
[28]
On the Reliability of Watermarks for Large Language Models
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the reliability of watermarks for large language models. arXiv preprint arXiv:2306.04634, 2023
-
[29]
Pretraining language models with human preferences
Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In International Conference on Machine Learning, pp.\ 17506--17533. PMLR, 2023
work page 2023
-
[30]
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 66--75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1007. URL https://a...
-
[31]
Multi-step jailbreaking privacy attacks on chatgpt
Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT . arxiv:2304.05197[cs], May 2023. doi:10.48550/arXiv.2304.05197. URL http://arxiv.org/abs/2304.05197
-
[32]
TextBugger: Generating Adversarial Text Against Real-world Applications
Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018
work page Pith review arXiv 2018
-
[33]
BERT - ATTACK : Adversarial attack against BERT using BERT
Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT - ATTACK : Adversarial attack against BERT using BERT . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 6193--6202, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.500. URL https:...
-
[34]
Towards Deep Learning Models Resistant to Adversarial Attacks
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[35]
FLIRT : Feedback Loop In-context Red Teaming
Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, and Rahul Gupta. FLIRT : Feedback Loop In-context Red Teaming . arxiv:2308.04265[cs], August 2023. doi:10.48550/arXiv.2308.04265. URL http://arxiv.org/abs/2308.04265
-
[36]
Magnet: a two-pronged defense against adversarial examples
Dongyu Meng and Hao Chen. Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp.\ 135--147, 2017 a
work page 2017
-
[37]
MagNet : A Two-Pronged Defense against Adversarial Examples
Dongyu Meng and Hao Chen. MagNet : A Two-Pronged Defense against Adversarial Examples . In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security , CCS '17, pp.\ 135--147, New York, NY, USA , October 2017 b . Association for Computing Machinery . ISBN 978-1-4503-4946-8. doi:10.1145/3133956.3134057. URL https://dl.acm.org/doi...
-
[38]
On detecting adversarial perturbations
Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017
-
[39]
Randomized smoothing with masked inference for adversarially robust text classifications
Han Cheol Moon, Shafiq Joty, Ruochen Zhao, Megh Thakkar, and Xu Chi. Randomized smoothing with masked inference for adversarially robust text classifications. arXiv preprint arXiv:2305.06522, 2023
-
[40]
Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp
John X. Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. TextAttack : A Framework for Adversarial Attacks , Data Augmentation , and Adversarial Training in NLP . arXiv:2005.05909 [cs], October 2020. URL http://arxiv.org/abs/2005.05909. arXiv: 2005.05909
-
[41]
Diffusion Models for Adversarial Purification
Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Animashree Anandkumar. Diffusion Models for Adversarial Purification . In Proceedings of the 39th International Conference on Machine Learning , pp.\ 16805--16827. PMLR , June 2022. URL https://proceedings.mlr.press/v162/nie22a.html
work page 2022
-
[42]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730--27744, 2022
work page 2022
-
[43]
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023. URL https://arxiv.org/abs/2306.01116
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. arxiv:2211.09527[cs], November 2022. doi:10.48550/arXiv.2211.09527. URL http://arxiv.org/abs/2211.09527
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.09527 2022
-
[45]
Bpe-dropout: Simple and effective subword regularization
Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. Bpe-dropout: Simple and effective subword regularization. arXiv preprint arXiv:1910.13267, 2019
-
[46]
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual Adversarial Examples Jailbreak Aligned Large Language Models . In The Second Workshop on New Frontiers in Adversarial Machine Learning , August 2023. URL https://openreview.net/forum?id=cZ4j7L6oui
work page 2023
-
[47]
Data Augmentation Can Improve Robustness
Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, and Timothy A Mann. Data Augmentation Can Improve Robustness . In Advances in Neural Information Processing Systems , volume 34, pp.\ 29935--29948. Curran Associates, Inc. , 2021. URL https://proceedings.neurips.cc/paper/2021/hash/fb4c48608ce8825b558ccf07169a3421-Abst...
work page 2021
-
[48]
Defense-gan: Protecting classifiers against adversarial attacks using generative models
Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018
-
[49]
Adversarial Training for Free!
Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! In Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc. , 2019. URL https://proceedings.neurips.cc/paper\_files/paper/2019/hash/7503cfacd12053d309b6be...
work page 2019
-
[50]
"Do Anything Now": Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023
-
[51]
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020
-
[52]
Intriguing properties of neural networks
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
- [53]
-
[54]
Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs
MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. URL www.mosaicml.com/blog/mpt-7b
work page 2023
-
[55]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
Detecting Adversarial Examples Is ( Nearly ) As Hard As Classifying Them
Florian Tramer. Detecting Adversarial Examples Is ( Nearly ) As Hard As Classifying Them . In Proceedings of the 39th International Conference on Machine Learning , pp.\ 21692--21702. PMLR , June 2022. URL https://proceedings.mlr.press/v162/tramer22a.html
work page 2022
-
[57]
On Adaptive Attacks to Adversarial Example Defenses
Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On Adaptive Attacks to Adversarial Example Defenses . In Advances in Neural Information Processing Systems , volume 33, pp.\ 1633--1645. Curran Associates, Inc. , 2020. URL https://proceedings.neurips.cc/paper/2020/hash/11f38f8ecd71867b42433548d1078e38-Abstract.html
work page 2020
-
[58]
Universal adversarial triggers for attacking and analyzing NLP
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.\ 2153--2162, Hong Kong, China, Novem...
-
[59]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[60]
Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery, 2023
Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard Prompts Made Easy : Gradient-Based Discrete Optimization for Prompt Tuning and Discovery . arXiv preprint arXiv:2302.03668, February 2023. URL https://arxiv.org/abs/2302.03668v1
-
[61]
Making an invisibility cloak: Real world adversarial attacks on object detectors
Zuxuan Wu, Ser-Nam Lim, Larry S Davis, and Tom Goldstein. Making an invisibility cloak: Real world adversarial attacks on object detectors. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part IV 16, pp.\ 1--17. Springer, 2020
work page 2020
-
[62]
Adversarial examples: Attacks and defenses for deep learning
Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems, 30(9): 2805--2824, 2019
work page 2019
-
[63]
Certified robustness for large language models with self-denoising
Zhen Zhang, Guanhua Zhang, Bairu Hou, Wenqi Fan, Qing Li, Sijia Liu, Yang Zhang, and Shiyu Chang. Certified robustness for large language models with self-denoising. arXiv preprint arXiv:2307.07171, 2023
-
[64]
Freelb: Enhanced adversarial training for natural language understanding
Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. arXiv preprint arXiv:1909.11764, 2019
-
[65]
PromptBench : Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Neil Zhenqiang Gong, Yue Zhang, and Xing Xie. PromptBench : Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts . arxiv:2306.04528[cs], June 2023. doi:10.48550/arXiv.2306.04528. URL http://arxiv.org/abs/2306.04528
-
[66]
Increasing Confidence in Adversarial Robustness Evaluations
Roland S. Zimmermann, Wieland Brendel, Florian Tramer, and Nicholas Carlini. Increasing Confidence in Adversarial Robustness Evaluations . In Advances in Neural Information Processing Systems , volume 35, pp.\ 13174--13189, December 2022. URL https://proceedings.neurips.cc/paper\_files/paper/2022/hash/5545d9bcefb7d03d5ad39a905d14fbe3-Abstract-Conference.html
work page 2022
-
[67]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models . arxiv:2307.15043[cs], July 2023. doi:10.48550/arXiv.2307.15043. URL http://arxiv.org/abs/2307.15043
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.15043 2023