Universal and Transferable Adversarial Attacks on Aligned Language Models
Pith reviewed 2026-05-24 07:39 UTC · model grok-4.3
The pith
An automatically found adversarial suffix transfers to make aligned LLMs including ChatGPT generate objectionable content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that an adversarial suffix trained on multiple objectionable-content prompts and on Vicuna-7B plus Vicuna-13B, using a combination of greedy and gradient-based search to maximize the probability of an affirmative response, induces objectionable content across a wide range of aligned models including the black-box public interfaces to ChatGPT, Bard, and Claude as well as open models such as LLaMA-2-Chat, Pythia, and Falcon.
What carries the argument
The adversarial attack suffix, a string optimized by greedy and gradient-based search to maximize affirmative-response probability on objectionable queries.
If this is right
- Aligned language models can be induced to produce objectionable content by an automatically generated suffix without manual prompt engineering.
- The same suffix works on black-box public interfaces of models not used in training.
- Current alignment procedures in both open and closed models remain vulnerable to this form of attack.
- The approach raises concrete questions about how to prevent aligned systems from generating disallowed information.
Where Pith is reading between the lines
- Alignment training may need to include exposure to suffixes discovered by gradient search on other models.
- Safety benchmarks for new models should test transfer from publicly available open models.
- Defenses could be evaluated by checking whether they block suffixes found on Vicuna-scale models.
Load-bearing premise
The optimization finds suffixes whose success on new prompts and new models is not simply overfitting to the training prompts and the two Vicuna models used during search.
What would settle it
Applying the published suffix to a fresh set of prompts or to a model never seen during training and observing that it no longer elicits objectionable content would falsify the transferability claim.
Figures
read the original abstract
Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an automated attack that uses a combination of greedy and gradient-based search to discover a fixed adversarial suffix; when this suffix is appended to a wide range of user queries that request objectionable content, the probability that an aligned LLM produces an affirmative (rather than refusing) response is maximized. The central empirical claim is that a single suffix trained on multiple objectionable prompts and on only Vicuna-7B/13B transfers to held-out prompts and to a broad set of unseen models, including the public interfaces of ChatGPT, Bard, and Claude as well as LLaMA-2-Chat, Pythia, Falcon and others.
Significance. If the transfer results hold under more rigorous controls, the work would constitute a substantial advance in the empirical study of LLM alignment robustness by showing that a simple, fully automatic procedure can produce suffixes that are effective against both open-weight and closed commercial models. The public code release is a clear strength that enables direct verification and extension.
major comments (2)
- [§4] §4 (Experiments), transfer tables: the reported success rates on held-out prompts and on ChatGPT/Bard/Claude are not accompanied by an ablation that trains a control suffix on a disjoint set of non-objectionable or unrelated prompts; without this control it remains possible that the observed transfer is explained by the suffix learning low-level affirmative-token patterns that happen to be shared across the training distribution and the target models rather than a general attack mechanism.
- [§4 and Appendix] Evaluation protocol (throughout §4 and Appendix): the manuscript does not specify how 'objectionable' model outputs were labeled (keyword matching, human raters, or automated classifier) nor reports inter-rater agreement; this leaves the quantitative success rates open to evaluation subjectivity and makes it difficult to assess whether the transfer numbers are robust to alternative labeling schemes.
minor comments (2)
- [§3] The description of the combined greedy+gradient search procedure would benefit from an explicit pseudocode listing of the inner loop (including how the top-k candidates are selected at each greedy step).
- [Figure 1] Figure 1 and the associated text should clarify whether the visualized suffixes are the final multi-prompt, multi-model suffixes or single-prompt examples.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive feedback. We address the two major comments point-by-point below. Both points identify areas where additional controls and clarification will strengthen the manuscript, and we will revise accordingly.
read point-by-point responses
-
Referee: [§4] §4 (Experiments), transfer tables: the reported success rates on held-out prompts and on ChatGPT/Bard/Claude are not accompanied by an ablation that trains a control suffix on a disjoint set of non-objectionable or unrelated prompts; without this control it remains possible that the observed transfer is explained by the suffix learning low-level affirmative-token patterns that happen to be shared across the training distribution and the target models rather than a general attack mechanism.
Authors: We agree that an ablation training a control suffix on non-objectionable prompts would help isolate whether the learned suffix exploits a general attack mechanism versus low-level affirmative-token biases. In the revised manuscript we will add this control experiment, training a suffix on an equal number of benign prompts and reporting its transfer success rates on the same held-out objectionable queries and target models. This will allow direct comparison with the main results. revision: yes
-
Referee: [§4 and Appendix] Evaluation protocol (throughout §4 and Appendix): the manuscript does not specify how 'objectionable' model outputs were labeled (keyword matching, human raters, or automated classifier) nor reports inter-rater agreement; this leaves the quantitative success rates open to evaluation subjectivity and makes it difficult to assess whether the transfer numbers are robust to alternative labeling schemes.
Authors: The referee correctly notes that the labeling procedure is not described in sufficient detail. In the revision we will add an explicit subsection in §4 and the appendix describing the evaluation protocol: refusal detection begins with a fixed set of refusal keywords, followed by manual review of borderline cases by two authors with reported inter-rater agreement (Cohen’s kappa). We will also release the exact keyword list and annotation guidelines alongside the existing code. revision: yes
Circularity Check
No circularity; empirical method with external validation
full rationale
The paper describes an empirical procedure that optimizes an adversarial suffix via greedy and gradient search on a set of prompts and Vicuna models, then evaluates transfer on disjoint prompts and entirely separate models (ChatGPT, Bard, Claude, LLaMA-2, etc.). No equation, loss term, or claimed derivation reduces the reported success rate to a parameter fitted from the same data by construction, nor does any load-bearing step rest on a self-citation chain whose validity is presupposed. The central result is therefore an external empirical demonstration rather than a self-referential identity.
Axiom & Free-Parameter Ledger
free parameters (1)
- search hyperparameters (greedy steps, gradient steps, suffix length)
axioms (1)
- domain assumption The target model's output distribution provides a differentiable signal that can be used to increase probability of affirmative continuation.
Forward citations
Cited by 60 Pith papers
-
Who Owns This Agent? Tracing AI Agents Back to Their Owners
A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
-
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
-
Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing
A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution
JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
-
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and ...
-
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
-
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...
-
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
-
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
-
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
-
Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis
Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.
-
Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.
-
A First Look at the Security Issues in the Model Context Protocol Ecosystem
Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious d...
-
Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem
This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers r...
-
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
-
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
-
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...
-
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
-
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
-
Codec-Robust Attacks on Audio LLMs
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
-
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, a...
-
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes
CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexit...
-
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
-
Measuring Safety Alignment Effects in Autonomous Security Agents
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security...
-
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the ...
-
A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
-
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
-
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-depend...
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
-
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success
Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
-
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...
-
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
LLMs suppress factual corrections in task contexts despite internal knowledge of errors, with two training-free interventions shown to increase correction rates substantially.
-
Stego Battlefield: Evaluating Image Steganography Attacks and Steganalysis Defenses
SADBench is a new benchmark that systematically tests steganography attacks with harmful image and text payloads against steganalysis defenses, revealing stable attack methods, near-perfect in-domain detection, transf...
-
On the Hardness of Junking LLMs
Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.
-
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
-
Self-Mined Hardness for Safety Fine-Tuning
Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially ...
-
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
-
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates su...
Reference graph
Works this paper leans on
-
[1]
Generating Natural Language Adversarial Examples
URL https://openreview.net/forum?id=l5aSHXi8jG5. Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 , 2022a. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Evasion attacks against machine learning at test time
Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim ˇSrndi´ c, Pavel Laskov, Gior- gio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Ma- chine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13 , pa...
work page 2013
-
[4]
Adversarial examples are not easily detected: Bypassing ten detection methods
Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM workshop on artificial intelligence and security, pages 3–14, 2017a. Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks, 2017b. Nicholas Carlini, Milad Nasr, Christopher A Cho...
-
[5]
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
HotFlip: White-Box Adversarial Examples for Text Classification
22 Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Explaining and Harnessing Adversarial Examples
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Gradient-based adversarial attacks against text transform- ers.arXiv:2104.13733,
Chuan Guo, Alexandre Sablayrolles, Herv´ e J´ egou, and Douwe Kiela. Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733 ,
-
[10]
Adversarial Examples for Evaluating Reading Comprehension Systems
Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Automatically auditing large language models via discrete optimization
Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381 ,
-
[12]
Scalable agent alignment via reward modeling: a research direction
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Sok: Certified robustness for deep neural networks
Linyi Li, Tao Xie, and Bo Li. Sok: Certified robustness for deep neural networks. In 2023 IEEE Symposium on Security and Privacy (SP) ,
work page 2023
-
[15]
Exploring targeted universal adversarial perturbations to end-to-end asr models
Zhiyun Lu, Wei Han, Yu Zhang, and Liangliang Cao. Exploring targeted universal adversarial perturbations to end-to-end asr models. arXiv preprint arXiv:2104.02757 ,
-
[16]
Natalie Maus, Patrick Chao, Eric Wong, and Jacob Gardner
URL https://openreview.net/forum?id=rJzIBfZAb. Natalie Maus, Patrick Chao, Eric Wong, and Jacob Gardner. Adversarial prompting for black box foundation models. arXiv preprint arXiv:2302.04237 ,
-
[17]
Universal adversarial perturbations for speech recognition systems
Paarth Neekhara, Shehzeen Hussain, Prakhar Pandey, Shlomo Dubnov, Julian McAuley, and Fari- naz Koushanfar. Universal adversarial perturbations for speech recognition systems. arXiv preprint arXiv:1905.03828,
-
[18]
Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples
Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277 , 2016a. Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Anan- thram Swami. The limitations of deep learning in adversarial settings...
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[19]
URL https://arxiv.org/abs/2306.01116. Lianhui Qin, Sean Welleck, Daniel Khashabi, and Yejin Choi. Cold decoding: Energy-based con- strained text generation with langevin dynamics. Advances in Neural Information Processing Systems, 35:9538–9551,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980,
-
[21]
Llama 2: Open Foundation and Fine-Tuned Chat Models
URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko- lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foun- dation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 ,
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
The Space of Transferable Adversarial Examples
Florian Tram` er, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Universal adversarial triggers for attacking and analyz- ing nlp.arXiv:1908.07125,
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125 ,
-
[24]
Adversarial glue: A multi-task benchmark for robustness evaluation of language models
24 Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, and Bo Li. Adversarial glue: A multi-task benchmark for robustness evaluation of language models. arXiv preprint arXiv:2111.02840 ,
-
[25]
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery
Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668 ,
-
[27]
Fundamental limitations of alignment in large language models
Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082 ,
-
[28]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Promptbench: Towards evaluating the robustness of large language models on adversarial prompts
Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Neil Zhenqiang Gong, Yue Zhang, et al. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528 ,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.