Detecting Language Model Attacks with Perplexity
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-15 13:55 UTC · model grok-4.3
The pith
Adversarial jailbreak suffixes produce high perplexity under GPT-2, allowing a classifier on perplexity and length to catch most attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that adversarial suffixes produce exceedingly high perplexity values under GPT-2. They demonstrate that while plain perplexity filtering faces significant false positives from varied normal prompts, a Light-GBM classifier trained on perplexity and token length correctly identifies most adversarial attacks in their test set.
What carries the argument
Perplexity score from GPT-2 on the full query, paired with token length as features for a Light-GBM classifier that separates adversarial from normal prompts.
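A minimal sketch of that pipeline, assuming GPT-2 via the Hugging Face transformers library and the lightgbm Python package; the variable names, the training prompts/labels, and the classifier settings are illustrative assumptions rather than the authors' code.

```python
# Sketch (not the authors' code): GPT-2 perplexity + token length as the two
# features of a LightGBM classifier separating adversarial from normal prompts.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import lightgbm as lgb

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity_and_length(text: str) -> list[float]:
    """Perplexity of the full query under GPT-2, plus its token length."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    input_ids = enc["input_ids"]
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean next-token cross-entropy.
        loss = model(input_ids, labels=input_ids).loss
    return [math.exp(loss.item()), float(input_ids.shape[1])]

# `train_prompts` and `train_labels` (1 = adversarial, 0 = normal) are assumed to exist.
# features = [perplexity_and_length(p) for p in train_prompts]
# clf = lgb.LGBMClassifier(n_estimators=200)  # hyperparameters are illustrative
# clf.fit(features, train_labels)
```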
If this is right
- Perplexity checks can be inserted as an early filter to block many jailbreak attempts before they reach the target LLM (sketched in code after this list).
- Adding token length to the classifier reduces errors caused by unusual but benign user prompts.
- An open-source model like GPT-2 can act as the detector without any access to the target model's parameters or responses.
- The method limits exposure to prompts that request instructions for explosives, theft, or other harmful content.
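A hedged sketch of that early-filter deployment, reusing perplexity_and_length and clf from the sketch above; call_target_llm and the refusal message are placeholders, not part of the paper.

```python
# Illustrative early-filter wrapper placed in front of a protected model.
def guarded_query(prompt: str, call_target_llm) -> str:
    ppl, n_tokens = perplexity_and_length(prompt)
    if clf.predict([[ppl, n_tokens]])[0] == 1:
        # Flagged as a likely adversarial-suffix query; never reaches the target LLM.
        return "Request blocked by the perplexity/length filter."
    return call_target_llm(prompt)
```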
Where Pith is reading between the lines
- Attackers could eventually discover suffixes that keep perplexity low under GPT-2, requiring retraining or replacement of the detector.
- The same perplexity-plus-length approach might be tested on other open models if GPT-2 loses effectiveness against new attacks.
- Combining the classifier with downstream checks on the model's generated output could raise overall resistance to evolving jailbreaks.
Load-bearing premise
The collection of regular prompts used to measure false positives reflects real-world variety, and future attackers will not adapt their suffixes to produce low perplexity under GPT-2.
What would settle it
Generation of adversarial suffixes that achieve low perplexity under GPT-2 yet still succeed in jailbreaking the target model would show the detection method fails.
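One hedged way to make that settling experiment concrete is to add a GPT-2 perplexity penalty to the usual suffix-optimization objective, so the attacker searches for suffixes that still jailbreak the target while staying below the detector's operating point; the notation and the weight lambda below are illustrative assumptions, not from the paper.

```latex
% Illustrative adaptive-attack objective (assumption, not from the paper):
% s = adversarial suffix, x = harmful request, \oplus = concatenation,
% L_adv = the attack's jailbreak loss on the target model, lambda >= 0 trades
% attack success against stealth under the GPT-2 perplexity detector.
\min_{s}\; \mathcal{L}_{\mathrm{adv}}(x \oplus s)
  \;+\; \lambda \,\log \mathrm{PPL}_{\mathrm{GPT2}}(x \oplus s)
```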
read the original abstract
A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that queries containing adversarial suffixes for LLM jailbreaks exhibit high perplexity under GPT-2, that plain perplexity filtering produces many false positives on diverse normal prompts, and that a Light-GBM classifier using perplexity plus token length resolves those false positives while correctly detecting most attacks in the test set.
Significance. If the detection remains reliable, the approach supplies a lightweight, external-model filter that requires no access to the target LLM and could be deployed as a first-stage guardrail; the empirical separation shown for the tested attacks is a concrete, immediately usable signal.
major comments (2)
- [Experiments] The evaluation only considers the fixed, non-adaptive adversarial suffixes from the source attack papers; no experiments generate or test suffixes that explicitly minimize GPT-2 perplexity (e.g., via token-level gradient search or evolutionary search) while preserving jailbreak success. This is load-bearing for any claim of general detection utility.
- [Abstract and Results] The abstract and results sections report classifier performance but supply no test-set size, exact metrics (precision/recall/AUC), baseline comparisons, or error analysis, leaving the quantitative strength of the separation difficult to evaluate.
minor comments (2)
- [Data] Specify the exact distribution and size of the regular (non-adversarial) prompt corpus used to measure false-positive rates.
- [Methods] Report the Light-GBM hyper-parameters, training/validation split, and feature-importance values to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments, which help improve the clarity and rigor of our work. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Experiments] The evaluation only considers the fixed, non-adaptive adversarial suffixes from the source attack papers; no experiments generate or test suffixes that explicitly minimize GPT-2 perplexity (e.g., via token-level gradient search or evolutionary search) while preserving jailbreak success. This is load-bearing for any claim of general detection utility.
Authors: We agree that evaluating against adaptive attacks designed to minimize GPT-2 perplexity is important for assessing the general utility of the detection method. Our current work focuses on the adversarial suffixes as published in the source papers, which already exhibit high perplexity. We will add a new subsection in the discussion to explicitly acknowledge this limitation and suggest future experiments using optimization techniques like gradient search to generate low-perplexity jailbreaks. We believe the observed separation for existing attacks still demonstrates the potential of perplexity-based detection as an initial filter.
Revision: partial
-
Referee: [Abstract and Results] The abstract and results sections report classifier performance but supply no test-set size, exact metrics (precision/recall/AUC), baseline comparisons, or error analysis, leaving the quantitative strength of the separation difficult to evaluate.
Authors: We apologize for the lack of detailed quantitative reporting. We will update the abstract and results section to include the test-set composition (number of normal and adversarial prompts), the exact performance metrics of the LightGBM classifier (including precision, recall, and AUC), comparisons against the perplexity-only baseline, and an error analysis highlighting the types of false positives encountered with plain perplexity filtering.
Revision: yes
Circularity Check
No significant circularity; empirical pipeline using external model
full rationale
The paper computes perplexity on queries using a fixed external open-source LLM (GPT-2), observes high values for adversarial suffixes, and trains a separate Light-GBM classifier on the resulting perplexity values plus token length. No equations, self-citations, or derivations reduce the detection claim to a fitted parameter or prior result by construction. The approach is a standard data-driven ML pipeline whose central claim rests on observable differences between the tested adversarial and regular prompt distributions rather than any self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- LightGBM hyperparameters
axioms (1)
- domain assumption: Perplexity from GPT-2 reliably distinguishes adversarial suffixes from normal text when combined with length.
Lean theorems connected to this paper
-
Foundation.DAlembert.Inevitability.bilinear_family_forced · tagged unclear
Relation between the paper passage and the cited Recognition theorem:
We use the Greedy Coordinate Gradient (GCG) algorithm described in (Zou et al., 2023). We treat it as a black box
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
-
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
-
Test-Time Safety Alignment
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
-
An AI Agent Execution Environment to Safeguard User Data
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
SIF creates semantically in-distribution fingerprints for LVLMs by distilling text watermarks into visual inputs and optimizing for robustness against detection and modification.
-
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
-
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
-
SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Reference graph
Works this paper leans on
-
[1]
Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...
work page 2022
-
[3]
Boolq: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019
work page 2019
-
[4]
Certified adversarial robustness via randomized smoothing
Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.\ 1310--1320. PMLR, 09--15 Jun 2019. URL https://proceedings.mlr.pre...
work page 2019
-
[5]
Monitor alarm fatigue: an integrative review
Maria Cvach. Monitor alarm fatigue: an integrative review. Biomedical instrumentation & technology, 2012
work page 2012
-
[6]
Improving alignment of dialogue agents via targeted human judgments, 2022
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nich...
work page 2022
-
[7]
Explaining and harnessing adversarial examples
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2015
work page 2015
-
[9]
Unsolved problems in ml safety, 2022
Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ml safety, 2022
work page 2022
-
[10]
Lora: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9
work page 2022
-
[11]
Hugging Face. Perplexity. https://huggingface.co/docs/transformers/perplexity, 2023. Accessed: 2023-08-26
work page 2023
-
[12]
Baseline defenses for adversarial attacks against aligned language models, 2023
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models, 2023
work page 2023
-
[13]
Rubén Darío Jaramillo. chatgpt-jailbreak-prompts (dataset). https://huggingface.co/datasets/rubend18/chatgpt-jailbreak-prompts, 2023. Accessed: 2023-09-20
work page 2023
-
[14]
Automatically auditing large language models via discrete optimization, 2023
Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization, 2023
work page 2023
-
[15]
Open sesame! universal black box jailbreaking of large language models, 2023
Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models, 2023
work page 2023
-
[16]
Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms, 2023
work page 2023
-
[17]
Globally-robust neural networks, 2021
Klas Leino, Zifan Wang, and Matt Fredrikson. Globally-robust neural networks, 2021
work page 2021
-
[18]
Rain: Your language models can align themselves without finetuning, 2023
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. Rain: Your language models can align themselves without finetuning, 2023
work page 2023
-
[19]
Towards deep learning models resistant to adversarial attacks
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb
work page 2018
-
[20]
Tapir: Trigger action platform for information retrieval
Mattia Limone, Gaetano Cimino, and Annunziata Elefante. Tapir: Trigger action platform for information retrieval. https://github.com/MattiaLimone/ifttt_recommendation_system, 2023
work page 2023
-
[21]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018
work page 2018
- [22]
-
[23]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022
work page 2022
-
[24]
Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples, 2016
work page 2016
-
[25]
Language models are unsupervised multitask learners
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019. URL https://api.semanticscholar.org/CorpusID:160025533
work page 2019
-
[26]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv e-prints, arXiv:1606.05250, 2016
work page 2016
-
[27]
Arb: Advanced reasoning benchmark for large language models
Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, and Aran Komatsuzaki. Arb: Advanced reasoning benchmark for large language models, 2023
work page 2023
-
[28]
Autoprompt: Eliciting knowledge from language models with automatically generated prompts
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts, 2020
work page 2020
-
[29]
Llama 2: Open foundation and fine-tuned chat models, 2023
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, and Nikolay Bashlykov. Llama 2: Open foundation and fine-tuned chat models, 2023
work page 2023
-
[31]
Self-instruct: Aligning language models with self-generated instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023
work page 2023
-
[32]
Jailbroken: How does llm safety training fail?, 2023
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?, 2023
work page 2023
-
[33]
Fundamental limitations of alignment in large language models, 2023
Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models, 2023
work page 2023
-
[34]
Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. Dp-gan: Diversity-promoting generative adversarial network for generating informative and diversified text, 2018
work page 2018
-
[36]
Reclor: A reading comprehension dataset requiring logical reasoning
Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations (ICLR), April 2020
work page 2020
-
[37]
Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023
work page 2023
-
[38]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023
work page 2023
-
[39]
Universal and transferable adversarial attacks on aligned language models
Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023
work page 2023
-
[40]
Real-time segmentation of on-line handwritten Arabic script
Real-time segmentation of on-line handwritten Arabic script. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, 2014
work page 2014
- [41]
-
[42]
Fast classification of handwritten on-line Arabic characters
Fast classification of handwritten on-line Arabic characters. In Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of, 2014
work page 2014
-
[43]
Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications. arXiv preprint arXiv:1804.09028
work page 2018
-
[44]
Adversarial Examples Are Not Bugs, They Are Features. 2019
work page 2019
-
[45]
Towards Deep Learning Models Resistant to Adversarial Attacks
Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations
-
[46]
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. 2020
work page 2020
-
[47]
Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023
work page 2023
-
[48]
Automatically Auditing Large Language Models via Discrete Optimization. 2023
work page 2023
-
[49]
Adversarial Attack and Defense of Structured Prediction Models
Han, Wenjuan and Zhang, Liwen and Jiang, Yong and Tu, Kewei. Adversarial Attack and Defense of Structured Prediction Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.182
-
[50]
DP-GAN: Diversity-Promoting Generative Adversarial Network for Generating Informative and Diversified Text. 2018
work page 2018
-
[51]
Certified Adversarial Robustness via Randomized Smoothing
Jeremy Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified Adversarial Robustness via Randomized Smoothing. CoRR, arXiv:1902.02918, 2019
work page 2019
- [52]
-
[53]
Training language models to follow instructions with human feedback. 2022
work page 2022
-
[54]
gpt-xl
-
[55]
perplexity
-
[56]
vicuna7b
-
[57]
Certified Adversarial Robustness via Randomized Smoothing
Certified Adversarial Robustness via Randomized Smoothing. In Proceedings of the 36th International Conference on Machine Learning, 2019
work page 2019
-
[58]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023
work page 2023
-
[59]
Jaramilo:GPT4Jailbreak
Rubén Darío Jaramillo. Jaramilo:GPT4Jailbreak
-
[60]
Zhao:llmOpenDatasets
-
[61]
GitHub repository
Mattia Limone, Gaetano Cimino, and Annunziata Elefante. GitHub repository, 2023
work page 2023
-
[62]
Know What You Don't Know: Unanswerable Questions for SQuAD. 2018
work page 2018
-
[63]
Reclor: A reading comprehension dataset requiring logical reasoning
Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations (ICLR)
-
[64]
TheoremQA: A Theorem-driven Question Answering dataset
TheoremQA: A Theorem-driven Question Answering dataset. arXiv preprint arXiv:2305.12524
-
[65]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In EMNLP
-
[66]
ARB: Advanced Reasoning Benchmark for Large Language Models. arXiv:2307.13692
-
[67]
Platypus: Quick, Cheap, and Powerful Refinement of LLMs. arXiv:2308.07317
-
[68]
Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023
work page 2023
-
[69]
LoRA: Low-Rank Adaptation of Large Language Models
LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations
-
[70]
Fundamental Limitations of Alignment in Large Language Models. 2023
work page 2023
-
[71]
Jailbroken: How Does LLM Safety Training Fail? 2023
work page 2023
- [72]
- [73]
-
[74]
Explaining and Harnessing Adversarial Examples. 2015
work page 2015
-
[75]
Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples. 2016
work page 2016
-
[76]
Open Sesame! Universal Black Box Jailbreaking of Large Language Models. 2023
work page 2023
- [77]
-
[78]
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. 2023
work page 2023
-
[79]
Improving alignment of dialogue agents via targeted human judgments
-
[80]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. 2022
work page 2022
-
[81]
Self-Instruct: Aligning Language Models with Self-Generated Instructions. 2023
work page 2023
-
[82]
Baseline Defenses for Adversarial Attacks Against Aligned Language Models. 2023
work page 2023
-
[83]
DocRED: A Large-Scale Document-Level Relation Extraction Dataset
Yao, Yuan and Ye, Deming and Li, Peng and Han, Xu and Lin, Yankai and Liu, Zhenghao and Liu, Zhiyuan and Huang, Lixin and Zhou, Jie and Sun, Maosong. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1074
-
[84]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In NAACL
discussion (0)