Pith · machine review for the scientific record

arXiv: 2308.14132 · v3 · submitted 2023-08-27 · 💻 cs.CL · cs.AI · cs.CR · cs.LG

Recognition: 2 Lean theorem links

Detecting Language Model Attacks with Perplexity

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CR · cs.LG
keywords: adversarial attacks · perplexity · jailbreaks · language models · detection · GPT-2 · LightGBM · suffix attacks

The pith

Adversarial jailbreak suffixes produce high perplexity under GPT-2, allowing a classifier on perplexity and length to catch most attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that prompts containing adversarial suffixes for jailbreaking LLMs show markedly elevated perplexity when scored by GPT-2. Normal prompts across many styles sometimes match those high scores and create false alarms, so the authors train a Light-GBM model that uses both perplexity and token count to separate the two cases. In their test set this classifier flags the majority of the adversarial examples while keeping false positives low. The approach therefore supplies an early filter that can block harmful queries before they reach the main model. If the pattern holds, it means perplexity measured on an open model gives a practical signal for spotting manipulated inputs.
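
The abstract does not restate the quantity being thresholded; for reference, the perplexity GPT-2 assigns to a tokenized query w_1, ..., w_N is the standard autoregressive definition (a reference formula, not taken from the paper):

```latex
% Perplexity GPT-2 assigns to a query of N tokens; p_theta is the model's
% next-token distribution.
\mathrm{PPL}(w_{1:N}) \;=\; \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}\!\left( w_i \mid w_{<i} \right) \right)
```

An adversarial suffix built from rare, incoherent tokens drives the per-token log-probabilities down and the exponent up, which is the separation the detector exploits.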

Core claim

The authors establish that adversarial suffixes produce exceedingly high perplexity values under GPT-2. They demonstrate that while plain perplexity filtering faces significant false positives from varied normal prompts, a Light-GBM classifier trained on perplexity and token length correctly identifies most adversarial attacks in their test set.
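
The scoring step is straightforward to reproduce in outline; a minimal sketch, assuming the Hugging Face transformers GPT-2 checkpoint (the helper name and example prompt are illustrative, not the authors' code):

```python
# Minimal sketch (not the authors' released code): score a prompt's perplexity
# with GPT-2, i.e. exp(mean negative log-likelihood) of its tokens.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def gpt2_perplexity(text: str) -> tuple[float, int]:
    """Return (perplexity, token_count) for a prompt under GPT-2."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # With labels=input_ids the model returns the mean cross-entropy
        # over next-token predictions.
        loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item()), input_ids.size(1)

ppl, n_tokens = gpt2_perplexity("Write a short poem about autumn leaves.")
print(ppl, n_tokens)  # an adversarial suffix would score far higher than ordinary prose
```

Returning the token count alongside the score matches the two features the classifier described below consumes.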

What carries the argument

The GPT-2 perplexity of the full query, paired with its token length, as the two features of a LightGBM classifier that separates adversarial from normal prompts.
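
The abstract reports neither the corpus composition nor the classifier settings; a hedged sketch of the two-feature LightGBM setup, with synthetic placeholder data standing in for the real prompt and attack sets and default hyper-parameters:

```python
# Sketch of the two-feature detector: LightGBM trained on (perplexity, token count).
# The data below is synthetic and only illustrates the shapes; it is not the
# authors' corpus, and the hyper-parameters are library defaults.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features: regular prompts (label 0) vs. adversarial-suffix prompts (label 1).
regular = np.column_stack([rng.lognormal(3.5, 0.6, 500),   # perplexity
                           rng.integers(5, 200, 500)])     # token count
attacks = np.column_stack([rng.lognormal(7.0, 0.5, 500),
                           rng.integers(80, 300, 500)])
X = np.vstack([regular, attacks]).astype(float)
y = np.concatenate([np.zeros(500), np.ones(500)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = lgb.LGBMClassifier()  # defaults; the paper's tuning is not given in the abstract
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```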

If this is right

  • Perplexity checks can be inserted as an early filter to block many jailbreak attempts before they reach the target LLM (a deployment sketch follows this list).
  • Adding token length to the classifier reduces errors caused by unusual but benign user prompts.
  • An open-source model like GPT-2 can act as the detector without any access to the target model's parameters or responses.
  • The method limits exposure to prompts that request instructions for explosives, theft, or other harmful content.
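
A deployment sketch for the early-filter idea in the first bullet, assuming the scorer and classifier from the sketches above and a hypothetical target_llm callable standing in for the protected model:

```python
# Sketch of an early filter in front of the target LLM. `score_fn` is a GPT-2
# perplexity scorer returning (perplexity, token_count), `detector` a fitted
# two-feature classifier, and `target_llm` a hypothetical downstream call.
from typing import Callable

def guarded_query(prompt: str,
                  score_fn: Callable[[str], tuple[float, int]],
                  detector,
                  target_llm: Callable[[str], str]) -> str:
    ppl, n_tokens = score_fn(prompt)
    if detector.predict([[ppl, n_tokens]])[0] == 1:
        return "Request blocked by the perplexity filter."
    return target_llm(prompt)
```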

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attackers could eventually discover suffixes that keep perplexity low under GPT-2, requiring retraining or replacement of the detector.
  • The same perplexity-plus-length approach might be tested on other open models if GPT-2 loses effectiveness against new attacks.
  • Combining the classifier with downstream checks on the model's generated output could raise overall resistance to evolving jailbreaks.

Load-bearing premise

The collection of regular prompts used to measure false positives reflects real-world variety, and future attackers will not adapt their suffixes to produce low perplexity under GPT-2.

What would settle it

Generation of adversarial suffixes that achieve low perplexity under GPT-2 yet still succeed in jailbreaking the target model would show the detection method fails.
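
The paper does not formalize such an adaptive attack; one hedged way to write the objective an attacker would need to solve (notation ours, not the authors') combines the jailbreak loss on the target model with a penalty on detector perplexity:

```latex
% Sketch only: s is the adversarial suffix appended to harmful request x,
% L_atk the attack loss on the target model, PPL the detector's GPT-2
% perplexity, and lambda trades attack success against detectability.
\min_{s} \; \mathcal{L}_{\mathrm{atk}}(x \oplus s) \;+\; \lambda \, \log \mathrm{PPL}_{\mathrm{GPT2}}(x \oplus s)
```

Suffixes found under such an objective that still jailbreak the target while scoring near normal-prompt perplexity would directly falsify the load-bearing premise above.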

Original abstract

A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that queries containing adversarial suffixes for LLM jailbreaks exhibit high perplexity under GPT-2, that plain perplexity filtering produces many false positives on diverse normal prompts, and that a Light-GBM classifier using perplexity plus token length resolves those false positives while correctly detecting most attacks in the test set.

Significance. If the detection remains reliable, the approach supplies a lightweight, external-model filter that requires no access to the target LLM and could be deployed as a first-stage guardrail; the empirical separation shown for the tested attacks is a concrete, immediately usable signal.

major comments (2)
  1. [Experiments] The evaluation only considers the fixed, non-adaptive adversarial suffixes from the source attack papers; no experiments generate or test suffixes that explicitly minimize GPT-2 perplexity (e.g., via token-level gradient search or evolutionary search) while preserving jailbreak success. This is load-bearing for any claim of general detection utility.
  2. [Abstract and Results] The abstract and results sections report classifier performance but supply no test-set size, exact metrics (precision/recall/AUC), baseline comparisons, or error analysis, leaving the quantitative strength of the separation difficult to evaluate.
minor comments (2)
  1. [Data] Specify the exact distribution and size of the regular (non-adversarial) prompt corpus used to measure false-positive rates.
  2. [Methods] Report the Light-GBM hyper-parameters, training/validation split, and feature-importance values to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which help improve the clarity and rigor of our work. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Experiments] The evaluation only considers the fixed, non-adaptive adversarial suffixes from the source attack papers; no experiments generate or test suffixes that explicitly minimize GPT-2 perplexity (e.g., via token-level gradient search or evolutionary search) while preserving jailbreak success. This is load-bearing for any claim of general detection utility.

    Authors: We agree that evaluating against adaptive attacks designed to minimize GPT-2 perplexity is important for assessing the general utility of the detection method. Our current work focuses on the adversarial suffixes as published in the source papers, which already exhibit high perplexity. We will add a new subsection in the discussion to explicitly acknowledge this limitation and suggest future experiments using optimization techniques like gradient search to generate low-perplexity jailbreaks. We believe the observed separation for existing attacks still demonstrates the potential of perplexity-based detection as an initial filter. revision: partial

  2. Referee: [Abstract and Results] The abstract and results sections report classifier performance but supply no test-set size, exact metrics (precision/recall/AUC), baseline comparisons, or error analysis, leaving the quantitative strength of the separation difficult to evaluate.

    Authors: We apologize for the lack of detailed quantitative reporting. We will update the abstract and results section to include the test-set composition (number of normal and adversarial prompts), the exact performance metrics of the LightGBM classifier (including precision, recall, and AUC), comparisons against the perplexity-only baseline, and an error analysis highlighting the types of false positives encountered with plain perplexity filtering. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline using external model

Full rationale

The paper computes perplexity on queries using a fixed external open-source LLM (GPT-2), observes high values for adversarial suffixes, and trains a separate Light-GBM classifier on the resulting perplexity values plus token length. No equations, self-citations, or derivations reduce the detection claim to a fitted parameter or prior result by construction. The approach is a standard data-driven ML pipeline whose central claim rests on observable differences between the tested adversarial and regular prompt distributions rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that perplexity computed by GPT-2 serves as a stable signal for adversarial text and that the chosen classifier features generalize beyond the authors' test distribution.

free parameters (1)
  • LightGBM hyperparameters
    Hyperparameters of the LightGBM model are chosen or tuned on the data; exact values are not stated in the abstract.
axioms (1)
  • domain assumption: Perplexity from GPT-2 reliably distinguishes adversarial suffixes from normal text when combined with length.
    Invoked when claiming that high perplexity plus the classifier solves the false-positive problem.

pith-pipeline@v0.9.0 · 5419 in / 1246 out tokens · 54134 ms · 2026-05-15T13:55:22.051094+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

    cs.CR 2026-04 unverdicted novelty 8.0 full

    No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

  2. BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

    cs.AI 2026-05 conditional novelty 7.0

    BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.

  3. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  4. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  5. Attention Is Where You Attack

    cs.CR 2026-04 unverdicted novelty 7.0

    ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

  6. When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...

  7. Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

    cs.CR 2026-05 accept novelty 6.0

    JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...

  8. Test-Time Safety Alignment

    cs.CL 2026-04 unverdicted novelty 6.0

    Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.

  9. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  10. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  11. SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SIF creates semantically in-distribution fingerprints for LVLMs by distilling text watermarks into visual inputs and optimizing for robustness against detection and modification.

  12. PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

    cs.CR 2026-04 unverdicted novelty 6.0

    PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.

  13. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...

  14. Jailbreaking Black Box Large Language Models in Twenty Queries

    cs.LG 2023-10 conditional novelty 6.0

    PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.

  15. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    cs.LG 2023-10 accept novelty 6.0

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  16. Re-Triggering Safeguards within LLMs for Jailbreak Detection

    cs.CR 2026-05 unverdicted novelty 5.0

    Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.

  17. SoK: Robustness in Large Language Models against Jailbreak Attacks

    cs.CR 2026-05 accept novelty 5.0

    The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

  18. SALLIE: Safeguarding Against Latent Language & Image Exploits

    cs.CR 2026-04 unverdicted novelty 5.0

    SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

  19. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 19 Pith papers · 4 internal anchors
