Detecting Language Model Attacks with Perplexity
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-15 13:55 UTC · model grok-4.3
The pith
Adversarial jailbreak suffixes produce high perplexity under GPT-2, allowing a classifier on perplexity and length to catch most attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that adversarial suffixes produce exceedingly high perplexity values under GPT-2. They demonstrate that while plain perplexity filtering faces significant false positives from varied normal prompts, a Light-GBM classifier trained on perplexity and token length correctly identifies most adversarial attacks in their test set.
What carries the argument
Perplexity score from GPT-2 on the full query, paired with token length as features for a Light-GBM classifier that separates adversarial from normal prompts.
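A minimal sketch of that pipeline, assuming GPT-2 via the Hugging Face transformers library and the lightgbm Python package; the variable names, the training prompts/labels, and the classifier settings are illustrative assumptions rather than the authors' code.

```python
# Sketch (not the authors' code): GPT-2 perplexity + token length as the two
# features of a LightGBM classifier separating adversarial from normal prompts.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import lightgbm as lgb

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity_and_length(text: str) -> list[float]:
    """Perplexity of the full query under GPT-2, plus its token length."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    input_ids = enc["input_ids"]
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean next-token cross-entropy.
        loss = model(input_ids, labels=input_ids).loss
    return [math.exp(loss.item()), float(input_ids.shape[1])]

# `train_prompts` and `train_labels` (1 = adversarial, 0 = normal) are assumed to exist.
# features = [perplexity_and_length(p) for p in train_prompts]
# clf = lgb.LGBMClassifier(n_estimators=200)  # hyperparameters are illustrative
# clf.fit(features, train_labels)
```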
If this is right
- Perplexity checks can be inserted as an early filter to block many jailbreak attempts before they reach the target LLM (sketched in code after this list).
- Adding token length to the classifier reduces errors caused by unusual but benign user prompts.
- An open-source model like GPT-2 can act as the detector without any access to the target model's parameters or responses.
- The method limits exposure to prompts that request instructions for explosives, theft, or other harmful content.
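A hedged sketch of that early-filter deployment, reusing perplexity_and_length and clf from the sketch above; call_target_llm and the refusal message are placeholders, not part of the paper.

```python
# Illustrative early-filter wrapper placed in front of a protected model.
def guarded_query(prompt: str, call_target_llm) -> str:
    ppl, n_tokens = perplexity_and_length(prompt)
    if clf.predict([[ppl, n_tokens]])[0] == 1:
        # Flagged as a likely adversarial-suffix query; never reaches the target LLM.
        return "Request blocked by the perplexity/length filter."
    return call_target_llm(prompt)
```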
Where Pith is reading between the lines
- Attackers could eventually discover suffixes that keep perplexity low under GPT-2, requiring retraining or replacement of the detector.
- The same perplexity-plus-length approach might be tested on other open models if GPT-2 loses effectiveness against new attacks.
- Combining the classifier with downstream checks on the model's generated output could raise overall resistance to evolving jailbreaks.
Load-bearing premise
The collection of regular prompts used to measure false positives reflects real-world variety, and future attackers will not adapt their suffixes to produce low perplexity under GPT-2.
What would settle it
Generation of adversarial suffixes that achieve low perplexity under GPT-2 yet still succeed in jailbreaking the target model would show the detection method fails.
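One hedged way to make that settling experiment concrete is to add a GPT-2 perplexity penalty to the usual suffix-optimization objective, so the attacker searches for suffixes that still jailbreak the target while staying below the detector's operating point; the notation and the weight lambda below are illustrative assumptions, not from the paper.

```latex
% Illustrative adaptive-attack objective (assumption, not from the paper):
% s = adversarial suffix, x = harmful request, \oplus = concatenation,
% L_adv = the attack's jailbreak loss on the target model, lambda >= 0 trades
% attack success against stealth under the GPT-2 perplexity detector.
\min_{s}\; \mathcal{L}_{\mathrm{adv}}(x \oplus s)
  \;+\; \lambda \,\log \mathrm{PPL}_{\mathrm{GPT2}}(x \oplus s)
```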
read the original abstract
A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that queries containing adversarial suffixes for LLM jailbreaks exhibit high perplexity under GPT-2, that plain perplexity filtering produces many false positives on diverse normal prompts, and that a Light-GBM classifier using perplexity plus token length resolves those false positives while correctly detecting most attacks in the test set.
Significance. If the detection remains reliable, the approach supplies a lightweight, external-model filter that requires no access to the target LLM and could be deployed as a first-stage guardrail; the empirical separation shown for the tested attacks is a concrete, immediately usable signal.
major comments (2)
- [Experiments] The evaluation only considers the fixed, non-adaptive adversarial suffixes from the source attack papers; no experiments generate or test suffixes that explicitly minimize GPT-2 perplexity (e.g., via token-level gradient search or evolutionary search) while preserving jailbreak success. This is load-bearing for any claim of general detection utility.
- [Abstract and Results] The abstract and results sections report classifier performance but supply no test-set size, exact metrics (precision/recall/AUC), baseline comparisons, or error analysis, leaving the quantitative strength of the separation difficult to evaluate.
minor comments (2)
- [Data] Specify the exact distribution and size of the regular (non-adversarial) prompt corpus used to measure false-positive rates.
- [Methods] Report the Light-GBM hyper-parameters, training/validation split, and feature-importance values to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments, which help improve the clarity and rigor of our work. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Experiments] The evaluation only considers the fixed, non-adaptive adversarial suffixes from the source attack papers; no experiments generate or test suffixes that explicitly minimize GPT-2 perplexity (e.g., via token-level gradient search or evolutionary search) while preserving jailbreak success. This is load-bearing for any claim of general detection utility.
Authors: We agree that evaluating against adaptive attacks designed to minimize GPT-2 perplexity is important for assessing the general utility of the detection method. Our current work focuses on the adversarial suffixes as published in the source papers, which already exhibit high perplexity. We will add a new subsection in the discussion to explicitly acknowledge this limitation and suggest future experiments using optimization techniques like gradient search to generate low-perplexity jailbreaks. We believe the observed separation for existing attacks still demonstrates the potential of perplexity-based detection as an initial filter.
Revision: partial
-
Referee: [Abstract and Results] The abstract and results sections report classifier performance but supply no test-set size, exact metrics (precision/recall/AUC), baseline comparisons, or error analysis, leaving the quantitative strength of the separation difficult to evaluate.
Authors: We apologize for the lack of detailed quantitative reporting. We will update the abstract and results section to include the test-set composition (number of normal and adversarial prompts), the exact performance metrics of the LightGBM classifier (including precision, recall, and AUC), comparisons against the perplexity-only baseline, and an error analysis highlighting the types of false positives encountered with plain perplexity filtering.
Revision: yes
Circularity Check
No significant circularity; empirical pipeline using external model
full rationale
The paper computes perplexity on queries using a fixed external open-source LLM (GPT-2), observes high values for adversarial suffixes, and trains a separate Light-GBM classifier on the resulting perplexity values plus token length. No equations, self-citations, or derivations reduce the detection claim to a fitted parameter or prior result by construction. The approach is a standard data-driven ML pipeline whose central claim rests on observable differences between the tested adversarial and regular prompt distributions rather than any self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- LightGBM hyperparameters
axioms (1)
- domain assumption: Perplexity from GPT-2 reliably distinguishes adversarial suffixes from normal text when combined with length.
Lean theorems connected to this paper
-
Foundation.DAlembert.Inevitability.bilinear_family_forced · tagged unclear
Relation between the paper passage and the cited Recognition theorem:
We use the Greedy Coordinate Gradient (GCG) algorithm described in (Zou et al., 2023). We treat it as a black box
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
-
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
-
Test-Time Safety Alignment
Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
-
An AI Agent Execution Environment to Safeguard User Data
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
SIF creates semantically in-distribution fingerprints for LVLMs by distilling text watermarks into visual inputs and optimizing for robustness against detection and modification.
-
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
-
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
-
SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Reference graph
Works this paper leans on
-
[1]
Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...
work page 2022
-
[3]
Boolq: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019
work page 2019
-
[4]
Certified adversarial robustness via randomized smoothing
Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.\ 1310--1320. PMLR, 09--15 Jun 2019. URL https://proceedings.mlr.pre...
work page 2019
-
[5]
Monitor alarm fatigue: an integrative review
Maria Cvach. Monitor alarm fatigue: an integrative review. Biomedical instrumentation & technology, 2012
work page 2012
-
[6]
Improving alignment of dialogue agents via targeted human judgments, 2022
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nich...
work page 2022
-
[7]
Explaining and harnessing adversarial examples
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2015
work page 2015
-
[9]
Unsolved problems in ml safety, 2022
Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ml safety, 2022
work page 2022
-
[10]
Lora: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9
work page 2022
-
[11]
Hugging Face. Perplexity. https://huggingface.co/docs/transformers/perplexity, 2023. Accessed: 2023-08-26
work page 2023
-
[12]
Baseline defenses for adversarial attacks against aligned language models, 2023
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models, 2023
work page 2023
-
[13]
Rubén Darío Jaramillo. chatgpt-jailbreak-prompts (dataset). https://huggingface.co/datasets/rubend18/chatgpt-jailbreak-prompts, 2023. Accessed: 2023-09-20
work page 2023
-
[14]
Automatically auditing large language models via discrete optimization, 2023
Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization, 2023
work page 2023
-
[15]
Open sesame! universal black box jailbreaking of large language models, 2023
Raz Lapid, Ron Langberg, and Moshe Sipper. Open sesame! universal black box jailbreaking of large language models, 2023
work page 2023
-
[16]
Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms, 2023
work page 2023
-
[17]
Globally-robust neural networks, 2021
Klas Leino, Zifan Wang, and Matt Fredrikson. Globally-robust neural networks, 2021
work page 2021
-
[18]
Rain: Your language models can align themselves without finetuning, 2023
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. Rain: Your language models can align themselves without finetuning, 2023
work page 2023
-
[19]
Towards deep learning models resistant to adversarial attacks
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb
work page 2018
-
[20]
Tapir: Trigger action platform for information retrieval
Mattia Limone, Gaetano Cimino, and Annunziata Elefante. Tapir: Trigger action platform for information retrieval. https://github.com/MattiaLimone/ifttt_recommendation_system, 2023
work page 2023
-
[21]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018
work page 2018
- [22]
-
[23]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022
work page 2022
-
[24]
Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples, 2016
work page 2016
-
[25]
Language models are unsupervised multitask learners
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019. URL https://api.semanticscholar.org/CorpusID:160025533
work page 2019
-
[26]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv e-prints, arXiv:1606.05250, 2016
work page 2016
-
[27]
Arb: Advanced reasoning benchmark for large language models
Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, and Aran Komatsuzaki. Arb: Advanced reasoning benchmark for large language models, 2023
work page 2023
-
[28]
Autoprompt: Eliciting knowledge from language models with automatically generated prompts
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts, 2020
work page 2020
-
[29]
Llama 2: Open foundation and fine-tuned chat models, 2023
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, and Nikolay Bashlykov. Llama 2: Open foundation and fine-tuned chat models, 2023
work page 2023
-
[31]
Self-instruct: Aligning language models with self-generated instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023
work page 2023
-
[32]
Jailbroken: How does llm safety training fail?, 2023
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?, 2023
work page 2023
-
[33]
Fundamental limitations of alignment in large language models, 2023
Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models, 2023
work page 2023
-
[34]
Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. Dp-gan: Diversity-promoting generative adversarial network for generating informative and diversified text, 2018
work page 2018
-
[36]
Reclor: A reading comprehension dataset requiring logical reasoning
Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations (ICLR), April 2020
work page 2020
-
[37]
Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023
work page 2023
-
[38]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023
work page 2023
-
[39]
Universal and transferable adversarial attacks on aligned language models
Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023
work page 2023
-
[40]
Real-time segmentation of on-line handwritten Arabic script
Real-time segmentation of on-line handwritten Arabic script. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, 2014
work page 2014
- [41]
-
[42]
Fast classification of handwritten on-line Arabic characters
Fast classification of handwritten on-line Arabic characters. In Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of, 2014
work page 2014
-
[43]
Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications. arXiv preprint arXiv:1804.09028
work page 2018
-
[44]
Adversarial Examples Are Not Bugs, They Are Features. 2019
work page 2019
-
[45]
Towards Deep Learning Models Resistant to Adversarial Attacks
Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations
-
[46]
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. 2020
work page 2020
-
[47]
Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023
work page 2023
-
[48]
Automatically Auditing Large Language Models via Discrete Optimization. 2023
work page 2023
-
[49]
Adversarial Attack and Defense of Structured Prediction Models
Han, Wenjuan and Zhang, Liwen and Jiang, Yong and Tu, Kewei. Adversarial Attack and Defense of Structured Prediction Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.182
-
[50]
DP-GAN: Diversity-Promoting Generative Adversarial Network for Generating Informative and Diversified Text. 2018
work page 2018
-
[51]
Certified Adversarial Robustness via Randomized Smoothing
Jeremy Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified Adversarial Robustness via Randomized Smoothing. CoRR, arXiv:1902.02918, 2019
work page 2019
- [52]
-
[53]
Training language models to follow instructions with human feedback. 2022
work page 2022
-
[54]
gpt-xl
-
[55]
perplexity
-
[56]
vicuna7b
-
[57]
Certified Adversarial Robustness via Randomized Smoothing
Certified Adversarial Robustness via Randomized Smoothing. In Proceedings of the 36th International Conference on Machine Learning, 2019
work page 2019
-
[58]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023
work page 2023
-
[59]
Jaramilo:GPT4Jailbreak
Rubén Darío Jaramillo. Jaramilo:GPT4Jailbreak
-
[60]
Zhao:llmOpenDatasets
-
[61]
GitHub repository
Mattia Limone, Gaetano Cimino, and Annunziata Elefante. GitHub repository, 2023
work page 2023
-
[62]
Know What You Don't Know: Unanswerable Questions for SQuAD. 2018
work page 2018
-
[63]
Reclor: A reading comprehension dataset requiring logical reasoning
Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations (ICLR)
-
[64]
TheoremQA: A Theorem-driven Question Answering dataset
TheoremQA: A Theorem-driven Question Answering dataset. arXiv preprint arXiv:2305.12524
-
[65]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In EMNLP
-
[66]
ARB: Advanced Reasoning Benchmark for Large Language Models. arXiv:2307.13692
-
[67]
Platypus: Quick, Cheap, and Powerful Refinement of LLMs. arXiv:2308.07317
-
[68]
Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023
work page 2023
-
[69]
LoRA: Low-Rank Adaptation of Large Language Models
LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations
-
[70]
Fundamental Limitations of Alignment in Large Language Models. 2023
work page 2023
-
[71]
Jailbroken: How Does LLM Safety Training Fail? 2023
work page 2023
- [72]
- [73]
-
[74]
Explaining and Harnessing Adversarial Examples. 2015
work page 2015
-
[75]
Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples. 2016
work page 2016
-
[76]
Open Sesame! Universal Black Box Jailbreaking of Large Language Models. 2023
work page 2023
- [77]
-
[78]
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. 2023
work page 2023
-
[79]
Improving alignment of dialogue agents via targeted human judgments
-
[80]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. 2022
work page 2022
-
[81]
Self-Instruct: Aligning Language Models with Self-Generated Instructions. 2023
work page 2023
-
[82]
Baseline Defenses for Adversarial Attacks Against Aligned Language Models. 2023
work page 2023
-
[83]
DocRED: A Large-Scale Document-Level Relation Extraction Dataset
Yao, Yuan and Ye, Deming and Li, Peng and Han, Xu and Lin, Yankai and Liu, Zhenghao and Liu, Zhiyuan and Huang, Lixin and Zhou, Jie and Sun, Maosong. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1074
-
[84]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In NAACL
discussion (0)