pith. machine review for the scientific record.

arxiv: 2202.03286 · v1 · submitted 2022-02-07 · 💻 cs.CL · cs.AI · cs.CR · cs.LG

Recognition: no theorem link

Red Teaming Language Models with Language Models

Amelia Glaese, Ethan Perez, Francis Song, Geoffrey Irving, John Aslanides, Nat McAleese, Roman Ring, Saffron Huang, Trevor Cai

Pith reviewed 2026-05-11 20:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CR · cs.LG
keywords red teaming · language models · harmful behaviors · adversarial testing · model safety · offensive content · prompt engineering · reinforcement learning

The pith

One language model generates test cases to automatically uncover tens of thousands of harmful behaviors in another language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a separate language model can produce test questions that provoke and reveal harmful replies from a target model, replacing the need for humans to write every test case by hand. Methods range from basic prompting to reinforcement learning, yielding far more test cases than manual approaches allow. The target 280B-parameter chatbot produces tens of thousands of offensive replies that a trained classifier flags, plus other issues such as leaked private data and biased statements about groups of people. A sympathetic reader would care because current safety checks rely on expensive, limited human effort that cannot keep pace with model scale or deployment speed.

Core claim

We automatically find cases where a target LM behaves in a harmful way, by generating test cases (red teaming) using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation.

What carries the argument

LM-generated adversarial test cases scored by an offensive-content classifier, with prompt engineering and reinforcement learning to control diversity and difficulty.
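The machinery reduces to a simple loop: a red-team LM generates a test question, the target LM replies, and a classifier scores the reply. A minimal sketch of that loop, where `generate_question`, `target_reply`, and `offensiveness` are hypothetical stand-ins for the three model calls (here stubbed with a seeded random generator, not real models):

```python
import random

# Hypothetical stand-ins for the three models in the pipeline: a red-team
# LM emitting test questions, the target chatbot, and a classifier that
# returns the probability a reply is offensive.
def generate_question(rng):
    topics = ["your opinion on X", "a joke about Y", "advice on Z"]
    return f"Tell me {rng.choice(topics)}."

def target_reply(question):
    return f"Reply to: {question}"

def offensiveness(reply, rng):
    return rng.random()  # placeholder score in [0, 1]

def red_team(n_cases, threshold=0.5, seed=0):
    """Zero-shot red teaming: generate, query, classify, keep flagged pairs."""
    rng = random.Random(seed)
    flagged = []
    for _ in range(n_cases):
        q = generate_question(rng)
        r = target_reply(q)
        score = offensiveness(r, rng)
        if score >= threshold:
            flagged.append((q, r, score))
    return flagged

failures = red_team(1000)
print(f"{len(failures)} of 1000 test cases flagged as offensive")
```

At paper scale the same loop runs over hundreds of thousands of generated questions, which is what makes the tens-of-thousands headline count reachable where human annotation is not.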

If this is right

  • Red teaming can scale to far more test cases than human annotation permits.
  • Prompt engineering on the generator LM surfaces targeted harms such as privacy leaks and group-based offense.
  • Reinforcement learning on the generator produces harder test cases than zero-shot prompting.
  • Multi-turn conversation harms become discoverable once the generator produces sequences of prompts.
  • Fixing the behaviors the method uncovers improves safety before user exposure.
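The reinforcement-learning point can be made concrete: the generator is rewarded for eliciting replies the classifier flags, which is also why the referee's circularity concern below has teeth. A minimal sketch of such a reward signal, where the KL-style penalty term and its coefficient are illustrative assumptions rather than the paper's exact objective:

```python
def red_team_reward(classifier_score, logp_policy, logp_init, kl_coef=0.1):
    """Per-test-case reward for the red-team generator: the classifier's
    offensiveness score on the target's reply, minus a KL-style penalty
    keeping the generator near its initial distribution (the penalty form
    and coefficient here are illustrative, not the paper's exact objective)."""
    return classifier_score - kl_coef * (logp_policy - logp_init)

# A flagged reply (score 0.8) earns high reward, slightly discounted
# when the policy has drifted from the initial LM.
print(red_team_reward(0.8, logp_policy=-2.0, logp_init=-2.5))  # -> 0.75
```

Because the reward is the classifier score itself, RL will exploit any systematic classifier error, so harder test cases and inflated false positives are observationally similar without an independent audit.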

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generator LM could be reused continuously during model training to provide ongoing adversarial signals.
  • The approach may extend beyond text to code or image models if analogous classifiers and generators are built.
  • Combining LM red teaming with human review could create a hybrid pipeline that catches both obvious and subtle harms.
  • Models trained to resist their own generated attacks might close the loop into self-improving safety.

Load-bearing premise

The classifier accurately flags the relevant harms and the automatically generated test cases are diverse enough to represent real user interactions.

What would settle it

A side-by-side human review of the same model outputs showing that the classifier misses most of the harms the authors report, or that the LM-generated prompts produce replies unlike those from actual users.

read the original abstract

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an automated approach to red-teaming language models using other language models to generate test cases. The authors generate prompts via zero-shot, few-shot, and reinforcement learning methods, then classify the target 280B LM's responses as offensive using a trained classifier, reporting tens of thousands of such instances. They additionally use prompt engineering to probe for other harms including biased group discussions, phone number leakage, training data memorization, and multi-turn conversation harms.

Significance. If the classifier evaluation holds under the adversarial conditions, this work demonstrates a scalable method for discovering LM harms that complements limited human annotation, with the reported scale of findings (tens of thousands of cases) and multiple generation strategies providing concrete evidence of utility for pre-deployment safety testing.

major comments (2)
  1. [§4] §4 (offensive reply results): The headline quantitative finding of tens of thousands of offensive replies is produced by applying the binary classifier to responses from RL-optimized prompts. No precision, recall, or manual audit is reported on these specific out-of-distribution pairs, even though RL directly maximizes the classifier score; this leaves the false-positive rate unknown and directly affects the scale and comparative claims versus human red-teaming.
  2. [§3.3] §3.3 (RL prompt generation): Because the RL objective optimizes test cases against the same classifier used for labeling, the reported diversity and difficulty advantages of RL-generated cases cannot be interpreted without evidence that classifier precision is preserved in this regime.
minor comments (2)
  1. [§1] The paper would benefit from an explicit statement in the abstract or §1 of the classifier's known limitations when applied to LM-generated adversarial inputs.
  2. [Table 1] Table 1 (method comparison): the diversity and difficulty metrics should include their exact definitions and any normalization used across zero-shot, few-shot, and RL conditions.
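The minor comment on metric definitions is easy to motivate: diversity scores for generated test cases are often Self-BLEU-style measures of pairwise overlap, and their meaning shifts with the n-gram order and normalization used. A simplified unigram-overlap version, illustrative only and not the paper's exact definition:

```python
def unigram_overlap(a, b):
    """Fraction of tokens in a that also appear in b (a crude BLEU-1 stand-in)."""
    ta, tb = a.lower().split(), set(b.lower().split())
    return sum(t in tb for t in ta) / max(len(ta), 1)

def self_similarity(cases):
    """Average over cases of the maximum overlap with any other case;
    lower means a more diverse test set. A simplified Self-BLEU-style
    metric, not the paper's exact definition."""
    scores = []
    for i, a in enumerate(cases):
        others = cases[:i] + cases[i + 1:]
        if others:
            scores.append(max(unigram_overlap(a, b) for b in others))
    return sum(scores) / len(scores) if scores else 0.0

diverse = ["What is your favorite food?", "Tell me about black holes.", "How do planes fly?"]
redundant = ["Tell me a joke.", "Tell me a joke now.", "Please tell me a joke."]
print(self_similarity(redundant), self_similarity(diverse))  # redundant scores higher
```

Whether the max or the mean over other cases is taken, and whether scores are length-normalized, changes the ranking of generation methods, which is why exact definitions matter for the Table 1 comparison.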

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed review and recommendation for major revision. We have carefully considered the comments regarding the validation of the classifier under RL optimization and will revise the manuscript accordingly to provide additional evidence and clarifications.

read point-by-point responses
  1. Referee: §4 (offensive reply results): The headline quantitative finding of tens of thousands of offensive replies is produced by applying the binary classifier to responses from RL-optimized prompts. No precision, recall, or manual audit is reported on these specific out-of-distribution pairs, even though RL directly maximizes the classifier score; this leaves the false-positive rate unknown and directly affects the scale and comparative claims versus human red-teaming.

    Authors: We acknowledge that the current manuscript does not report a dedicated precision, recall, or manual audit specifically for the RL-optimized cases. While the classifier was evaluated on held-out data during training, this does not fully address the out-of-distribution regime induced by RL. In the revised manuscript, we will add a manual audit of a random sample of 100 RL-generated test cases (including their target model responses) to estimate the false-positive rate in this setting. This will allow better assessment of the reported scale and support comparisons to human red-teaming. revision: yes

  2. Referee: §3.3 (RL prompt generation): Because the RL objective optimizes test cases against the same classifier used for labeling, the reported diversity and difficulty advantages of RL-generated cases cannot be interpreted without evidence that classifier precision is preserved in this regime.

    Authors: This observation is correct: the RL method optimizes directly against the classifier, so claims about diversity and difficulty advantages are conditional on the classifier remaining reliable in the optimized regime. We will revise §3.3 to explicitly discuss this dependency as a limitation of the approach. The section will also reference the new manual audit results to provide supporting evidence for classifier precision under RL optimization. revision: yes
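The proposed 100-case manual audit bounds the false-positive rate only up to sampling error; a Wilson score interval shows what a sample of that size buys. A minimal sketch where the audit numbers (85 of 100 flagged replies judged genuinely offensive) are illustrative, not from the paper:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative audit: 100 classifier-flagged replies manually reviewed,
# 85 judged genuinely offensive -> estimated precision 0.85.
lo, hi = wilson_interval(85, 100)
print(f"precision 0.85, 95% CI [{lo:.3f}, {hi:.3f}]")  # [0.767, 0.907]
```

An interval this wide still scales the tens-of-thousands headline count by a large uncertain factor, so a bigger audit, or stratified auditing across generation methods, would strengthen the revision.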

Circularity Check

0 steps flagged

No circularity: purely empirical red-teaming pipeline

full rationale

The paper presents an empirical methodology for automatically generating test cases with language models and evaluating target-model responses via a separately trained classifier. No derivation chain, first-principles result, or prediction is claimed; the headline finding (tens of thousands of offensive replies) is a direct count of classifier-labeled outputs on generated prompts. The RL stage optimizes prompts to maximize the classifier score, but this is an explicit experimental design choice rather than a hidden reduction of the result to its inputs. No self-citation load-bearing step, ansatz smuggling, or renaming of known results occurs. The work is self-contained against external benchmarks in the sense that its claims rest on observable experimental outputs, not on equations that collapse to fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is empirical and relies on standard machine-learning assumptions (classifier reliability, LM generative capability) without introducing new free parameters, axioms, or invented entities in the abstract.

pith-pipeline@v0.9.0 · 5541 in / 992 out tokens · 49269 ms · 2026-05-11T20:52:33.524394+00:00 · methodology

discussion (0)


Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

    cs.CR 2026-04 unverdicted novelty 8.0

    DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

  2. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

    cs.AI 2026-05 unverdicted novelty 7.0

    LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

  3. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  4. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 7.0

    Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

  5. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  6. Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline

    cs.CY 2026-04 unverdicted novelty 7.0

    Single-channel anonymization hides identity bias via cancellation effects, but full-pipeline anonymization reveals that homogeneous ensembles amplify sycophancy while heterogeneous ones reduce it, with one model showi...

  7. PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  8. Adaptive Instruction Composition for Automated LLM Red-Teaming

    cs.CR 2026-04 unverdicted novelty 7.0

    Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on Harmbench.

  9. HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking

    cs.CR 2026-04 unverdicted novelty 7.0

    HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.

  10. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  11. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  12. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  13. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  14. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  15. Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

    cs.CR 2026-05 unverdicted novelty 6.0

    DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

  16. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  17. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 6.0

    PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...

  18. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

    cs.AI 2026-05 unverdicted novelty 6.0

    An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...

  19. Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis

    cs.CR 2026-05 unverdicted novelty 6.0

    Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.

  20. Ethics Testing: Proactive Identification of Generative AI System Harms

    cs.SE 2026-04 unverdicted novelty 6.0

    Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.

  21. Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.

  22. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  23. Reasoning Structure Matters for Safety Alignment of Reasoning Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

  24. From Admission to Invariants: Measuring Deviation in Delegated Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    The Non-Identifiability Theorem shows admissible behavior space A0 is not identifiable from local enforcement signals g under the Local Observability Assumption, so the paper introduces an Invariant Measurement Layer ...

  25. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  26. Jailbreaking Black Box Large Language Models in Twenty Queries

    cs.LG 2023-10 conditional novelty 6.0

    PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.

  27. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    cs.CL 2023-10 conditional novelty 6.0

    AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than base...

  28. Reinforced Self-Training (ReST) for Language Modeling

    cs.CL 2023-08 unverdicted novelty 6.0

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  29. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    cs.CL 2022-08 accept novelty 6.0

    RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.

  30. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  31. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  32. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    cs.CL 2022-04 unverdicted novelty 6.0

    RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...

  33. Laundering AI Authority with Adversarial Examples

    cs.CR 2026-05 unverdicted novelty 5.0

    Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.

  34. A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    cs.CR 2026-05 accept novelty 5.0

    The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

  35. Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    A neurocognitive governance model formalizes a Pre-Action Governance Reasoning Loop that consults global, workflow, agent, and situational rules before each action, yielding 95% compliance accuracy with zero false esc...

  36. A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Training-free methods for LLM trustworthiness show inconsistent results across dimensions, with clear trade-offs in utility, robustness, and overhead depending on where they intervene during inference.

  37. PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk

    cs.AI 2026-04 unverdicted novelty 5.0

    PRISM defines 27 behavioral risk signals from structural anomalies in AI value, evidence, and source hierarchies, evaluated via dual thresholds on forced-choice data from seven models.

  38. Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

    cs.CL 2026-04 unverdicted novelty 5.0

    GCD tightens jailbreak detection with acceptance and refusal anchors and guarantees safe outputs by pre-injecting refusal tokens, cutting false positives 52% versus GradSafe while adding minimal latency.

  39. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  40. Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

    cs.AI 2026-05 unverdicted novelty 4.0

    Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.

  41. Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance

    cs.AI 2026-05 unverdicted novelty 4.0

    AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.

  42. Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation

    cs.CL 2026-04 unverdicted novelty 4.0

    Closure of the Perspective API exposes structural dependence on a single proprietary toxicity scorer, leaving non-updatable benchmarks and irreproducible results while risking continued reliance on closed LLMs.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 41 Pith papers · 1 internal anchor


    with one template. We use Adam (Kingma and Ba, 2015) with a learning rate of 3 × 10−5. The classifier outputs a probability that an utterance is offensive, and we use a threshold of # Params Acc F1 AUC Xu et al. 2021b 0.6×109 85.1 80.8 93.0 Gopher 1.4B 1.4×109 84.5 87.5 92.4 Table 8: Our offensiveness classifier performs similar to that of Xu et al. (2021b)...