Recognition: no theorem link
Red Teaming Language Models with Language Models
Pith reviewed 2026-05-11 20:52 UTC · model grok-4.3
The pith
One language model generates test cases to automatically uncover tens of thousands of harmful behaviors in another language model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We automatically find cases where a target LM behaves in a harmful way, by generating test cases (red teaming) using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and医院电话号
What carries the argument
LM-generated adversarial test cases scored by an offensive-content classifier, with prompt engineering and reinforcement learning to control diversity and difficulty.
If this is right
- Red teaming can scale to far more test cases than human annotation permits.
- Prompt engineering on the generator LM surfaces targeted harms such as privacy leaks and group-based offense.
- Reinforcement learning on the generator produces harder test cases than zero-shot prompting.
- Multi-turn conversation harms become discoverable once the generator produces sequences of prompts.
- Fixing the behaviors the method uncovers improves safety before user exposure.
Where Pith is reading between the lines
- The same generator LM could be reused continuously during model training to provide ongoing adversarial signals.
- The approach may extend beyond text to code or image models if analogous classifiers and generators are built.
- Combining LM red teaming with human review could create a hybrid pipeline that catches both obvious and subtle harms.
- Models trained to resist their own generated attacks might close the loop into self-improving safety.
Load-bearing premise
The classifier accurately flags the relevant harms and the automatically generated test cases are diverse enough to represent real user interactions.
What would settle it
A side-by-side human review of the same model outputs showing that the classifier misses most of the harms the authors report, or that the LM-generated prompts produce replies unlike those from actual users.
read the original abstract
Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an automated approach to red-teaming language models using other language models to generate test cases. The authors generate prompts via zero-shot, few-shot, and reinforcement learning methods, then classify the target 280B LM's responses as offensive using a trained classifier, reporting tens of thousands of such instances. They additionally use prompt engineering to probe for other harms including biased group discussions, phone number leakage, training data memorization, and multi-turn conversation harms.
Significance. If the classifier evaluation holds under the adversarial conditions, this work demonstrates a scalable method for discovering LM harms that complements limited human annotation, with the reported scale of findings (tens of thousands of cases) and multiple generation strategies providing concrete evidence of utility for pre-deployment safety testing.
major comments (2)
- [§4] §4 (offensive reply results): The headline quantitative finding of tens of thousands of offensive replies is produced by applying the binary classifier to responses from RL-optimized prompts. No precision, recall, or manual audit is reported on these specific out-of-distribution pairs, even though RL directly maximizes the classifier score; this leaves the false-positive rate unknown and directly affects the scale and comparative claims versus human red-teaming.
- [§3.3] §3.3 (RL prompt generation): Because the RL objective optimizes test cases against the same classifier used for labeling, the reported diversity and difficulty advantages of RL-generated cases cannot be interpreted without evidence that classifier precision is preserved in this regime.
minor comments (2)
- [§1] The paper would benefit from an explicit statement in the abstract or §1 of the classifier's known limitations when applied to LM-generated adversarial inputs.
- [Table 1] Table 1 (method comparison): the diversity and difficulty metrics should include their exact definitions and any normalization used across zero-shot, few-shot, and RL conditions.
Simulated Author's Rebuttal
We appreciate the referee's detailed review and recommendation for major revision. We have carefully considered the comments regarding the validation of the classifier under RL optimization and will revise the manuscript accordingly to provide additional evidence and clarifications.
read point-by-point responses
-
Referee: §4 (offensive reply results): The headline quantitative finding of tens of thousands of offensive replies is produced by applying the binary classifier to responses from RL-optimized prompts. No precision, recall, or manual audit is reported on these specific out-of-distribution pairs, even though RL directly maximizes the classifier score; this leaves the false-positive rate unknown and directly affects the scale and comparative claims versus human red-teaming.
Authors: We acknowledge that the current manuscript does not report a dedicated precision, recall, or manual audit specifically for the RL-optimized cases. While the classifier was evaluated on held-out data during training, this does not fully address the out-of-distribution regime induced by RL. In the revised manuscript, we will add a manual audit of a random sample of 100 RL-generated test cases (including their target model responses) to estimate the false-positive rate in this setting. This will allow better assessment of the reported scale and support comparisons to human red-teaming. revision: yes
-
Referee: §3.3 (RL prompt generation): Because the RL objective optimizes test cases against the same classifier used for labeling, the reported diversity and difficulty advantages of RL-generated cases cannot be interpreted without evidence that classifier precision is preserved in this regime.
Authors: This observation is correct: the RL method optimizes directly against the classifier, so claims about diversity and difficulty advantages are conditional on the classifier remaining reliable in the optimized regime. We will revise §3.3 to explicitly discuss this dependency as a limitation of the approach. The section will also reference the new manual audit results to provide supporting evidence for classifier precision under RL optimization. revision: yes
Circularity Check
No circularity: purely empirical red-teaming pipeline
full rationale
The paper presents an empirical methodology for automatically generating test cases with language models and evaluating target-model responses via a separately trained classifier. No derivation chain, first-principles result, or prediction is claimed; the headline finding (tens of thousands of offensive replies) is a direct count of classifier-labeled outputs on generated prompts. The RL stage optimizes prompts to maximize the classifier score, but this is an explicit experimental design choice rather than a hidden reduction of the result to its inputs. No self-citation load-bearing step, ansatz smuggling, or renaming of known results occurs. The work is self-contained against external benchmarks in the sense that its claims rest on observable experimental outputs, not on equations that collapse to fitted parameters.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 42 Pith papers
-
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
-
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline
Single-channel anonymization hides identity bias via cancellation effects, but full-pipeline anonymization reveals that homogeneous ensembles amplify sycophancy while heterogeneous ones reduce it, with one model showi...
-
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
Adaptive Instruction Composition for Automated LLM Red-Teaming
Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on Harmbench.
-
HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking
HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...
-
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...
-
Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis
Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
-
Ethics Testing: Proactive Identification of Generative AI System Harms
Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.
-
Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.
-
An AI Agent Execution Environment to Safeguard User Data
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
-
Reasoning Structure Matters for Safety Alignment of Reasoning Models
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
-
From Admission to Invariants: Measuring Deviation in Delegated Agent Systems
The Non-Identifiability Theorem shows admissible behavior space A0 is not identifiable from local enforcement signals g under the Local Observability Assumption, so the paper introduces an Invariant Measurement Layer ...
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than base...
-
Reinforced Self-Training (ReST) for Language Modeling
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
-
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
-
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergen...
-
Laundering AI Authority with Adversarial Examples
Adversarial examples enable AI authority laundering by causing production VLMs to give authoritative but wrong responses on subtly perturbed images, with success rates of 22-100% using decade-old attack methods.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents
A neurocognitive governance model formalizes a Pre-Action Governance Reasoning Loop that consults global, workflow, agent, and situational rules before each action, yielding 95% compliance accuracy with zero false esc...
-
A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
Training-free methods for LLM trustworthiness show inconsistent results across dimensions, with clear trade-offs in utility, robustness, and overhead depending on where they intervene during inference.
-
PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk
PRISM defines 27 behavioral risk signals from structural anomalies in AI value, evidence, and source hierarchies, evaluated via dual thresholds on forced-choice data from seven models.
-
Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering
GCD tightens jailbreak detection with acceptance and refusal anchors and guarantees safe outputs by pre-injecting refusal tokens, cutting false positives 52% versus GradSafe while adding minimal latency.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.
-
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
-
Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation
Closure of the Perspective API exposes structural dependence on a single proprietary toxicity scorer, leaving non-updatable benchmarks and irreproducible results while risking continued reliance on closed LLMs.
Reference graph
Works this paper leans on
-
[1]
Universal adversarial attacks on text classifiers. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7345–7349. Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances ...
work page 2019
-
[2]
In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901
Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc. Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of Bleu in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Lingui...
work page 1901
-
[3]
Evaluating large language models trained on code. CoRR. Mia Xu Chen, Benjamin N. Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M. Dai, Zhifeng Chen, Timothy Sohn, and Yonghui Wu. 2019. Gmail smart compose: Real-time assisted writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discov...
-
[4]
Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83, Online. Association for Computational Linguistics. Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Àgata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind W. Picard. 2019. Way...
-
[5]
How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages 2122–2132, Austin, Texas. Association for Computational Linguistics. Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, an...
work page 2016
-
[6]
Say what I want: Towards the dark side of neural dialogue models. CoRR, abs/1909.06044. Haochen Liu, Zhiwei Wang, Tyler Derr, and Jiliang Tang. 2020b. Chat as expected: Learning to manipulate black-box neural dialogue models. CoRR, abs/2005.13170. Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into transferable adversarial examples and b...
-
[7]
Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples
Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019, pages 739–753. Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial N...
work page Pith review arXiv 2019
-
[8]
Finding generalizable evidence by learning to convince Q&A models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2402–2411, Hong Kong, China. Association for Computational Linguistics. Ethan Perez, Douwe Kiela, and Ky...
work page 2019
-
[9]
True few-shot learning with language models. arXiv. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang...
work page 2021
-
[10]
HateCheck: Functional tests for hate speech detection models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages 41–58, Online. Association for Computational Linguistics. Benjamin I. P. Rubinstein, Peter L. Bar...
work page 2012
-
[11]
Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8741–8748
Hierarchical reinforcement learning for open- domain dialog. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8741–8748. Simon Schmitt, Jonathan J. Hudson, Augustin Zídek, Simon Osindero, Carl Doersch, Wojciech M. Czarnecki, Joel Z. Leibo, Heinrich Küttler, Andrew Zisserman, Karen Simonyan, and S. M. Ali Eslami
-
[12]
Kickstarting deep reinforcement learning. CoRR, abs/1803.03835. Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning , volume 80 of Proceedings of Machine Learning Research , pages 4596–4604. PMLR. Emily Sheng, Kai-Wei Chang, Premkumar ...
-
[13]
Detoxifying language models risks marginalizing minority voices
Neural text generation with unlikelihood training. In International Conference on Learning Representations. Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021a. Recursively summarizing books with human feedback. Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2021b. Polyjuice: Genera...
-
[14]
Fine-Tuning Language Models from Human Preferences
Generating Natural Adversarial Examples. In International Conference on Learning Representations (ICLR). Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIG...
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[15]
Hello” to the dialogue before making a prediction. We chose “Hello,
with one template. We use Adam (Kingma and Ba, 2015) with a learning rate of 3 × 10−5. The classifier outputs a probability that an utterance is offensive, and we use a threshold of # Params Acc F1 AUC Xu et al. 2021b 0.6×109 85.1 80.8 93.0 Gopher 1.4B 1.4×109 84.5 87.5 92.4 Table 8: Our offensiveness classifier performs similar to that of Xu et al. (2021b)...
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.