pith. machine review for the scientific record.

arxiv: 2209.07858 · v2 · submitted 2022-08-23 · 💻 cs.CL · cs.AI · cs.CY


Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Amanda Askell, Andy Jones, Anna Chen, Ben Mann, Catherine Olsson, Chris Olah, Danny Hernandez, Dario Amodei, Dawn Drain, Deep Ganguli, Eli Tran-Johnson, Ethan Perez, Jack Clark, Jackson Kernion, Jared Kaplan, Josh Jacobson, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Joseph, Nicholas Schiefer, Nova DasSarma, Sam Bowman, Sam McCandlish, Sam Ringer, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El-Showk, Stanislav Fort, Tom Brown, Tom Conerly, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds

Pith reviewed 2026-05-12 01:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords red teaming · language models · RLHF · scaling behaviors · harmful outputs · safety evaluation · dataset release

The pith

RLHF-trained language models become progressively harder to red-team into harmful outputs as they scale up in size, while other training approaches show no such improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether red teaming—deliberately prompting models to produce harmful responses—reveals different patterns of vulnerability depending on model size and training method. It compares plain language models, prompted helpful-honest-harmless models, rejection-sampling models, and RLHF models at three sizes (2.7B, 13B, and 52B parameters). The central result is that only the RLHF models grow harder to attack with scale; success rates for the other three categories remain roughly flat. The authors also release the full set of 38,961 attacks they collected and lay out their exact procedures so others can replicate or improve on the work.

Core claim

Across the tested model sizes and types, RLHF models show a clear increase in resistance to red team attacks as parameter count grows, whereas plain LMs, prompted LMs, and rejection-sampling LMs exhibit flat trends in attack success rate with scale. The work further catalogs a wide range of elicited harms, from overt offensive language to subtler non-violent unethical content, and supplies the complete attack dataset together with detailed methodology for community use.
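The headline comparison can be reproduced from the released attacks with a simple aggregation. A minimal sketch in Python, assuming hypothetical record keys (`model_type`, `params_b`, `success`) rather than the dataset's actual schema:

```python
from collections import defaultdict

def success_rates(attacks):
    """Per-(model type, size) red-team attack success rates.

    Each record is assumed to carry 'model_type', 'params_b', and a
    boolean-valued 'success'; the released dataset uses its own fields.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for a in attacks:
        key = (a["model_type"], a["params_b"])
        counts[key][0] += a["success"]
        counts[key][1] += 1
    return {k: s / n for k, (s, n) in counts.items()}

# Toy records illustrating the reported pattern: RLHF success falls with
# scale while the plain LM stays flat (numbers invented for the demo).
toy = (
    [{"model_type": "rlhf", "params_b": 2.7, "success": s} for s in (1, 1, 0, 0)]
    + [{"model_type": "rlhf", "params_b": 52, "success": s} for s in (1, 0, 0, 0)]
    + [{"model_type": "plain", "params_b": 2.7, "success": s} for s in (1, 1, 0, 0)]
    + [{"model_type": "plain", "params_b": 52, "success": s} for s in (1, 1, 0, 0)]
)
rates = success_rates(toy)
```

On real data the interesting comparison is the same dictionary evaluated at all four model types and all three sizes.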

What carries the argument

Comparative red-teaming success rate measured across four model training regimes (plain LM, prompted HH, rejection sampling, RLHF) at three parameter scales, with the RLHF regime as the variable that produces the observed scaling improvement in resistance.

If this is right

  • Larger RLHF models will likely need more advanced or automated red-teaming methods to continue uncovering residual harms.
  • The released attack dataset supplies a public benchmark that future safety methods can be measured against.
  • Training regimes other than RLHF do not appear to confer the same scaling advantage in resistance to attack.
  • Transparency in red-teaming procedures enables shared standards for evaluating model safety across labs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaling pattern holds, RLHF-style training may provide a practical route for safety to improve alongside raw capability at larger scales.
  • The flat trends for non-RLHF models suggest that neither prompt engineering nor rejection sampling alone is likely to close the safety gap as models grow.
  • The methods could be extended to test whether similar scaling resistance appears in multimodal or agentic systems trained with comparable feedback.

Load-bearing premise

The particular red-teaming instructions, prompts, and attack strategies used in the study are comprehensive enough to surface most or all of the harmful behaviors these models can exhibit.

What would settle it

A follow-up experiment that applies the same or closely matched attack distribution to a substantially larger RLHF model (for example 100B+ parameters) and measures an attack success rate that does not continue to decline, or that rises, would falsify the reported scaling trend for RLHF.
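One concrete form of this check is to fit the trend of attack success rate against log model size for the RLHF series and see whether the slope stays negative once a larger model is added. A sketch with invented illustrative rates, not figures from the paper:

```python
import math

def success_slope(sizes_b, rates):
    """Least-squares slope of attack success rate vs. log10(parameters)."""
    xs = [math.log10(s * 1e9) for s in sizes_b]  # sizes given in billions
    mx = sum(xs) / len(xs)
    my = sum(rates) / len(rates)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, rates))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Illustrative declining RLHF trend over the three published sizes; a
# follow-up point at 100B+ that flattens or reverses the slope would
# falsify the reported scaling behavior.
trend = success_slope([2.7, 13, 52], [0.30, 0.22, 0.15])
```

A flat series yields a slope of (numerically) zero, which is exactly the pattern the paper reports for the non-RLHF model types.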

Original abstract

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper describes early efforts to red team language models to discover, measure, and reduce harmful outputs. It makes three contributions: (1) an investigation of scaling behaviors for red teaming success across three model sizes (2.7B, 13B, 52B) and four model types (plain LM, prompted helpful/honest/harmless, rejection sampling, and RLHF), finding that RLHF models become increasingly difficult to red team with scale while the other types show flat trends; (2) release of a dataset containing 38,961 red team attacks; and (3) detailed descriptions of instructions, processes, statistical methodologies, and uncertainties, along with analysis of the harmful outputs elicited (ranging from offensive language to subtle unethical behaviors).

Significance. If the scaling trends prove robust, the work supplies concrete empirical data on how alignment methods like RLHF affect vulnerability to adversarial elicitation of harms, informing safer deployment of larger models. The public release of the large attack dataset is a clear asset that enables independent verification and further research on red teaming techniques. The paper's emphasis on methodological transparency and explicit discussion of uncertainties is a positive contribution toward community standards in AI safety evaluation.

major comments (1)
  1. [Scaling behaviors] Scaling behaviors section (and abstract claim): The central result that RLHF models are increasingly difficult to red team with scale, while other model types remain flat, assumes consistent red teaming effort and strategy across conditions. The manuscript does not report per-model metrics on attack persistence (e.g., average turns per conversation, number of unique prompt variants tried, or stopping criteria) or indicate whether red teamers were blinded to model identity or type. Without such controls, lower success rates on larger RLHF models could reflect differences in human effort or adaptation rather than intrinsic scaling of refusal behavior. Although the released dataset permits post-hoc checks, the paper should include an analysis of effort-related statistics across the four model types to support the scaling interpretation.
minor comments (2)
  1. [Methods] Methods section: While the paper states it exhaustively describes statistical methodologies, adding explicit formulas or pseudocode for how red team success rates and uncertainty estimates were computed (including any adjustments for multiple comparisons across model sizes) would improve reproducibility.
  2. [Dataset] Dataset description: The release of 38,961 attacks is valuable, but the paper would benefit from additional metadata on red teamer demographics, experience levels, and any training provided, to allow readers to assess potential sources of bias in the attack distribution.
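On the first minor comment: the paper does not pin down its uncertainty estimator, and one standard choice for a proportion such as attack success rate is the Wilson score interval. A sketch of that choice, not the authors' actual method:

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half

# E.g. 120 successful attacks out of 400 attempts (invented numbers):
lo, hi = wilson_interval(120, 400)
```

Unlike the naive normal approximation, the Wilson interval behaves sensibly near 0% and 100% success, which matters for the largest RLHF models where successes become rare.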

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the paper's contributions to red teaming methods and the public dataset release. We address the major comment below.

Point-by-point responses
  1. Referee: [Scaling behaviors] Scaling behaviors section (and abstract claim): The central result that RLHF models are increasingly difficult to red team with scale, while other model types remain flat, assumes consistent red teaming effort and strategy across conditions. The manuscript does not report per-model metrics on attack persistence (e.g., average turns per conversation, number of unique prompt variants tried, or stopping criteria) or indicate whether red teamers were blinded to model identity or type. Without such controls, lower success rates on larger RLHF models could reflect differences in human effort or adaptation rather than intrinsic scaling of refusal behavior. Although the released dataset permits post-hoc checks, the paper should include an analysis of effort-related statistics across the four model types to support the scaling interpretation.

    Authors: We thank the referee for highlighting this important potential confound in our scaling analysis. We agree that the absence of reported effort metrics leaves room for alternative interpretations. Our red teaming protocol used identical instructions, attack strategies, and stopping criteria for all model types and sizes, as described in the methods. However, we did not report per-model statistics on conversation length or prompt variants, and red teamers were not blinded to model identity. To address this directly, we will perform a post-hoc analysis of the released dataset of 38,961 attacks to compute effort-related metrics (average turns, unique variants attempted) broken down by model type and size, and include these results in the revised manuscript. These results will support the interpretation that the RLHF scaling trend reflects model behavior rather than differences in human effort. revision: yes
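The promised post-hoc effort analysis amounts to grouping the released conversations by condition and comparing persistence statistics. A minimal sketch with hypothetical field names (`model_type`, `num_turns`); the dataset's actual schema may differ:

```python
from collections import defaultdict
from statistics import mean

def effort_by_type(conversations):
    """Mean red-team turns per conversation, grouped by model type."""
    groups = defaultdict(list)
    for c in conversations:
        groups[c["model_type"]].append(c["num_turns"])
    return {t: mean(turns) for t, turns in groups.items()}

# Roughly equal effort across conditions would support reading the RLHF
# trend as model behavior rather than attacker persistence (toy data).
toy = [
    {"model_type": "rlhf", "num_turns": 4},
    {"model_type": "rlhf", "num_turns": 6},
    {"model_type": "plain", "num_turns": 5},
    {"model_type": "plain", "num_turns": 5},
]
effort = effort_by_type(toy)
```

The same grouping extends to other effort proxies (unique prompt variants, time per conversation) by swapping the aggregated field.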

Circularity Check

0 steps flagged

No significant circularity; empirical scaling trends derived from direct measurements

full rationale

The paper reports empirical results from human red-teaming experiments across model scales and types, with the central claim (RLHF models become harder to red-team with scale while others show flat trends) resting on observed attack success rates in the released dataset of 38,961 attacks. No mathematical derivations, fitted parameters renamed as predictions, or self-citations are used to establish the scaling behaviors; the trends follow directly from the collected data without reduction to prior inputs or definitions. Self-citations appear only for background methods and are not load-bearing for the scaling observations. The work is self-contained against external benchmarks via the public dataset, which permits independent verification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is empirical and does not introduce new mathematical axioms, free parameters, or invented entities; it builds on standard practices in machine learning and AI safety evaluation.

axioms (1)
  • domain assumption Human evaluators can reliably identify harmful outputs from language models
    Underlying the red teaming and data analysis process.

pith-pipeline@v0.9.0 · 5668 in / 1224 out tokens · 76750 ms · 2026-05-12T01:32:45.315412+00:00 · methodology

discussion (0)


Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  2. Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

    cs.CR 2026-05 unverdicted novelty 7.0

    Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

  3. Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery

    cs.CR 2026-05 unverdicted novelty 7.0

    PCAP conditions adversarial searches on attacker personas to raise attack success rates from ~58% to ~97% on large models while increasing prompt diversity.

  4. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  5. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 7.0

    Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

  6. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  7. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  8. Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

    cs.LG 2026-04 unverdicted novelty 7.0

    Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

  9. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

    cs.CL 2026-04 unverdicted novelty 7.0

    R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

  10. Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

    cs.LG 2026-03 unverdicted novelty 7.0

    Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

  11. KTO: Model Alignment as Prospect Theoretic Optimization

    cs.LG 2024-02 conditional novelty 7.0

    KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

  12. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  13. Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

    cs.LG 2026-05 unverdicted novelty 6.0

    PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.

  14. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  15. Architecture, Not Scale: Circuit Localization in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.

  16. Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

    cs.HC 2026-05 unverdicted novelty 6.0

    A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.

  17. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  18. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 6.0

    PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...

  19. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

    cs.AI 2026-05 unverdicted novelty 6.0

    An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...

  20. From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

    cs.CL 2026-04 unverdicted novelty 6.0

    Paired analysis of 1250 LLM interactions shows 61% of responses de-escalate harm, 36% maintain severity, and 3% escalate, with sexual content persisting far more than other categories.

  21. Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM

    cs.LG 2026-04 unverdicted novelty 6.0

    Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.

  22. Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.

  23. Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

    cs.CY 2026-04 unverdicted novelty 6.0

    Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.

  24. AVISE: Framework for Evaluating the Security of AI Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.

  25. SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.

  26. AlignCultura: Towards Culturally Aligned Large Language Models?

    cs.CL 2026-04 unverdicted novelty 6.0

    Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

  27. Reasoning Structure Matters for Safety Alignment of Reasoning Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

  28. Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

    cs.AI 2026-04 unverdicted novelty 6.0

    Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

  29. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    cs.CR 2024-04 unverdicted novelty 6.0

    Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.

  30. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    cs.CL 2023-10 conditional novelty 6.0

    AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than base...

  31. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  32. Jailbroken: How Does LLM Safety Training Fail?

    cs.LG 2023-07 unverdicted novelty 6.0

    LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.

  33. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  34. When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    cs.LG 2026-05 unverdicted novelty 5.0

    A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.

  35. A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    cs.CR 2026-05 accept novelty 5.0

    The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

  36. Surrogate modeling for interpreting black-box LLMs in medical predictions

    cs.CL 2026-04 unverdicted novelty 5.0

    A surrogate modeling method approximates LLM-encoded medical knowledge via prompting to quantify variable influence and flag inaccuracies and racial biases.

  37. Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

    cs.SE 2026-04 unverdicted novelty 5.0

    Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.

  38. FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

    cs.CR 2026-04 unverdicted novelty 5.0

    FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.

  39. Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

    cs.CL 2026-04 unverdicted novelty 5.0

    Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.

  40. PaLM 2 Technical Report

    cs.CL 2023-05 unverdicted novelty 5.0

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  41. Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

    cs.AI 2026-05 unverdicted novelty 4.0

    Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.

  42. Brainrot: Deskilling and Addiction are Overlooked AI Risks

    cs.CY 2026-05 unverdicted novelty 3.0

    AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.

  43. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 42 Pith papers · 11 internal anchors

  1. [1]

    A. Abid, M. Farooqi, and J. Zou. Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6):461–463, June 2021. Number: 6 Publisher: Nature Publishing Group

  2. [2]

    A General Language Assistant as a Laboratory for Alignment

    A. Askell, Y . Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan. A General Language Assistant as a Labora- tory for Alignment. arXiv:2112.00861 [cs], Dec. 2021. arXi...

  3. [3]

    S. Avin, H. Belfield, M. Brundage, G. Krueger, J. Wang, A. Weller, M. Anderljung, I. Krawczuk, D. Krueger, J. Lebensold, T. Maharaj, and N. Zilberman. Filling gaps in trustworthy development of AI. Science, Dec. 2021. Publisher: American Association for the Advancement of Science

  4. [4]

    Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. ...

  5. [5]

    P. Barrett. Research Highlights | Who Moderates the Social Media Giants? A Call to End Outsourcing - NYU Stern

  6. [6]

    Improving question answering model robustness with synthetic adversarial data generation

    M. Bartolo, T. Thrush, R. Jia, S. Riedel, P. Stenetorp, and D. Kiela. Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages 8830–8848, 2021. arXiv:2104.08678 [cs]

  7. [7]

    Basta, M

    C. Basta, M. R. Costa-jussà, and N. Casas. Evaluating the Underlying Gender Bias in Contextualized Word Embeddings. arXiv:1904.08783 [cs], Apr. 2019. arXiv: 1904.08783

  8. [8]

    Bates, M

    D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1):1–48, 2015

  9. [9]

    E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? InProceedings of the 2021 ACM Conference on Fairness, Account- ability, and Transparency, FAccT ’21, pages 610–623, New York, NY , USA, Mar. 2021. Association for Computing Machinery

  10. [10]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

  11. [11]

    Toward trustworthy ai development: mechanisms for supporting verifiable claims.arXiv preprint arXiv:2004.07213, 2020

    M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. B. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askel...

  12. [12]

    Buchanan, A

    B. Buchanan, A. Lohn, M. Musser, and K. Sedova. Truth, Lies, and Automation, May 2021

  13. [13]

    Extracting training data from large language models

    N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel. Extracting Training Data from Large Language Models. arXiv:2012.07805 [cs], June 2021. arXiv: 2012.07805

  14. [14]

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep Reinforcement Learning from Human Preferences. In I. Guyon, U. V . Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish- wanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017

  15. [15]

    B. Dang, M. J. Riedl, and M. Lease. But Who Protects the Moderators? The Case of Crowdsourced Image Moderation, Jan. 2020. arXiv:1804.10999 [cs]

  16. [16]

    A. Das, B. Dang, and M. Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 8(1):33–42, Oct. 2020

  17. [17]

    Diener, D

    E. Diener, D. Wirtz, W. Tov, C. Kim-Prieto, D.-w. Choi, S. Oishi, and R. Biswas-Diener. New Well- being Measures: Short Scales to Assess Flourishing and Positive and Negative Feelings. Social Indica- tors Research, 97(2):143–156, June 2010

  18. [18]

    Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau, and Verena Rieser

    E. Dinan, G. Abercrombie, A. S. Bergman, S. Spruit, D. Hovy, Y .-L. Boureau, and V . Rieser. Antici- pating Safety Issues in E2E Conversational AI: Framework and Tooling. arXiv:2107.03451 [cs], July

  19. [19]

    Dinan, A

    E. Dinan, A. Fan, A. Williams, J. Urbanek, D. Kiela, and J. Weston. Queens are Powerful too: Mit- igating Gender Bias in Dialogue Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online, Nov. 2020. Association for Computational Linguistics

  20. [20]

    E. Dinan, S. Humeau, B. Chintagunta, and J. Weston. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack, Aug. 2019. arXiv:1908.06083 [cs]

  21. [21]

    L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pages 67–73, New York, NY, USA, Dec. 2018. Association for Computing Machinery

  22. [22]

    D. Ganguli, D. Hernandez, L. Lovitt, N. DasSarma, T. Henighan, A. Jones, N. Joseph, J. Kernion, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, S. Johnston, S. Kravec, N. Nanda, K. Ndousse, C. Olsson, D. Amodei, D. Amodei, T. Brown, J. Kaplan, S. McCandlish, C. Olah, and J. Clark. Predictability and Surprise in Large Generative Models. arXiv:2202.07785 [cs], 2022

  23. [23]

    S. Garg, V. Perot, N. Limtiaco, A. Taly, E. H. Chi, and A. Beutel. Counterfactual Fairness in Text Classification through Robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, pages 219–226, New York, NY, USA, Jan. 2019. Association for Computing Machinery

  24. [24]

    T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for Datasets. arXiv:1803.09010 [cs], Dec. 2021

  25. [25]

    S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv:2009.11462 [cs], 2020

  26. [26]

    M. Gray and S. Suri. Ghost Work. Mariner Books, 2019

  27. [27]

    E. A. Holmes, E. L. James, T. Coode-Bate, and C. Deeprose. Can Playing the Computer Game “Tetris” Reduce the Build-Up of Flashbacks for Trauma? A Proposal from Cognitive Science. PLOS ONE, 4(1):e4153, Jan. 2009. Publisher: Public Library of Science

  28. [28]

    B. Hutchinson, V. Prabhakaran, E. Denton, K. Webster, Y. Zhong, and S. Denuyl. Social Biases in NLP Models as Barriers for Persons with Disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online, July 2020. Association for Computational Linguistics

  29. [29]

    R. Jia and P. Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics

  30. [30]

    Y. Jiang and M. Bansal. Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2726–2736, Florence, Italy, July 2019. Association for Computational Linguistics

  31. [31]

    S. Karunakaran and R. Ramakrishan. Testing Stylistic Interventions to Reduce Emotional Impact of Content Moderation Workers. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7:50–58, Oct. 2019

  32. [32]

    D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams. Dynabench: Rethinking Benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, 2021

  33. [33]

    K. Kurita, N. Vyas, A. Pareek, A. W. Black, and Y. Tsvetkov. Measuring Bias in Contextualized Word Representations. arXiv:1906.07337 [cs], June 2019

  34. [34]

    P. P. Liang, C. Wu, L.-P. Morency, and R. Salakhutdinov. Towards Understanding and Mitigating Social Biases in Language Models. arXiv:2106.13219 [cs], June 2021

  35. [35]

    S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958 [cs], Sept. 2021

  36. [36]

    A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. arXiv:2105.03023 [cs], June 2021

  37. [37]

    K. McGuffie and A. Newhouse. The Radicalization Risks of GPT-3 and Advanced Neural Language Models. arXiv:2009.06807 [cs], Sept. 2020

  38. [38]

    L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, Sept. 2020. arXiv:1802.03426 [cs, stat]

  39. [39]

    P. Mishkin, L. Ahmad, M. Brundage, G. Krueger, and G. Sastry. DALL·E 2 Preview - Risks and Limitations, 2022

  40. [40]

    Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online, July 2020. Association for Computational Linguistics

  41. [41]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, Mar. 2022. arXiv:2203.02155 [cs]

  43. [43]

    E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. Red Teaming Language Models with Language Models. arXiv:2202.03286 [cs], Feb. 2022

  44. [44]

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. v. d. Driessche, L. A. Hendricks, M. Rauh, P.-S. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. El...

  45. [45]

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents, Apr. 2022. arXiv:2204.06125 [cs]

  46. [46]

    M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online, July 2020. Association for Computational Linguistics

  47. [47]

    P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, H. Margetts, and J. Pierrehumbert. HateCheck: Functional Tests for Hate Speech Detection Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 41–58, Online, Aug. 2021

  48. [48]

    M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y. Choi. Social Bias Frames: Reasoning about Social and Power Implications of Language. arXiv:1911.03891 [cs], Apr. 2020

  49. [49]

    I. Solaiman and C. Dennison. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. arXiv:2106.10328 [cs], Nov. 2021

  50. [50]

    M. Steiger, T. J. Bharucha, S. Venkatagiri, M. J. Riedl, and M. Lease. The Psychological Well-Being of Content Moderators: The Emotional Labor of Commercial Moderation and Avenues for Improving Support. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, pages 1–14, New York, NY, USA, May 2021. Association for Computing Machinery

  51. [51]

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks, Feb. 2014. arXiv:1312.6199 [cs]

  52. [52]

    A. Tamkin, M. Brundage, J. Clark, and D. Ganguli. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. arXiv:2102.02503 [cs], Feb. 2021

  53. [53]

    E. R. Thompson. Development and Validation of an Internationally Reliable Short-Form of the Positive and Negative Affect Schedule (PANAS). Journal of Cross-Cultural Psychology, 38(2):227–242, Mar. 2007. Publisher: SAGE Publications Inc

  55. [55]

    R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. Meier-Hellstern, M. R. Morris, et al. LaMDA: Language Models for Dialog Applications, 2022

  56. [56]

    U.S. Census Bureau. U.S. Census Bureau QuickFacts: United States, July 2021

  57. [57]

    E. Wallace, A. Williams, R. Jia, and D. Kiela. Analyzing Dynamic Adversarial Training Data in the Limit, Oct. 2021. arXiv:2110.08514 [cs]

  58. [58]

    D. Watson, L. A. Clark, and A. Tellegen. Development and validation of brief measures of positive and negative affect: the PANAS scales. Journal of Personality and Social Psychology, 54(6):1063–1070, June 1988

  59. [59]

    L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from Language Models. arXiv:2112.04359 [cs], Dec. 2021

  60. [60]

    J. Welbl, A. Glaese, J. Uesato, S. Dathathri, J. Mellor, L. A. Hendricks, K. Anderson, P. Kohli, B. Coppin, and P.-S. Huang. Challenges in Detoxifying Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2447–2469, Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics

  62. [62]

    H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. K. Jain. Adversarial Attacks and Defenses in Images, Graphs and Text: A Review, Oct. 2019. arXiv:1909.08072 [cs, stat]

  63. [63]

    J. Xu, D. Ju, M. Li, Y.-L. Boureau, J. Weston, and E. Dinan. Bot-Adversarial Dialogue for Safe Conversational Agents. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, 2021

  64. [64]

    D. M. Ziegler, S. Nix, L. Chan, T. Bauman, P. Schmidt-Nielsen, T. Lin, A. Scherlis, N. Nabeshima, B. Weinstein-Raun, D. de Haas, B. Shlegeris, and N. Thomas. Adversarial Training for High-Stakes Reliability, May 2022. arXiv:2205.01663 [cs]