super hub Canonical reference

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Amanda Askell, Deep Ganguli, Jackson Kernion, Liane Lovitt, Saurav Kadavath, Yuntao Bai · 2022 · cs.CL · arXiv 2209.07858

Canonical reference. 86% of citing Pith papers cite this work as background.

146 Pith papers citing it

Background 86% of classified citations

open full Pith review browse 146 citing papers more from Amanda Askell arXiv PDF

abstract

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 24 method 3 dataset 1

citation-polarity summary

background 24 use method 2 support 1 use dataset 1

claims ledger

abstract We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to re

authors

Amanda Askell Deep Ganguli Jackson Kernion Liane Lovitt Saurav Kadavath Yuntao Bai

co-cited works

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

cs.CL · 2026-06-18 · unverdicted · novelty 8.0

Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

RedVox benchmark shows speech model safety and fairness vulnerabilities persist under non-adversarial conditions, worsen in non-English languages, and increase with spoken inputs.

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

cs.CR · 2026-06-16 · accept · novelty 7.0

SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

Introduces ChiSafe-PAS, a 1,897-prompt human-annotated Chinese adversarial benchmark for LLM safety with 3-class labels, 9-category obfuscation taxonomy, and domain coverage in self-harm, drugs, fraud, and satire.

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Releases the first public safety evaluation dataset for Albanian LLMs with 2,951 prompts spanning 11 categories including self-harm, violence, and radicalization.

The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

Introduces KIDBench benchmark for child-facing LLM safety, showing implicit and explicit child context cues raise safety scores 9-77% while multi-turn interactions degrade quality 6-24%.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

PCAP conditions adversarial searches on attacker personas to raise attack success rates from ~58% to ~97% on large models while increasing prompt diversity.

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

Green Shielding: A User-Centric Approach Towards Trustworthy AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

cs.LG · 2026-03-30 · unverdicted · novelty 7.0

Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation

cs.CY · 2026-03-27 · conditional · novelty 7.0

M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.

Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

cs.LG · 2026-02-26 · conditional · novelty 7.0

Direction-flipped influence audits show contextual cues shift LLM moral choices by 12-18 points on average across multiple benchmarks, revealing asymmetries, backfires, and inconsistencies in 40% of conditions.

citing papers explorer

Showing 11 of 11 citing papers after filters.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs cs.CY · 2026-06-27 · unverdicted · none · ref 8 · internal anchor
Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation cs.CY · 2026-03-27 · conditional · none · ref 1 · internal anchor
M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.
"Unlimited Realm of Exploration and Experimentation": Methods and Motivations of AI-Generated Sexual Content Creators cs.CY · 2026-01-28 · conditional · none · ref 43 · internal anchor
Interviews with 28 AIG-SC creators show motivations spanning sexual exploration, creative expression, technical experimentation, and occasional production of non-consensual intimate imagery.
Affective AI Safety: The Missing Piece in LLM Safety cs.CY · 2026-06-22 · unverdicted · none · ref 33 · internal anchor
Proposes affective safety as a distinct class of AI harms with a taxonomy of self-alienation, bias, and relational harms, arguing that existing safety frameworks address it narrowly or not at all and calling for dedicated approaches focused on cumulative and identity-level effects.
Efficient Safety Benchmarking via Item Response Theory cs.CY · 2026-05-26 · unverdicted · none · ref 2 · internal anchor
Item Response Theory enables adaptive and fixed-subset item selection that reduces safety benchmark costs by 80-99.9% while preserving high correlation with full rankings.
Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South cs.CY · 2026-05-18 · unverdicted · none · ref 27 · internal anchor
A participatory red-teaming project in the Global South created the PLACES dataset of 26k T2I failure examples that reveal unique cultural and linguistic harms missed by existing safety frameworks.
Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles cs.CY · 2026-04-22 · unverdicted · none · ref 11 · internal anchor
Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.
Evaluating AI Providers' Frontier Safety Frameworks cs.CY · 2025-12-01 · unverdicted · none · ref 42 · internal anchor
Twelve frontier AI safety frameworks score between 8% and 34% on adapted risk-management criteria, with a median of 18%, leaving them too vague to serve as reliable external accountability mechanisms.
Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums? cs.CY · 2026-05-30 · unverdicted · none · ref 41 · internal anchor
Four deployment choices—model version, open/closed weight status, provider, and system prompt—each alter LLM-agent intervention rates on forum posts, with closed-weight models declining more on visible challenges than open-weight models.
Understanding Censorship in Large Language Models: From Mechanisms to Governance cs.CY · 2026-06-16 · unverdicted · none · ref 22 · internal anchor
Synthesizes mechanisms of LLM censorship across the model lifecycle and argues that the key issue is making moderation proportionate, accountable, pluralistic, and auditable rather than debating whether moderation should occur.
Brainrot: Deskilling and Addiction are Overlooked AI Risks cs.CY · 2026-05-05 · unverdicted · none · ref 37 · internal anchor
AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer