Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Amanda Askell, Deep Ganguli, Jackson Kernion, Liane Lovitt, Saurav Kadavath, Yuntao Bai · 2022 · cs.CL · arXiv 2209.07858

Canonical reference. 86% of citing Pith papers cite this work as background.

147 Pith papers citing it

Background 86% of classified citations

abstract

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
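As a small illustration of the experimental design the abstract describes, the sketch below enumerates the 3 model sizes against the 4 model types. The variable names and the short type labels are assumptions made for this example and are not taken from the paper's code.

```python
from itertools import product

# Sizes and model types as listed in the abstract; the labels are shorthand.
MODEL_SIZES = ["2.7B", "13B", "52B"]
MODEL_TYPES = [
    "plain LM",
    "prompted LM (helpful, honest, harmless)",
    "rejection sampling",
    "RLHF",
]

# Every (size, type) pair is one red-teaming condition.
grid = list(product(MODEL_SIZES, MODEL_TYPES))

for size, model_type in grid:
    print(f"{size:>5}  {model_type}")

print(f"{len(grid)} conditions in total")
```

Crossing the two lists gives 12 conditions, which is the grid over which the abstract's scaling trends are reported.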

citation-role summary

background 24 · method 3 · dataset 1
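The 86% background share quoted near the top of the page appears to follow from these counts. The sketch below assumes that the percentage is taken over the 28 classified citations rather than over all 147 citing papers. The function name `role_shares` is hypothetical.

```python
# Citation-role counts as listed in the summary above.
role_counts = {"background": 24, "method": 3, "dataset": 1}


def role_shares(counts):
    """Return each role's share of all classified citations, in percent."""
    total = sum(counts.values())
    return {role: 100 * n / total for role, n in counts.items()}


shares = role_shares(role_counts)
for role, pct in shares.items():
    print(f"{role}: {pct:.0f}%")
```

Under that assumption, 24 of 28 classified citations is about 85.7%, which rounds to the 86% shown.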

citation-polarity summary

background 24 · use method 2 · support 1 · use dataset 1

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

cs.CL · 2026-06-18 · unverdicted · novelty 8.0

Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

RedVox benchmark shows speech model safety and fairness vulnerabilities persist under non-adversarial conditions, worsen in non-English languages, and increase with spoken inputs.

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

cs.CR · 2026-06-16 · accept · novelty 7.0

SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

Introduces ChiSafe-PAS, a 1,897-prompt human-annotated Chinese adversarial benchmark for LLM safety with 3-class labels, 9-category obfuscation taxonomy, and domain coverage in self-harm, drugs, fraud, and satire.

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Releases the first public safety evaluation dataset for Albanian LLMs with 2,951 prompts spanning 11 categories including self-harm, violence, and radicalization.

The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

Introduces KIDBench benchmark for child-facing LLM safety, showing implicit and explicit child context cues raise safety scores 9-77% while multi-turn interactions degrade quality 6-24%.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

PCAP conditions adversarial searches on attacker personas to raise attack success rates from ~58% to ~97% on large models while increasing prompt diversity.

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

Green Shielding: A User-Centric Approach Towards Trustworthy AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

cs.LG · 2026-03-30 · unverdicted · novelty 7.0

Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation

cs.CY · 2026-03-27 · conditional · novelty 7.0

M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.

Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

cs.LG · 2026-02-26 · conditional · novelty 7.0

Direction-flipped influence audits show contextual cues shift LLM moral choices by 12-18 points on average across multiple benchmarks, revealing asymmetries, backfires, and inconsistencies in 40% of conditions.

citing papers explorer

Showing 47 of 147 citing papers.

Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety cs.CL · 2026-06-26 · unverdicted · none · ref 11 · internal anchor
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
One Year Later...The Harms Persist, But So Do We! cs.CL · 2026-06-22 · unverdicted · none · ref 26 · 2 links · internal anchor
LLM safety guardrails fail for most mental health conditions with up to 100% failure rates for eating disorders, substance use disorder, and major depressive disorder, while holding only for suicide and self-harm.
SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data cs.AI · 2026-06-15 · unverdicted · none · ref 1 · internal anchor
SpecAlign synthesizes boundary-aware preference pairs directly from structured model specifications to train LLMs for improved rule compliance.
Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails cs.CL · 2026-06-04 · unverdicted · none · ref 11 · internal anchor
An audit finds language model filters and guardrails disproportionately suppress mentions of marginalized groups via lexical cues while failing to catch explicit harms.
Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning cs.LG · 2026-06-01 · unverdicted · none · ref 79 · internal anchor
DualSelect couples task and reference selection via a minimax framework with entropy-regularized scoring to preserve safety in LLM fine-tuning, reporting at least 5.10 point gains in Safety Avg. over baselines on 1B-8B models.
Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing cs.LG · 2026-05-30 · unverdicted · none · ref 29 · internal anchor
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.
Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums? cs.CY · 2026-05-30 · unverdicted · none · ref 41 · internal anchor
Four deployment choices—model version, open/closed weight status, provider, and system prompt—each alter LLM-agent intervention rates on forum posts, with closed-weight models declining more on visible challenges than open-weight models.
Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles cs.HC · 2026-05-28 · unverdicted · none · ref 15 · internal anchor
LLM support roles in Alzheimer's caregiving queries systematically alter interactional risk prevalence and composition, with directive roles rated higher in quality despite elevated risks.
Soft Specialists: α-Rényi Ensembles for Uncertainty-Aware LLM Post-Training stat.ML · 2026-05-26 · unverdicted · none · ref 35 · internal anchor
An α-Rényi variational ensemble method learns distributions over LoRA adapter parameters for uncertainty-aware LLM post-training, balancing individual model plausibility with complementary specialization.
Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting cs.CL · 2026-05-21 · unverdicted · none · ref 3 · internal anchor
CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.
Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications cs.CR · 2026-05-17 · unverdicted · none · ref 16 · internal anchor
Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels cs.LG · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts cs.CR · 2026-05-04 · accept · none · ref 38 · internal anchor
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
Surrogate modeling for interpreting black-box LLMs in medical predictions cs.CL · 2026-04-22 · unverdicted · none · ref 40 · internal anchor
A surrogate modeling method approximates LLM-encoded medical knowledge via prompting to quantify variable influence and flag inaccuracies and racial biases.
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility cs.SE · 2026-04-16 · unverdicted · none · ref 22 · internal anchor
Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization cs.CR · 2026-04-08 · unverdicted · none · ref 21 · internal anchor
FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts cs.CL · 2026-04-03 · unverdicted · none · ref 10 · internal anchor
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations cs.AI · 2026-03-18 · unverdicted · none · ref 8 · internal anchor
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications cs.SE · 2026-03-13 · unverdicted · none · ref 8 · internal anchor
An automated self-testing framework with evidence-based quality gates for LLM application releases was evaluated in a longitudinal case study of a multi-agent conversational AI system, identifying rollback builds and supporting stable quality over four weeks.
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs cs.CR · 2025-11-04 · unverdicted · none · ref 12 · internal anchor
ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
Users as Annotators: LLM Preference Learning from Comparison Mode cs.CL · 2025-10-10 · unverdicted · none · ref 9 · internal anchor
Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.
Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks cs.LG · 2025-09-08 · unverdicted · none · ref 54 · internal anchor
Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.
CASE: An Agentic AI Framework for Enhancing Scam Intelligence in Digital Payments cs.AI · 2025-08-27 · unverdicted · none · ref 5 · internal anchor
CASE is a novel agentic AI system that proactively interviews scam victims using LLMs to collect detailed intelligence, which is then structured for use in scam prevention, resulting in a 21% increase in enforcements on Google Pay India.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning cs.RO · 2025-03-05 · unverdicted · none · ref 54 · internal anchor
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 46 · internal anchor
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation cs.CL · 2026-06-24 · unverdicted · none · ref 6 · internal anchor
Introduces a multi-role red teaming framework using attacker and jury models that increases attack success rates by up to 7.9% on LLM faithfulness in question-answering tasks.
The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI cs.CL · 2026-06-18 · unverdicted · none · ref 24 · internal anchor
Explores options for using LLMs to scale deliberation and empower marginalized groups via systemic-functional linguistics concepts while cautioning against over- and under-claiming.
The Case for Model Science: Verify, Explore, Steer, Refine cs.AI · 2026-05-31 · unverdicted · none · ref 49 · internal anchor
Position paper proposing Model Science as a discipline to systematically analyze AI model behavior beyond benchmarks, drawing analogies from cognitive science, neuroscience, medicine, and agriculture.
An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding cs.AI · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
Empirical audit of k-NAF in Anchored Decoding finds budgets are not exhausted on tested workloads, with high proxy ratios attributable to small-sample artifacts.
Position: AI Safety Requires Effective Controllability cs.AI · 2026-05-26 · unverdicted · none · ref 5 · internal anchor
Position paper claiming that AI safety requires explicit runtime controllability and introducing ControlBench to demonstrate gaps in existing alignment methods.
Responsible Agentic AI Requires Explicit Provenance cs.AI · 2026-05-16 · unverdicted · none · ref 20 · internal anchor
Explicit provenance across the full agentic AI lifecycle is the necessary condition for making responsibility computable and actionable.
Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration cs.HC · 2026-05-08 · unverdicted · none · ref 104 · internal anchor
A review synthesizes affective dynamics as a coordination layer in human-AI agent collaboration and proposes a framework for trust calibration, delegation, error correction, and governance.
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems cs.AI · 2026-05-05 · unverdicted · none · ref 36 · internal anchor
Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.
Responsible Federated LLMs via Safety Filtering and Constitutional AI cs.CL · 2025-02-23 · unverdicted · none · ref 11 · internal anchor
Integrates safety filtering and constitutional AI into FedLLM, reporting over 20% safety improvement on AdvBench.
OpenAI o1 System Card cs.AI · 2024-12-21 · unverdicted · none · ref 8 · internal anchor
OpenAI reports that chain-of-thought reasoning in o1 models enables deliberative alignment, yielding state-of-the-art results on selected safety benchmarks for illicit advice, stereotypes, and jailbreaks.
AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions cs.AI · 2024-08-23 · unverdicted · none · ref 223 · internal anchor
The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey cs.CR · 2024-07-05 · accept · none · ref 26 · internal anchor
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Online Safety Monitoring for LLMs cs.AI · 2026-07-02 · unverdicted · none · ref 8 · internal anchor
Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.
AI Alignment From Social Choice Perspectives cs.AI · 2026-06-19 · unverdicted · none · ref 79 · internal anchor
This survey examines applications of social choice theory to aggregating human feedback in AI alignment, identifying failure modes and expanding design options for disagreement.
Understanding Censorship in Large Language Models: From Mechanisms to Governance cs.CY · 2026-06-16 · unverdicted · none · ref 22 · internal anchor
Synthesizes mechanisms of LLM censorship across the model lifecycle and argues that the key issue is making moderation proportionate, accountable, pluralistic, and auditable rather than debating whether moderation should occur.
AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue cs.CL · 2026-05-13 · unverdicted · none · ref 28 · internal anchor
AERIC uses a 387-parameter head on LLM hidden states for same-pass anticipatory detection of implicit harm, reporting AUROC gains on DiaSafety and Harmful Advice plus low-latency trigger rates on HarmBench and SocialHarmBench.
Brainrot: Deskilling and Addiction are Overlooked AI Risks cs.CY · 2026-05-05 · unverdicted · none · ref 37 · internal anchor
AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.
A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 134 · internal anchor
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
SAGE Celer 2.6 Technical Card cs.CL · 2026-03-24 · unverdicted · none · ref 5 · internal anchor
SAGE Celer 2.6 is a new line of language models with inverse reasoning training, integrated vision, and strong performance on math, coding, and South Asian language benchmarks.
A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 178 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
RLHF May Not Reflect Genuine Preferences cs.HC · 2026-01-31 · unreviewed · ref 4 · internal anchor
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unreviewed · ref 44 · internal anchor
