
arxiv: 2209.07858 · v2 · submitted 2022-08-23 · 💻 cs.CL · cs.AI · cs.CY

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Pith reviewed 2026-05-12 01:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords red teaming · language models · RLHF · scaling behaviors · harmful outputs · safety evaluation · dataset release

The pith

RLHF-trained language models become progressively harder to red-team into harmful outputs as they scale up in size, while other training approaches show no such improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether red teaming—deliberately prompting models to produce harmful responses—reveals different patterns of vulnerability depending on model size and training method. It compares plain language models, prompted helpful-honest-harmless models, rejection-sampling models, and RLHF models across three sizes from 2.7B to 52B parameters. The central result is that only the RLHF models grow harder to attack with scale; success rates for the other three categories remain roughly flat. The authors also release the full set of 38,961 attacks they collected and lay out their exact procedures so others can replicate or improve on the work.

Core claim

Across the tested model sizes and types, RLHF models show a clear increase in resistance to red team attacks as parameter count grows, whereas plain LMs, prompted LMs, and rejection-sampling LMs exhibit flat trends in attack success rate with scale. The work further catalogs a wide range of elicited harms, from overt offensive language to subtler non-violent unethical content, and supplies the complete attack dataset together with detailed methodology for community use.

What carries the argument

Comparative red-teaming success rate measured across four model training regimes (plain LM, prompted HH, rejection sampling, RLHF) at three parameter scales, with the RLHF regime as the variable that produces the observed scaling improvement in resistance.
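
As a rough illustration of the comparison described above, the sketch below groups attack records by training regime and parameter count and computes the fraction judged successful in each cell. The record layout (`model_type`, `n_params`, `success`) and the toy values are hypothetical, chosen only for this example; they are not the released dataset's schema, and a binary success flag is a simplifying assumption rather than the paper's own metric.

```python
from collections import defaultdict

def success_rates(records):
    """Return {(model_type, n_params): fraction of attacks flagged successful}.

    Field names are hypothetical and used only for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = (r["model_type"], r["n_params"])
        totals[key] += 1
        if r["success"]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Made-up records for illustration only; not data from the paper.
toy_records = [
    {"model_type": "rlhf", "n_params": 2.7e9, "success": True},
    {"model_type": "rlhf", "n_params": 2.7e9, "success": False},
    {"model_type": "rlhf", "n_params": 52e9, "success": False},
    {"model_type": "plain_lm", "n_params": 52e9, "success": True},
]
toy_rates = success_rates(toy_records)
```

Reading the resulting table row by row (one regime, increasing size) is what exposes a scaling trend or its absence.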

If this is right

  • Larger RLHF models will likely need more advanced or automated red-teaming methods to continue uncovering residual harms.
  • The released attack dataset supplies a public benchmark that future safety methods can be measured against.
  • Training regimes other than RLHF do not appear to confer the same scaling advantage in resistance to attack.
  • Transparency in red-teaming procedures enables shared standards for evaluating model safety across labs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaling pattern holds, RLHF-style training may provide a practical route for safety to improve alongside raw capability at larger scales.
  • The flat trends for non-RLHF models suggest that prompt engineering or rejection sampling alone are unlikely to close the safety gap as models grow.
  • The methods could be extended to test whether similar scaling resistance appears in multimodal or agentic systems trained with comparable feedback.

Load-bearing premise

The particular red-teaming instructions, prompts, and attack strategies used in the study are comprehensive enough to surface most or all of the harmful behaviors these models can exhibit.

What would settle it

A follow-up experiment that applies the same or closely matched attack distribution to a substantially larger RLHF model (for example 100B+ parameters) and measures an attack success rate that does not continue to decline, or that rises, would falsify the reported scaling trend for RLHF.
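
One simple way to phrase this test numerically is to fit a line to attack success rate against the logarithm of parameter count and look at the sign of the slope. The sketch below does that with an ordinary least-squares fit. The function name and the toy numbers are assumptions made for illustration; the paper's own statistical procedure may differ.

```python
import math

def log_scale_slope(points):
    """Least-squares slope of success rate against log10(parameter count).

    `points` is a list of (n_params, success_rate) pairs. A clearly negative
    slope means attacks succeed less often as models grow.
    """
    xs = [math.log10(n) for n, _ in points]
    ys = [rate for _, rate in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up success rates for illustration only; not results from the paper.
declining_toy = [(2.7e9, 0.30), (13e9, 0.20), (52e9, 0.10)]
flat_toy = [(2.7e9, 0.40), (13e9, 0.40), (52e9, 0.40)]

declining_slope = log_scale_slope(declining_toy)
flat_slope = log_scale_slope(flat_toy)
```

Under this framing, adding a measurement from a much larger model and refitting would show whether the slope stays negative, flattens, or reverses.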

Original abstract

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper describes early efforts to red team language models to discover, measure, and reduce harmful outputs. It makes three contributions: (1) an investigation of scaling behaviors for red teaming success across three model sizes (2.7B, 13B, 52B) and four model types (plain LM, prompted helpful/honest/harmless, rejection sampling, and RLHF), finding that RLHF models become increasingly difficult to red team with scale while the other types show flat trends; (2) release of a dataset containing 38,961 red team attacks; and (3) detailed descriptions of instructions, processes, statistical methodologies, and uncertainties, along with analysis of the harmful outputs elicited (ranging from offensive language to subtle unethical behaviors).

Significance. If the scaling trends prove robust, the work supplies concrete empirical data on how alignment methods like RLHF affect vulnerability to adversarial elicitation of harms, informing safer deployment of larger models. The public release of the large attack dataset is a clear asset that enables independent verification and further research on red teaming techniques. The paper's emphasis on methodological transparency and explicit discussion of uncertainties is a positive contribution toward community standards in AI safety evaluation.

major comments (1)
  1. [Scaling behaviors] Scaling behaviors section (and abstract claim): The central result that RLHF models are increasingly difficult to red team with scale, while other model types remain flat, assumes consistent red teaming effort and strategy across conditions. The manuscript does not report per-model metrics on attack persistence (e.g., average turns per conversation, number of unique prompt variants tried, or stopping criteria) or indicate whether red teamers were blinded to model identity or type. Without such controls, lower success rates on larger RLHF models could reflect differences in human effort or adaptation rather than intrinsic scaling of refusal behavior. Although the released dataset permits post-hoc checks, the paper should include an analysis of effort-related statistics across the four model types to support the scaling interpretation.
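
The effort analysis requested here could start from something as simple as the mean number of turns per conversation in each condition. The sketch below assumes each conversation is a dictionary with hypothetical keys `model_type`, `n_params`, and `n_turns`; these names and the toy values are illustrative, not the released dataset's schema.

```python
from collections import defaultdict
from statistics import mean

def mean_turns_by_condition(conversations):
    """Return {(model_type, n_params): mean number of turns per conversation}."""
    grouped = defaultdict(list)
    for conv in conversations:
        grouped[(conv["model_type"], conv["n_params"])].append(conv["n_turns"])
    return {key: mean(turns) for key, turns in grouped.items()}

# Made-up conversations for illustration only.
toy_conversations = [
    {"model_type": "rlhf", "n_params": 52e9, "n_turns": 4},
    {"model_type": "rlhf", "n_params": 52e9, "n_turns": 6},
    {"model_type": "plain_lm", "n_params": 52e9, "n_turns": 3},
]
toy_effort = mean_turns_by_condition(toy_conversations)
```

If mean turns were markedly lower for the larger RLHF models, reduced attacker persistence would be a live alternative explanation for their lower success rates.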
minor comments (2)
  1. [Methods] Methods section: While the paper states it exhaustively describes statistical methodologies, adding explicit formulas or pseudocode for how red team success rates and uncertainty estimates were computed (including any adjustments for multiple comparisons across model sizes) would improve reproducibility.
  2. [Dataset] Dataset description: The release of 38,961 attacks is valuable, but the paper would benefit from additional metadata on red teamer demographics, experience levels, and any training provided, to allow readers to assess potential sources of bias in the attack distribution.
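
To make the first minor comment concrete, here is one possible form such pseudocode could take: a Wilson score interval for a success proportion, with an optional Bonferroni adjustment when several model sizes are compared. This is a standard textbook construction offered as a sketch; it is not claimed to be the method the paper used, and the function and parameter names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, alpha=0.05, n_comparisons=1):
    """Wilson score interval for a binomial proportion.

    `n_comparisons` applies a Bonferroni correction by dividing alpha,
    which widens the interval when several conditions are compared.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    adjusted_alpha = alpha / n_comparisons
    z = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Toy counts for illustration only: 50 successful attacks out of 100.
plain_lo, plain_hi = wilson_interval(50, 100)
adj_lo, adj_hi = wilson_interval(50, 100, n_comparisons=3)
```

The Wilson form is chosen here because it behaves sensibly for proportions near 0 or 1, which is where success rates on well-defended models would sit.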

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the paper's contributions to red teaming methods and the public dataset release. We address the major comment below.

point-by-point responses
  1. Referee: [Scaling behaviors] Scaling behaviors section (and abstract claim): The central result that RLHF models are increasingly difficult to red team with scale, while other model types remain flat, assumes consistent red teaming effort and strategy across conditions. The manuscript does not report per-model metrics on attack persistence (e.g., average turns per conversation, number of unique prompt variants tried, or stopping criteria) or indicate whether red teamers were blinded to model identity or type. Without such controls, lower success rates on larger RLHF models could reflect differences in human effort or adaptation rather than intrinsic scaling of refusal behavior. Although the released dataset permits post-hoc checks, the paper should include an analysis of effort-related statistics across the four model types to support the scaling interpretation.

    Authors: We thank the referee for highlighting this potential confound in our scaling analysis. We agree that the absence of reported effort metrics leaves room for alternative interpretations. Our red teaming protocol used identical instructions, attack strategies, and stopping criteria for all model types and sizes, as described in the methods. However, we did not report per-model statistics on conversation length or prompt variants, and red teamers were not blinded to model identity. To address this directly, we will run a post-hoc analysis of the released dataset of 38,961 attacks to compute effort-related metrics (average turns, unique variants attempted) broken down by model type and size, and include the results in the revised manuscript. This addition will test whether the RLHF scaling trend reflects model behavior rather than differences in human effort.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical scaling trends derived from direct measurements

full rationale

The paper reports empirical results from human red-teaming experiments across model scales and types, with the central claim (RLHF models become harder to red-team with scale while others show flat trends) resting on observed attack success rates in the released dataset of 38,961 attacks. No mathematical derivations, fitted parameters renamed as predictions, or self-citations are used to establish the scaling behaviors; the trends follow directly from the collected data without reduction to prior inputs or definitions. Self-citations appear only for background methods and are not load-bearing for the scaling observations. Because the dataset is public, the results can be verified independently.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is empirical and does not introduce new mathematical axioms, free parameters, or invented entities; it builds on standard practices in machine learning and AI safety evaluation.

axioms (1)
  • domain assumption: Human evaluators can reliably identify harmful outputs from language models.
    This assumption underlies the red teaming and data analysis process.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

    cs.CR 2026-05 conditional novelty 8.0

    Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

  2. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  3. The Attribution Contract: Feature Attribution for Generative Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces the Attribution Contract specification to clarify feature attribution claims in generative language models by naming the output explained, eligible features, generative process, fixed elements, and attribut...

  4. Measuring Safety Alignment Effects in Autonomous Security Agents

    cs.CR 2026-05 conditional novelty 7.0

    A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security...

  5. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  6. Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

    cs.CR 2026-05 unverdicted novelty 7.0

    Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

  7. Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery

    cs.CR 2026-05 unverdicted novelty 7.0

    PCAP conditions adversarial searches on attacker personas to raise attack success rates from ~58% to ~97% on large models while increasing prompt diversity.

  8. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  9. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 7.0

    Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

  10. Green Shielding: A User-Centric Approach Towards Trustworthy AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...

  11. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  12. Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

    cs.LG 2026-04 unverdicted novelty 7.0

    Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

  13. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

    cs.CL 2026-04 unverdicted novelty 7.0

    R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

  14. Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

    cs.LG 2026-03 unverdicted novelty 7.0

    Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

  15. M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation

    cs.CY 2026-03 conditional novelty 7.0

    M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.

  16. Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

    cs.LG 2026-02 conditional novelty 7.0

    Direction-flipped influence audits show contextual cues shift LLM moral choices by 12-18 points on average across multiple benchmarks, revealing asymmetries, backfires, and inconsistencies in 40% of conditions.

  17. "Unlimited Realm of Exploration and Experimentation": Methods and Motivations of AI-Generated Sexual Content Creators

    cs.CY 2026-01 conditional novelty 7.0

    Interviews with 28 AIG-SC creators show motivations spanning sexual exploration, creative expression, technical experimentation, and occasional production of non-consensual intimate imagery.

  18. When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

    cs.CR 2025-10 unverdicted novelty 7.0

    CREST-Search is a red-teaming framework that crafts seemingly benign search queries to induce unsafe citations from web-augmented LLMs, backed by a new WebSearch-Harm dataset for fine-tuning a specialized attacker model.

  19. Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary

    cs.LG 2025-09 unverdicted novelty 7.0

    Defines Decision Potential Surface (DPS) whose zero isohypse equals an LLM decision boundary and supplies a K-sample approximation algorithm with derived upper bounds on absolute, expected, and concentration errors.

  20. Collective Recourse for Generative Urban Visualizations

    cs.HC 2025-09 unverdicted novelty 7.0

    Collective recourse formalizes community reports to fix group harms in diffusion models for urban visualizations via a report-triage-fix-verify pipeline, four primitives, a mandate score, and synthetic evaluation of 2...

  21. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

    cs.AI 2024-06 conditional novelty 7.0

    LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

  22. KTO: Model Alignment as Prospect Theoretic Optimization

    cs.LG 2024-02 conditional novelty 7.0

    KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

  23. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

    cs.CL 2023-10 conditional novelty 7.0

    Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.

  24. Towards Measuring the Representation of Subjective Global Opinions in Language Models

    cs.CL 2023-06 conditional novelty 7.0

    LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliab...

  25. Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

    cs.CL 2026-05 conditional novelty 6.0

    LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.

  26. Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    MOOD benchmark shows guard models fail to generalize to OOD alignment failures in LLMs, but combining them with Mahalanobis and perplexity OOD detectors improves recall from 39% to 45% with better scaling than larger ...

  27. Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South

    cs.CY 2026-05 unverdicted novelty 6.0

    A participatory red-teaming project in the Global South created the PLACES dataset of 26k T2I failure examples that reveal unique cultural and linguistic harms missed by existing safety frameworks.

  28. Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

    cs.CR 2026-05 unverdicted novelty 6.0

    AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.

  29. PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

    cs.CL 2026-05 unverdicted novelty 6.0

    Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA ...

  30. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.

  31. Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

    cs.CR 2026-05 unverdicted novelty 6.0

    Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.

  32. Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.

  33. Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, out...

  34. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  35. Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

    cs.LG 2026-05 unverdicted novelty 6.0

    PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.

  36. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  37. Architecture, Not Scale: Circuit Localization in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.

  38. Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

    cs.HC 2026-05 unverdicted novelty 6.0

    A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.

  39. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  40. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 6.0

    PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...

  41. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

    cs.AI 2026-05 unverdicted novelty 6.0

    An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...

  42. From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

    cs.CL 2026-04 unverdicted novelty 6.0

    Paired prompt-response analysis shows 61% of LLM responses reduce harm severity, 36% preserve it, and 3% escalate, with Sexual content showing highest persistence and LLM graders exhibiting detection asymmetry.

  43. From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

    cs.CL 2026-04 unverdicted novelty 6.0

    Paired analysis of 1250 LLM interactions shows 61% of responses de-escalate harm, 36% maintain severity, and 3% escalate, with sexual content persisting far more than other categories.

  44. Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM

    cs.LG 2026-04 unverdicted novelty 6.0

    Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.

  45. Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.

  46. Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

    cs.CY 2026-04 unverdicted novelty 6.0

    Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.

  47. AVISE: Framework for Evaluating the Security of AI Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.

  48. SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.

  49. AlignCultura: Towards Culturally Aligned Large Language Models?

    cs.CL 2026-04 unverdicted novelty 6.0

    Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

  50. Reasoning Structure Matters for Safety Alignment of Reasoning Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

  51. Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

    cs.AI 2026-04 unverdicted novelty 6.0

    Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

  52. Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

    cs.CL 2026-03 unverdicted novelty 6.0

    Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 1...

  53. Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

    cs.RO 2026-03 unverdicted novelty 6.0

    Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.

  54. RLHF May Not Reflect Genuine Preferences

    cs.HC 2026-01 unverdicted novelty 6.0

    RLHF preference measurement is a social science validity problem because annotators routinely produce non-attitudes, constructed responses, and artifacts rather than stable values.

  55. Tournament Informed Adversarial Quality Diversity

    cs.NE 2026-01 unverdicted novelty 6.0

    Tournament-informed task selection in adversarial QD produces higher quality and diversity in coevolved solutions across Pong, cat-and-mouse, and pursuers-evaders games.

  56. Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

    cs.CL 2025-12 unverdicted novelty 6.0

    Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.

  57. Graph-Regularized Sparse Autoencoders for LLM Safety Steering

    cs.LG 2025-12 unverdicted novelty 6.0

    GSAE improves selective refusal on safety benchmarks by smoothing SAE directions over a co-activation graph and applying them via a two-gate controller, outperforming standard SAEs and baselines on Llama-3 and other models.

  58. Evaluating AI Providers' Frontier Safety Frameworks

    cs.CY 2025-12 unverdicted novelty 6.0

    Twelve frontier AI safety frameworks score between 8% and 34% on adapted risk-management criteria, with a median of 18%, leaving them too vague to serve as reliable external accountability mechanisms.

  59. Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

    cs.CL 2025-11 unverdicted novelty 6.0

    EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.

  60. Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

    cs.CL 2025-10 unverdicted novelty 6.0

    Red-Bandit adapts online to LLM failure modes by dynamically selecting among RL-trained LoRA attack-style experts via a bandit policy, reporting SOTA ASR@10 on AdvBench with lower-perplexity prompts.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 98 Pith papers · 12 internal anchors

  1. [1]

    A. Abid, M. Farooqi, and J. Zou. Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6):461–463, June 2021. Number: 6 Publisher: Nature Publishing Group

  2. [2]

    A General Language Assistant as a Laboratory for Alignment

    A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan. A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861 [cs], Dec. 2021. arXi...

  3. [3]

    S. Avin, H. Belfield, M. Brundage, G. Krueger, J. Wang, A. Weller, M. Anderljung, I. Krawczuk, D. Krueger, J. Lebensold, T. Maharaj, and N. Zilberman. Filling gaps in trustworthy development of AI. Science, Dec. 2021. Publisher: American Association for the Advancement of Science

  4. [4]

    Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. ...

  5. [5]

    P. Barrett. Research Highlights | Who Moderates the Social Media Giants? A Call to End Outsourcing - NYU Stern

  6. [6]

    M. Bartolo, T. Thrush, R. Jia, S. Riedel, P. Stenetorp, and D. Kiela. Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8830–8848, 2021. arXiv:2104.08678 [cs]

  7. [7]

    Evaluating the Underlying Gender Bias in Contextualized Word Embeddings

    C. Basta, M. R. Costa-jussà, and N. Casas. Evaluating the Underlying Gender Bias in Contextualized Word Embeddings. arXiv:1904.08783 [cs], Apr. 2019. arXiv: 1904.08783

  8. [8]

    D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1):1–48, 2015

  9. [9]

    E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 610–623, New York, NY, USA, Mar. 2021. Association for Computing Machinery

  10. [10]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

  11. [11]

    Toward trustworthy AI development: Mechanisms for supporting verifiable claims

    M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. B. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askel...

  12. [12]

    B. Buchanan, A. Lohn, M. Musser, and K. Sedova. Truth, Lies, and Automation, May 2021

  13. [13]

    N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel. Extracting Training Data from Large Language Models. arXiv:2012.07805 [cs], June 2021. arXiv: 2012.07805

  14. [14]

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep Reinforcement Learning from Human Preferences. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  15. [15]

    B. Dang, M. J. Riedl, and M. Lease. But Who Protects the Moderators? The Case of Crowdsourced Image Moderation, Jan. 2020. arXiv:1804.10999 [cs]

  16. [16]

    A. Das, B. Dang, and M. Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 8(1):33–42, Oct. 2020

  17. [17]

    E. Diener, D. Wirtz, W. Tov, C. Kim-Prieto, D.-w. Choi, S. Oishi, and R. Biswas-Diener. New Well-being Measures: Short Scales to Assess Flourishing and Positive and Negative Feelings. Social Indicators Research, 97(2):143–156, June 2010

  18. [18]

    E. Dinan, G. Abercrombie, A. S. Bergman, S. Spruit, D. Hovy, Y.-L. Boureau, and V. Rieser. Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling. arXiv:2107.03451 [cs], July

  19. [19]

    E. Dinan, A. Fan, A. Williams, J. Urbanek, D. Kiela, and J. Weston. Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online, Nov. 2020. Association for Computational Linguistics

  20. [20]

    Build it break it fix it for dialogue safety: Robustness from adversarial human attack

    E. Dinan, S. Humeau, B. Chintagunta, and J. Weston. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack, Aug. 2019. arXiv:1908.06083 [cs]

  21. [21]

    L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pages 67–73, New York, NY, USA, Dec. 2018. Association for Computing Machinery

  22. [22]

    Predictability and Surprise in Large Generative Models

    D. Ganguli, D. Hernandez, L. Lovitt, N. DasSarma, T. Henighan, A. Jones, N. Joseph, J. Kernion, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, S. Johnston, S. Kravec, N. Nanda, K. Ndousse, C. Olsson, D. Amodei, D. Amodei, T. Brown, J. Kaplan, S. McCandlish, C. Olah, and J. Clark. Predictabil...

  23. [23]

    S. Garg, V. Perot, N. Limtiaco, A. Taly, E. H. Chi, and A. Beutel. Counterfactual Fairness in Text Classification through Robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, pages 219–226, New York, NY, USA, Jan. 2019. Association for Computing Machinery

  24. [24]

    Datasheets for Datasets

    T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for Datasets. arXiv:1803.09010 [cs], Dec. 2021. arXiv: 1803.09010

  25. [25]

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. ArXiv, abs/2009.11462, 2020

  26. [26]

    M. Gray and S. Suri. Ghost Work. Mariner Books, 2019

  27. [27]

    E. A. Holmes, E. L. James, T. Coode-Bate, and C. Deeprose. Can Playing the Computer Game “Tetris” Reduce the Build-Up of Flashbacks for Trauma? A Proposal from Cognitive Science. PLOS ONE, 4(1):e4153, Jan. 2009. Publisher: Public Library of Science

  28. [28]

    B. Hutchinson, V. Prabhakaran, E. Denton, K. Webster, Y. Zhong, and S. Denuyl. Social Biases in NLP Models as Barriers for Persons with Disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online, July 2020. Association for Computational Linguistics

  29. [29]

    R. Jia and P. Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics

  30. [30]

    Y. Jiang and M. Bansal. Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2726–2736, Florence, Italy, July 2019. Association for Computational Linguistics

  31. [31]

    S. Karunakaran and R. Ramakrishan. Testing Stylistic Interventions to Reduce Emotional Impact of Content Moderation Workers. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7:50–58, Oct. 2019

  32. [32]

    D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams. Dynabench: Rethinking Benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Lin...

  33. [33]

    Measuring Bias in Contextualized Word Representations

    K. Kurita, N. Vyas, A. Pareek, A. W. Black, and Y. Tsvetkov. Measuring Bias in Contextualized Word Representations. arXiv:1906.07337 [cs], June 2019. arXiv: 1906.07337

  34. [34]

    P. P. Liang, C. Wu, L.-P. Morency, and R. Salakhutdinov. Towards Understanding and Mitigating Social Biases in Language Models. arXiv:2106.13219 [cs], June 2021. arXiv: 2106.13219

  35. [35]

    S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958 [cs], Sept. 2021. arXiv: 2109.07958

  36. [36]

    A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. arXiv:2105.03023 [cs], June 2021. arXiv: 2105.03023

  37. [37]

    The radicalization risks of gpt-3 and advanced neural language models

    K. McGuffie and A. Newhouse. The Radicalization Risks of GPT-3 and Advanced Neural Language Models. arXiv:2009.06807 [cs], Sept. 2020. arXiv: 2009.06807

  38. [38]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, Sept. 2020. arXiv:1802.03426 [cs, stat]

  39. [39]

    P. Mishkin, L. Ahmad, M. Brundage, G. Krueger, and G. Sastry. DALL·E 2 Preview - Risks and Limitations, 2022

  40. [40]

    Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online, July 2020. Association for Computational Linguistics

  41. [41]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, Mar

  42. [42]

    arXiv:2203.02155 [cs]

  43. [43]

    Red Teaming Language Models with Language Models

    E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. Red Teaming Language Models with Language Models. arXiv:2202.03286 [cs], Feb. 2022. arXiv: 2202.03286

  44. [44]

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. v. d. Driessche, L. A. Hendricks, M. Rauh, P.-S. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. El...

  45. [45]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents, Apr. 2022. arXiv:2204.06125 [cs]

  46. [46]

    M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online, July 2020. Association for Computational Linguistics

  47. [47]

    P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, H. Margetts, and J. Pierrehumbert. HateCheck: Functional Tests for Hate Speech Detection Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 41–58, Online,...

  48. [48]

    M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y. Choi. Social Bias Frames: Reasoning about Social and Power Implications of Language. arXiv:1911.03891 [cs], Apr. 2020. arXiv: 1911.03891

  49. [49]

    Process for adapting language models to society (PALMS) with values-targeted datasets

    I. Solaiman and C. Dennison. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. arXiv:2106.10328 [cs], Nov. 2021. arXiv: 2106.10328

  50. [50]

    M. Steiger, T. J. Bharucha, S. Venkatagiri, M. J. Riedl, and M. Lease. The Psychological Well-Being of Content Moderators: The Emotional Labor of Commercial Moderation and Avenues for Improving Support. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, pages 1–14, New York, NY, USA, May 2021. Association for Compu...

  51. [51]

    Intriguing properties of neural networks

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks, Feb. 2014. arXiv:1312.6199 [cs]

  52. [52]

    A. Tamkin, M. Brundage, J. Clark, and D. Ganguli. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. arXiv:2102.02503 [cs], Feb. 2021. arXiv: 2102.02503

  53. [53]

    E. R. Thompson. Development and Validation of an Internationally Reliable Short-Form of the Positive and Negative Affect Schedule (PANAS). Journal of Cross-Cultural Psychology, 38(2):227–242, Mar

  54. [54]

    Publisher: SAGE Publications Inc

  55. [55]

    LaMDA: Language Models for Dialog Applications

    R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. Meier-Hellstern, M. R. Morris, T. Do...

  56. [56]

    C. US. U.S. Census Bureau QuickFacts: United States, July 2021

  57. [57]

    E. Wallace, A. Williams, R. Jia, and D. Kiela. Analyzing Dynamic Adversarial Training Data in the Limit, Oct. 2021. arXiv:2110.08514 [cs]

  58. [58]

    D. Watson, L. A. Clark, and A. Tellegen. Development and validation of brief measures of positive and negative affect: the PANAS scales. Journal of Personality and Social Psychology, 54(6):1063–1070, June 1988

  59. [59]

    Ethical and social risks of harm from Language Models

    L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from Language Models. arXiv:2112.04359 [cs], Dec. 20...

  60. [60]

    J. Welbl, A. Glaese, J. Uesato, S. Dathathri, J. Mellor, L. A. Hendricks, K. Anderson, P. Kohli, B. Coppin, and P.-S. Huang. Challenges in Detoxifying Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2447–2469, Punta Cana, Dominican Republic, Nov

  61. [61]

    Association for Computational Linguistics

  62. [62]

    H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. K. Jain. Adversarial Attacks and Defenses in Images, Graphs and Text: A Review, Oct. 2019. arXiv:1909.08072 [cs, stat]

  63. [63]

    J. Xu, D. Ju, M. Li, Y.-L. Boureau, J. Weston, and E. Dinan. Bot-Adversarial Dialogue for Safe Conversational Agents. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computatio...

  64. [64]

    D. M. Ziegler, S. Nix, L. Chan, T. Bauman, P. Schmidt-Nielsen, T. Lin, A. Scherlis, N. Nabeshima, B. Weinstein-Raun, D. de Haas, B. Shlegeris, and N. Thomas. Adversarial Training for High-Stakes Reliability, May 2022. arXiv:2205.01663 [cs]