Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Amanda Askell; Andy Jones; Anna Chen; Ben Mann; Catherine Olsson; Chris Olah; Danny Hernandez; Dario Amodei; Dawn Drain; Deep Ganguli

arxiv: 2209.07858 · v2 · submitted 2022-08-23 · 💻 cs.CL · cs.AI · cs.CY


Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark


Pith reviewed 2026-05-12 01:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY

keywords red teaming · language models · RLHF · scaling behaviors · harmful outputs · safety evaluation · dataset release


The pith

RLHF-trained language models become progressively harder to red-team into harmful outputs as they scale up in size, while other training approaches show no such improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether red teaming—deliberately prompting models to produce harmful responses—reveals different patterns of vulnerability depending on model size and training method. It compares plain language models, prompted helpful-honest-harmless models, rejection-sampling models, and RLHF models across three sizes from 2.7B to 52B parameters. The central result is that only the RLHF models grow harder to attack with scale; success rates for the other three categories remain roughly flat. The authors also release the full set of 38,961 attacks they collected and lay out their exact procedures so others can replicate or improve on the work.

Core claim

Across the tested model sizes and types, RLHF models show a clear increase in resistance to red team attacks as parameter count grows, whereas plain LMs, prompted LMs, and rejection-sampling LMs exhibit flat trends in attack success rate with scale. The work further catalogs a wide range of elicited harms, from overt offensive language to subtler non-violent unethical content, and supplies the complete attack dataset together with detailed methodology for community use.

What carries the argument

Comparative red-teaming success rate measured across four model training regimes (plain LM, prompted HH, rejection sampling, RLHF) at three parameter scales, with the RLHF regime as the variable that produces the observed scaling improvement in resistance.
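The comparison described above amounts to grouping attacks by training regime and parameter count, then computing a success rate per group. A minimal sketch follows, assuming a hypothetical record layout of `(model_type, params_in_billions, attack_succeeded)`; the records are invented for illustration, and the binary success flag is a simplification that may not match how the paper scores attacks.

```python
from collections import defaultdict

# Hypothetical records: (model_type, params_in_billions, attack_succeeded).
# These values are invented for illustration and are not from the paper.
attacks = [
    ("plain_lm", 2.7, True),
    ("plain_lm", 2.7, False),
    ("plain_lm", 52.0, True),
    ("rlhf", 2.7, True),
    ("rlhf", 2.7, False),
    ("rlhf", 52.0, False),
    ("rlhf", 52.0, False),
]


def success_rates(records):
    """Return {(model_type, params): fraction of attacks marked successful}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for model_type, params, succeeded in records:
        key = (model_type, params)
        totals[key] += 1
        wins[key] += int(succeeded)
    return {key: wins[key] / totals[key] for key in totals}


rates = success_rates(attacks)
for key in sorted(rates):
    print(key, round(rates[key], 2))
```

Keying on the pair `(model_type, params)` keeps each cell of the four-by-three design separate, so a scaling trend can be read off within one regime without mixing in the others.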

If this is right

Larger RLHF models will likely need more advanced or automated red-teaming methods to continue uncovering residual harms.
The released attack dataset supplies a public benchmark that future safety methods can be measured against.
Training regimes other than RLHF do not appear to confer the same scaling advantage in resistance to attack.
Transparency in red-teaming procedures enables shared standards for evaluating model safety across labs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the scaling pattern holds, RLHF-style training may provide a practical route for safety to improve alongside raw capability at larger scales.
The flat trends for non-RLHF models suggest that prompt engineering or rejection sampling alone are unlikely to close the safety gap as models grow.
The methods could be extended to test whether similar scaling resistance appears in multimodal or agentic systems trained with comparable feedback.

Load-bearing premise

The particular red-teaming instructions, prompts, and attack strategies used in the study are comprehensive enough to surface most or all of the harmful behaviors these models can exhibit.

What would settle it

A follow-up experiment that applies the same or closely matched attack distribution to a substantially larger RLHF model (for example 100B+ parameters) and measures an attack success rate that does not continue to decline, or that rises, would falsify the reported scaling trend for RLHF.
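One way such a follow-up could be summarized is by fitting a line to success rate against the logarithm of parameter count and checking the sign of the slope. The sketch below is an assumption about how one might do this, not the paper's statistical method; `log_scale_slope` is a hypothetical helper and the data points are invented.

```python
import math


def log_scale_slope(points):
    """Least-squares slope of success rate against log10(parameter count)."""
    xs = [math.log10(params) for params, _ in points]
    ys = [rate for _, rate in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented (params, success_rate) pairs, purely illustrative.
declining = [(2.7e9, 0.30), (13e9, 0.20), (52e9, 0.10)]
flat = [(2.7e9, 0.30), (13e9, 0.30), (52e9, 0.30)]

print("declining slope:", log_scale_slope(declining))
print("flat slope:", log_scale_slope(flat))
```

A negative slope is consistent with models getting harder to attack as they grow; a slope near zero or above it at larger scales would be the kind of evidence that counts against the trend. With only three sizes per regime, any such fit is a rough summary rather than a test.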

Original abstract

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RLHF models get harder to red team with scale while the other types stay flat, plus a released dataset of 39k attacks.

The main thing to know is that this paper finds RLHF models become progressively harder to red team as they scale from 2.7B to 52B parameters, while plain LMs, prompted helpful-honest-harmless versions, and rejection sampling models show flat trends. They also release the full set of 38,961 attacks for others to examine.

That comparison across model types and sizes is the clearest new empirical piece here. The transparency about their red teaming instructions, statistical methods, and remaining uncertainties is useful in an area where practices are still settling. Releasing the data lets the community check the range of harmful outputs they surfaced, from overt offenses to subtler unethical cases. That combination of scaling observation and open resource is what makes the work worth attention.

The soft spot is the one the stress-test note raises. Red teaming relies on human attackers who were not blinded to model type, so effort, strategy, or stopping rules could have shifted depending on what they saw. The paper describes its processes in detail, but without reported per-model metrics like turns per attack or fixed prompt templates, the RLHF scaling advantage could partly reflect differences in how the attacks were generated rather than model properties alone. The dataset allows later checks, but it does not retroactively fix collection-time variation.

This is for alignment and safety researchers who want concrete scaling data or a starting corpus of attacks. It is not a finished method but gives usable observations and a resource. I would send it for peer review. The empirical comparison and data release are substantial enough to justify referee time, even if tighter controls on attack effort would strengthen the scaling claim.

Referee Report

1 major / 2 minor

Summary. The paper describes early efforts to red team language models to discover, measure, and reduce harmful outputs. It makes three contributions: (1) an investigation of scaling behaviors for red teaming success across three model sizes (2.7B, 13B, 52B) and four model types (plain LM, prompted helpful/honest/harmless, rejection sampling, and RLHF), finding that RLHF models become increasingly difficult to red team with scale while the other types show flat trends; (2) release of a dataset containing 38,961 red team attacks; and (3) detailed descriptions of instructions, processes, statistical methodologies, and uncertainties, along with analysis of the harmful outputs elicited (ranging from offensive language to subtle unethical behaviors).

Significance. If the scaling trends prove robust, the work supplies concrete empirical data on how alignment methods like RLHF affect vulnerability to adversarial elicitation of harms, informing safer deployment of larger models. The public release of the large attack dataset is a clear asset that enables independent verification and further research on red teaming techniques. The paper's emphasis on methodological transparency and explicit discussion of uncertainties is a positive contribution toward community standards in AI safety evaluation.

major comments (1)

[Scaling behaviors] Scaling behaviors section (and abstract claim): The central result that RLHF models are increasingly difficult to red team with scale, while other model types remain flat, assumes consistent red teaming effort and strategy across conditions. The manuscript does not report per-model metrics on attack persistence (e.g., average turns per conversation, number of unique prompt variants tried, or stopping criteria) or indicate whether red teamers were blinded to model identity or type. Without such controls, lower success rates on larger RLHF models could reflect differences in human effort or adaptation rather than intrinsic scaling of refusal behavior. Although the released dataset permits post-hoc checks, the paper should include an analysis of effort-related statistics across the four model types to support the scaling interpretation.
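The effort statistics the referee asks for could be computed along these lines. This is a minimal sketch assuming a hypothetical conversation schema (a `model_type` label plus a list of `(speaker, text)` turns); the schema and contents are invented and do not describe the released dataset's actual format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical conversations; the schema and contents are invented for illustration.
conversations = [
    {"model_type": "plain_lm", "turns": [("human", "q1"), ("assistant", "a1")]},
    {"model_type": "rlhf", "turns": [("human", "q1"), ("assistant", "a1"),
                                      ("human", "q2"), ("assistant", "a2")]},
    {"model_type": "rlhf", "turns": [("human", "q1"), ("assistant", "a1")]},
]


def mean_human_turns(convs):
    """Average number of red-teamer (human) turns per conversation, by model type."""
    counts = defaultdict(list)
    for conv in convs:
        n_human = sum(1 for speaker, _ in conv["turns"] if speaker == "human")
        counts[conv["model_type"]].append(n_human)
    return {model_type: mean(values) for model_type, values in counts.items()}


effort = mean_human_turns(conversations)
print(effort)
```

If average human turns differed sharply between model types, that would suggest attack effort varied across conditions and should be controlled for before attributing the trend to the models themselves.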

minor comments (2)

[Methods] Methods section: While the paper states it exhaustively describes statistical methodologies, adding explicit formulas or pseudocode for how red team success rates and uncertainty estimates were computed (including any adjustments for multiple comparisons across model sizes) would improve reproducibility.
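As an illustration of the kind of explicit formula this comment requests, the sketch below computes a Wilson score interval for a binomial success rate. The Wilson interval is a standard choice swapped in here for illustration; it is not confirmed to be the uncertainty estimate the paper uses, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    )
    return center - half_width, center + half_width


# Invented counts: 50 successful attacks out of 100 attempts.
low, high = wilson_interval(50, 100)
print(round(low, 4), round(high, 4))
```

Unlike the simpler normal approximation, the Wilson interval stays within [0, 1] and behaves sensibly when the observed rate is near zero, which matters if large RLHF models are rarely attacked successfully. A correction for multiple comparisons across model sizes would be a separate step not shown here.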
[Dataset] Dataset description: The release of 38,961 attacks is valuable, but the paper would benefit from additional metadata on red teamer demographics, experience levels, and any training provided, to allow readers to assess potential sources of bias in the attack distribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the paper's contributions to red teaming methods and the public dataset release. We address the major comment below.


Referee: [Scaling behaviors] Scaling behaviors section (and abstract claim): The central result that RLHF models are increasingly difficult to red team with scale, while other model types remain flat, assumes consistent red teaming effort and strategy across conditions. The manuscript does not report per-model metrics on attack persistence (e.g., average turns per conversation, number of unique prompt variants tried, or stopping criteria) or indicate whether red teamers were blinded to model identity or type. Without such controls, lower success rates on larger RLHF models could reflect differences in human effort or adaptation rather than intrinsic scaling of refusal behavior. Although the released dataset permits post-hoc checks, the paper should include an analysis of effort-related statistics across the four model types to support the scaling interpretation.

Authors: We thank the referee for highlighting this important potential confound in our scaling analysis. We agree that the absence of reported effort metrics leaves room for alternative interpretations. Our red teaming protocol used identical instructions, attack strategies, and stopping criteria for all model types and sizes, as described in the methods. However, we did not report per-model statistics on conversation length or prompt variants, and red teamers were not blinded to model identity. To address this directly, we will perform a post-hoc analysis of the released dataset of 38,961 attacks to compute effort-related metrics (average turns, unique variants attempted) broken down by model type and size, and include these results in the revised manuscript. This addition will help establish whether the RLHF scaling trend reflects model behavior rather than differences in human effort.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical scaling trends derived from direct measurements


The paper reports empirical results from human red-teaming experiments across model scales and types, with the central claim (RLHF models become harder to red-team with scale while others show flat trends) resting on observed attack success rates in the released dataset of 38,961 attacks. No mathematical derivations, fitted parameters renamed as predictions, or self-citations are used to establish the scaling behaviors; the trends follow directly from the collected data without reduction to prior inputs or definitions. Self-citations appear only for background methods and are not load-bearing for the scaling observations. The work is self-contained against external benchmarks via the public dataset, which permits independent verification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is empirical and does not introduce new mathematical axioms, free parameters, or invented entities; it builds on standard practices in machine learning and AI safety evaluation.

axioms (1)

domain assumption: Human evaluators can reliably identify harmful outputs from language models.
This assumption underlies the red teaming and data analysis process.

pith-pipeline@v0.9.0 · 5668 in / 1224 out tokens · 76750 ms · 2026-05-12T01:32:45.315412+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
cs.CR 2026-05 conditional novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
cs.AR 2026-05 conditional novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
The Attribution Contract: Feature Attribution for Generative Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Introduces the Attribution Contract specification to clarify feature attribution claims in generative language models by naming the output explained, eligible features, generative process, fixed elements, and attribut...
Measuring Safety Alignment Effects in Autonomous Security Agents
cs.CR 2026-05 conditional novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security...
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
cs.CR 2026-05 unverdicted novelty 7.0

Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery
cs.CR 2026-05 unverdicted novelty 7.0

PCAP conditions adversarial searches on attacker personas to raise attack success rates from ~58% to ~97% on large models while increasing prompt diversity.
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
cs.LG 2026-05 unverdicted novelty 7.0

DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
cs.HC 2026-05 unverdicted novelty 7.0

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
Green Shielding: A User-Centric Approach Towards Trustworthy AI
cs.CL 2026-04 unverdicted novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
cs.LG 2026-04 unverdicted novelty 7.0

Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
cs.CL 2026-04 unverdicted novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
cs.LG 2026-03 unverdicted novelty 7.0

Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation
cs.CY 2026-03 conditional novelty 7.0

M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.
Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs
cs.LG 2026-02 conditional novelty 7.0

Direction-flipped influence audits show contextual cues shift LLM moral choices by 12-18 points on average across multiple benchmarks, revealing asymmetries, backfires, and inconsistencies in 40% of conditions.
"Unlimited Realm of Exploration and Experimentation": Methods and Motivations of AI-Generated Sexual Content Creators
cs.CY 2026-01 conditional novelty 7.0

Interviews with 28 AIG-SC creators show motivations spanning sexual exploration, creative expression, technical experimentation, and occasional production of non-consensual intimate imagery.
When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models
cs.CR 2025-10 unverdicted novelty 7.0

CREST-Search is a red-teaming framework that crafts seemingly benign search queries to induce unsafe citations from web-augmented LLMs, backed by a new WebSearch-Harm dataset for fine-tuning a specialized attacker model.
Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary
cs.LG 2025-09 unverdicted novelty 7.0

Defines Decision Potential Surface (DPS) whose zero isohypse equals an LLM decision boundary and supplies a K-sample approximation algorithm with derived upper bounds on absolute, expected, and concentration errors.
Collective Recourse for Generative Urban Visualizations
cs.HC 2025-09 unverdicted novelty 7.0

Collective recourse formalizes community reports to fix group harms in diffusion models for urban visualizations via a report-triage-fix-verify pipeline, four primitives, a mandate score, and synthetic evaluation of 2...
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
cs.AI 2024-06 conditional novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
KTO: Model Alignment as Prospect Theoretic Optimization
cs.LG 2024-02 conditional novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
cs.CL 2023-10 conditional novelty 7.0

Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
Towards Measuring the Representation of Subjective Global Opinions in Language Models
cs.CL 2023-06 conditional novelty 7.0

LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliab...
Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
cs.CL 2026-05 conditional novelty 6.0

LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
cs.AI 2026-05 unverdicted novelty 6.0

MOOD benchmark shows guard models fail to generalize to OOD alignment failures in LLMs, but combining them with Mahalanobis and perplexity OOD detectors improves recall from 39% to 45% with better scaling than larger ...
Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South
cs.CY 2026-05 unverdicted novelty 6.0

A participatory red-teaming project in the Global South created the PLACES dataset of 26k T2I failure examples that reveal unique cultural and linguistic harms missed by existing safety frameworks.
Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models
cs.CR 2026-05 unverdicted novelty 6.0

AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.
PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
cs.CL 2026-05 unverdicted novelty 6.0

Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA ...
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
cs.AI 2026-05 unverdicted novelty 6.0

Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs
cs.CR 2026-05 unverdicted novelty 6.0

Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, out...
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 6.0

TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
cs.LG 2026-05 unverdicted novelty 6.0

PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
Architecture, Not Scale: Circuit Localization in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
cs.HC 2026-05 unverdicted novelty 6.0

A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
Response Time Enhances Alignment with Heterogeneous Preferences
cs.LG 2026-05 unverdicted novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
cs.HC 2026-05 unverdicted novelty 6.0

PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
cs.AI 2026-05 unverdicted novelty 6.0

An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...
From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
cs.CL 2026-04 unverdicted novelty 6.0

Paired prompt-response analysis shows 61% of LLM responses reduce harm severity, 36% preserve it, and 3% escalate, with Sexual content showing highest persistence and LLM graders exhibiting detection asymmetry.
From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
cs.CL 2026-04 unverdicted novelty 6.0

Paired analysis of 1250 LLM interactions shows 61% of responses de-escalate harm, 36% maintain severity, and 3% escalate, with sexual content persisting far more than other categories.
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
cs.LG 2026-04 unverdicted novelty 6.0

Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
cs.CR 2026-04 unverdicted novelty 6.0

Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.
Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles
cs.CY 2026-04 unverdicted novelty 6.0

Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.
AVISE: Framework for Evaluating the Security of AI Systems
cs.CR 2026-04 unverdicted novelty 6.0

AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
cs.CR 2026-04 unverdicted novelty 6.0

SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
AlignCultura: Towards Culturally Aligned Large Language Models?
cs.CL 2026-04 unverdicted novelty 6.0

Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
Reasoning Structure Matters for Safety Alignment of Reasoning Models
cs.AI 2026-04 unverdicted novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
cs.AI 2026-04 unverdicted novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI
cs.CL 2026-03 unverdicted novelty 6.0

Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 1...
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
cs.RO 2026-03 unverdicted novelty 6.0

Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.
RLHF May Not Reflect Genuine Preferences
cs.HC 2026-01 unverdicted novelty 6.0

RLHF preference measurement is a social science validity problem because annotators routinely produce non-attitudes, constructed responses, and artifacts rather than stable values.
Tournament Informed Adversarial Quality Diversity
cs.NE 2026-01 unverdicted novelty 6.0

Tournament-informed task selection in adversarial QD produces higher quality and diversity in coevolved solutions across Pong, cat-and-mouse, and pursuers-evaders games.
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
cs.CL 2025-12 unverdicted novelty 6.0

Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
Graph-Regularized Sparse Autoencoders for LLM Safety Steering
cs.LG 2025-12 unverdicted novelty 6.0

GSAE improves selective refusal on safety benchmarks by smoothing SAE directions over a co-activation graph and applying them via a two-gate controller, outperforming standard SAEs and baselines on Llama-3 and other models.
Evaluating AI Providers' Frontier Safety Frameworks
cs.CY 2025-12 unverdicted novelty 6.0

Twelve frontier AI safety frameworks score between 8% and 34% on adapted risk-management criteria, with a median of 18%, leaving them too vague to serve as reliable external accountability mechanisms.
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
cs.CL 2025-11 unverdicted novelty 6.0

EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts
cs.CL 2025-10 unverdicted novelty 6.0

Red-Bandit adapts online to LLM failure modes by dynamically selecting among RL-trained LoRA attack-style experts via a bandit policy, reporting SOTA ASR@10 on AdvBench with lower-perplexity prompts.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 98 Pith papers · 12 internal anchors

[1]

A. Abid, M. Farooqi, and J. Zou. Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6):461–463, June 2021. Number: 6 Publisher: Nature Publishing Group

[2]

A General Language Assistant as a Laboratory for Alignment

A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan. A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861 [cs], Dec. 2021. arXi...

[3]

S. Avin, H. Belfield, M. Brundage, G. Krueger, J. Wang, A. Weller, M. Anderljung, I. Krawczuk, D. Krueger, J. Lebensold, T. Maharaj, and N. Zilberman. Filling gaps in trustworthy development of AI. Science, Dec. 2021. Publisher: American Association for the Advancement of Science

[4]

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. ...

[5]

P. Barrett. Research Highlights | Who Moderates the Social Media Giants? A Call to End Outsourcing - NYU Stern

[6]

M. Bartolo, T. Thrush, R. Jia, S. Riedel, P. Stenetorp, and D. Kiela. Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8830–8848, 2021. arXiv:2104.08678 [cs]

[7]

Evaluating the Underlying Gender Bias in Contextualized Word Embeddings

C. Basta, M. R. Costa-jussà, and N. Casas. Evaluating the Underlying Gender Bias in Contextualized Word Embeddings. arXiv:1904.08783 [cs], Apr. 2019. arXiv: 1904.08783

[8]

D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1):1–48, 2015

[9]

E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 610–623, New York, NY, USA, Mar. 2021. Association for Computing Machinery

[10]

On the Opportunities and Risks of Foundation Models

R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

[11]

Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims

M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. B. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askel...

[12]

B. Buchanan, A. Lohn, M. Musser, and K. Sedova. Truth, Lies, and Automation, May 2021

[13]

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel. Extracting Training Data from Large Language Models. arXiv:2012.07805 [cs], June 2021. arXiv: 2012.07805

[14]

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep Reinforcement Learning from Human Preferences. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

[15]

B. Dang, M. J. Riedl, and M. Lease. But Who Protects the Moderators? The Case of Crowdsourced Image Moderation, Jan. 2020. arXiv:1804.10999 [cs]

[16]

A. Das, B. Dang, and M. Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 8(1):33–42, Oct. 2020

[17]

E. Diener, D. Wirtz, W. Tov, C. Kim-Prieto, D.-w. Choi, S. Oishi, and R. Biswas-Diener. New Well-being Measures: Short Scales to Assess Flourishing and Positive and Negative Feelings. Social Indicators Research, 97(2):143–156, June 2010

[18]

E. Dinan, G. Abercrombie, A. S. Bergman, S. Spruit, D. Hovy, Y.-L. Boureau, and V. Rieser. Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling. arXiv:2107.03451 [cs], July

[19]

E. Dinan, A. Fan, A. Williams, J. Urbanek, D. Kiela, and J. Weston. Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online, Nov. 2020. Association for Computational Linguistics

[20]

Build it break it fix it for dialogue safety: Robustness from adversarial human attack

E. Dinan, S. Humeau, B. Chintagunta, and J. Weston. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack, Aug. 2019. arXiv:1908.06083 [cs]

[21]

L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pages 67–73, New York, NY, USA, Dec. 2018. Association for Computing Machinery

[22]

Predictability and Surprise in Large Generative Models

D. Ganguli, D. Hernandez, L. Lovitt, N. DasSarma, T. Henighan, A. Jones, N. Joseph, J. Kernion, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, S. Johnston, S. Kravec, N. Nanda, K. Ndousse, C. Olsson, D. Amodei, D. Amodei, T. Brown, J. Kaplan, S. McCandlish, C. Olah, and J. Clark. Predictabil...

[23]

S. Garg, V. Perot, N. Limtiaco, A. Taly, E. H. Chi, and A. Beutel. Counterfactual Fairness in Text Classification through Robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, pages 219–226, New York, NY, USA, Jan. 2019. Association for Computing Machinery

[24]

Datasheets for Datasets

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for Datasets. arXiv:1803.09010 [cs], Dec. 2021. arXiv: 1803.09010

[25]

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. ArXiv, abs/2009.11462, 2020

[26]

M. Gray and S. Suri. Ghost Work. Mariner Books, 2019.

[27]

E. A. Holmes, E. L. James, T. Coode-Bate, and C. Deeprose. Can Playing the Computer Game “Tetris” Reduce the Build-Up of Flashbacks for Trauma? A Proposal from Cognitive Science. PLOS ONE, 4(1):e4153, Jan. 2009. Publisher: Public Library of Science

[28]

B. Hutchinson, V. Prabhakaran, E. Denton, K. Webster, Y. Zhong, and S. Denuyl. Social Biases in NLP Models as Barriers for Persons with Disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online, July 2020. Association for Computational Linguistics

[29]

R. Jia and P. Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics

[30]

Y. Jiang and M. Bansal. Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2726–2736, Florence, Italy, July 2019. Association for Computational Linguistics

[31]

S. Karunakaran and R. Ramakrishan. Testing Stylistic Interventions to Reduce Emotional Impact of Content Moderation Workers. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7:50–58, Oct. 2019

[32]

D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams. Dynabench: Rethinking Benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Lin...

[33]

Measuring Bias in Contextualized Word Representations

K. Kurita, N. Vyas, A. Pareek, A. W. Black, and Y. Tsvetkov. Measuring Bias in Contextualized Word Representations. arXiv:1906.07337 [cs], June 2019. arXiv: 1906.07337

[34]

P. P. Liang, C. Wu, L.-P. Morency, and R. Salakhutdinov. Towards Understanding and Mitigating Social Biases in Language Models. arXiv:2106.13219 [cs], June 2021. arXiv: 2106.13219

[35]

S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958 [cs], Sept. 2021. arXiv: 2109.07958

[36]

A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. arXiv:2105.03023 [cs], June 2021. arXiv: 2105.03023

[37]

The radicalization risks of gpt-3 and advanced neural language models

K. McGuffie and A. Newhouse. The Radicalization Risks of GPT-3 and Advanced Neural Language Models. arXiv:2009.06807 [cs], Sept. 2020. arXiv: 2009.06807

[38]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, Sept. 2020. arXiv:1802.03426 [cs, stat]

[39]

P. Mishkin, L. Ahmad, M. Brundage, G. Krueger, and G. Sastry. DALL·E 2 Preview - Risks and Limitations, 2022

[40]

Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online, July 2020. Association for Computational Linguistics

[41]

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, Mar. 2022. arXiv:2203.02155 [cs]

[43]

Red Teaming Language Models with Language Models

E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. Red Teaming Language Models with Language Models. arXiv:2202.03286 [cs], Feb. 2022. arXiv: 2202.03286

[44]

J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. v. d. Driessche, L. A. Hendricks, M. Rauh, P.-S. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. El...

[45]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents, Apr. 2022. arXiv:2204.06125 [cs]

[46]

M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online, July 2020. Association for Computational Linguistics

[47]

P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, H. Margetts, and J. Pierrehumbert. HateCheck: Functional Tests for Hate Speech Detection Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 41–58, Online,...

[48]

M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y. Choi. Social Bias Frames: Reasoning about Social and Power Implications of Language. arXiv:1911.03891 [cs], Apr. 2020. arXiv: 1911.03891

[49]

Process for adapting language models to society (PALMS) with values-targeted datasets

I. Solaiman and C. Dennison. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. arXiv:2106.10328 [cs], Nov. 2021. arXiv: 2106.10328

[50]

M. Steiger, T. J. Bharucha, S. Venkatagiri, M. J. Riedl, and M. Lease. The Psychological Well-Being of Content Moderators: The Emotional Labor of Commercial Moderation and Avenues for Improving Support. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, pages 1–14, New York, NY, USA, May 2021. Association for Compu...

[51]

Intriguing properties of neural networks

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks, Feb. 2014. arXiv:1312.6199 [cs]

[52]

A. Tamkin, M. Brundage, J. Clark, and D. Ganguli. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. arXiv:2102.02503 [cs], Feb. 2021. arXiv: 2102.02503

[53]

E. R. Thompson. Development and Validation of an Internationally Reliable Short-Form of the Positive and Negative Affect Schedule (PANAS). Journal of Cross-Cultural Psychology, 38(2):227–242, Mar. 2007. Publisher: SAGE Publications Inc

[55]

LaMDA: Language Models for Dialog Applications

R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. Meier-Hellstern, M. R. Morris, T. Do...

[56]

C. US. U.S. Census Bureau QuickFacts: United States, July 2021

[57]

E. Wallace, A. Williams, R. Jia, and D. Kiela. Analyzing Dynamic Adversarial Training Data in the Limit, Oct. 2021. arXiv:2110.08514 [cs]

[58]

D. Watson, L. A. Clark, and A. Tellegen. Development and validation of brief measures of positive and negative affect: the PANAS scales. Journal of Personality and Social Psychology, 54(6):1063–1070, June 1988

[59]

Ethical and social risks of harm from Language Models

L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from Language Models. arXiv:2112.04359 [cs], Dec. 20...

[60]

J. Welbl, A. Glaese, J. Uesato, S. Dathathri, J. Mellor, L. A. Hendricks, K. Anderson, P. Kohli, B. Coppin, and P.-S. Huang. Challenges in Detoxifying Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2447–2469, Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics

[62]

H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. K. Jain. Adversarial Attacks and Defenses in Images, Graphs and Text: A Review, Oct. 2019. arXiv:1909.08072 [cs, stat].

[63]

J. Xu, D. Ju, M. Li, Y.-L. Boureau, J. Weston, and E. Dinan. Bot-Adversarial Dialogue for Safe Conversational Agents. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computatio...

[64]

D. M. Ziegler, S. Nix, L. Chan, T. Bauman, P. Schmidt-Nielsen, T. Lin, A. Scherlis, N. Nabeshima, B. Weinstein-Raun, D. de Haas, B. Shlegeris, and N. Thomas. Adversarial Training for High-Stakes Reliability, May 2022. arXiv:2205.01663 [cs].


[1] [1]

A. Abid, M. Farooqi, and J. Zou. Large language models associate Muslims with violence. Nature Machine Intelligence, 3(6):461–463, June 2021. Number: 6 Publisher: Nature Publishing Group

work page 2021

[2] [2]

A General Language Assistant as a Laboratory for Alignment

A. Askell, Y . Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- Sarma, N. Elhage, Z. Hatﬁeld-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan. A General Language Assistant as a Labora- tory for Alignment. arXiv:2112.00861 [cs], Dec. 2021. arXi...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

S. Avin, H. Belﬁeld, M. Brundage, G. Krueger, J. Wang, A. Weller, M. Anderljung, I. Krawczuk, D. Krueger, J. Lebensold, T. Maharaj, and N. Zilberman. Filling gaps in trustworthy development of AI. Science, Dec. 2021. Publisher: American Association for the Advancement of Science

work page 2021

[4] [4]

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatﬁeld- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. ...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

P. Barrett. Research Highlights | Who Moderates the Social Media Giants? A Call to End Outsourcing - NYU Stern

work page

[6] [6]

Bartolo, T

M. Bartolo, T. Thrush, R. Jia, S. Riedel, P. Stenetorp, and D. Kiela. Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages 8830–8848, 2021. arXiv:2104.08678 [cs]

work page arXiv 2021

[7] [7]

Evaluating the Underlying Gender Bias in Contextualized Word Embeddings

C. Basta, M. R. Costa-jussà, and N. Casas. Evaluating the Underlying Gender Bias in Contextualized Word Embeddings. arXiv:1904.08783 [cs], Apr. 2019. arXiv: 1904.08783

work page Pith review arXiv 1904

[8] [8]

Bates, M

D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1):1–48, 2015

work page 2015

[9] [9]

E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? InProceedings of the 2021 ACM Conference on Fairness, Account- ability, and Transparency, FAccT ’21, pages 610–623, New York, NY , USA, Mar. 2021. Association for Computing Machinery

work page 2021

[10] [10]

On the Opportunities and Risks of Foundation Models

R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[11] [11]

TowardtrustworthyAI development: Mechanisms for supporting verifiable claims,

M. Brundage, S. Avin, J. Wang, H. Belﬁeld, G. Krueger, G. Hadﬁeld, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. B. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askel...

work page arXiv 2004

[12] [12]

Buchanan, A

B. Buchanan, A. Lohn, M. Musser, and K. Sedova. Truth, Lies, and Automation, May 2021

work page 2021

[13] [13]

Brown, Dawn Song, Úlfar Er- lingsson, Alina Oprea, and Colin Raffel

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-V oss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel. Extracting Training Data from Large Language Models. arXiv:2012.07805 [cs], June 2021. arXiv: 2012.07805

work page arXiv 2012

[14] [14]

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep Reinforcement Learning from Human Preferences. In I. Guyon, U. V . Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish- wanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017

work page 2017

[15] [15]

B. Dang, M. J. Riedl, and M. Lease. But Who Protects the Moderators? The Case of Crowdsourced Image Moderation, Jan. 2020. arXiv:1804.10999 [cs]

work page arXiv 2020

[16] [16]

A. Das, B. Dang, and M. Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 8(1):33–42, Oct. 2020

work page 2020

[17] [17]

Diener, D

E. Diener, D. Wirtz, W. Tov, C. Kim-Prieto, D.-w. Choi, S. Oishi, and R. Biswas-Diener. New Well- being Measures: Short Scales to Assess Flourishing and Positive and Negative Feelings. Social Indica- tors Research, 97(2):143–156, June 2010

work page 2010

[18] [18]

Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau, and Verena Rieser

E. Dinan, G. Abercrombie, A. S. Bergman, S. Spruit, D. Hovy, Y .-L. Boureau, and V . Rieser. Antici- pating Safety Issues in E2E Conversational AI: Framework and Tooling. arXiv:2107.03451 [cs], July

work page arXiv

[19] [19]

Dinan, A

E. Dinan, A. Fan, A. Williams, J. Urbanek, D. Kiela, and J. Weston. Queens are Powerful too: Mit- igating Gender Bias in Dialogue Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online, Nov. 2020. Association for Computational Linguistics

work page 2020

[20] [20]

Build it break it fix it for dialogue safety: Robustness from adversarial human attack

E. Dinan, S. Humeau, B. Chintagunta, and J. Weston. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack, Aug. 2019. arXiv:1908.06083 [cs]

work page arXiv 2019

[21] [21]

Dixon, J

L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and Mitigating Unintended Bias in Text Classiﬁcation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society , AIES ’18, pages 67–73, New York, NY , USA, Dec. 2018. Association for Computing Machinery

work page 2018

[22] [22]

Predictability and Surprise in Large Generative Models , url =

D. Ganguli, D. Hernandez, L. Lovitt, N. DasSarma, T. Henighan, A. Jones, N. Joseph, J. Kernion, B. Mann, A. Askell, Y . Bai, A. Chen, T. Conerly, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatﬁeld- Dodds, S. Johnston, S. Kravec, N. Nanda, K. Ndousse, C. Olsson, D. Amodei, D. Amodei, T. Brown, J. Kaplan, S. McCandlish, C. Olah, and J. Clark. Predictabil...

work page arXiv 2022

[23] [23]

S. Garg, V . Perot, N. Limtiaco, A. Taly, E. H. Chi, and A. Beutel. Counterfactual Fairness in Text Classiﬁcation through Robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, pages 219–226, New York, NY , USA, Jan. 2019. Association for Computing Machinery

work page 2019

[24] [24]

Datasheets for Datasets

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for Datasets. arXiv:1803.09010 [cs], Dec. 2021. arXiv: 1803.09010

work page arXiv 2021

[25] [25]

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

S. Gehman, S. Gururangan, M. Sap, Y . Choi, and N. A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. ArXiv, abs/2009.11462, 2020

work page internal anchor Pith review arXiv 2009

[26] [26]

Gray and S

M. Gray and S. Suri. Ghost Work. Mariner Books, 2019. 27

work page 2019

[27] [27]

E. A. Holmes, E. L. James, T. Coode-Bate, and C. Deeprose. Can Playing the Computer Game “Tetris” Reduce the Build-Up of Flashbacks for Trauma? A Proposal from Cognitive Science. PLOS ONE, 4(1):e4153, Jan. 2009. Publisher: Public Library of Science

work page 2009

[28] [28]

Hutchinson, V

B. Hutchinson, V . Prabhakaran, E. Denton, K. Webster, Y . Zhong, and S. Denuyl. Social Biases in NLP Models as Barriers for Persons with Disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 5491–5501, Online, July 2020. Association for Computational Linguistics

work page 2020

[29] [29]

Jia and P

R. Jia and P. Liang. Adversarial Examples for Evaluating Reading Comprehension Systems. InProceed- ings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics

work page 2017

[30] [30]

Jiang and M

Y . Jiang and M. Bansal. Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2726–2736, Florence, Italy, July 2019. Association for Computational Linguistics

work page 2019

[31] [31]

Karunakaran and R

S. Karunakaran and R. Ramakrishan. Testing Stylistic Interventions to Reduce Emotional Impact of Content Moderation Workers.Proceedings of the AAAI Conference on Human Computation and Crowd- sourcing, 7:50–58, Oct. 2019

work page 2019

[32] [32]

Kiela, M

D. Kiela, M. Bartolo, Y . Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams. Dynabench: Rethinking Benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Lin...

work page 2021

[33] [33]

Measuring Bias in Contextualized Word Representations

K. Kurita, N. Vyas, A. Pareek, A. W. Black, and Y . Tsvetkov. Measuring Bias in Contextualized Word Representations. arXiv:1906.07337 [cs], June 2019. arXiv: 1906.07337

work page Pith review arXiv 1906

[34] [34]

P. P. Liang, C. Wu, L.-P. Morency, and R. Salakhutdinov. Towards Understanding and Mitigating Social Biases in Language Models. arXiv:2106.13219 [cs], June 2021. arXiv: 2106.13219

work page arXiv 2021

[35] [35]

S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958 [cs], Sept. 2021. arXiv: 2109.07958

work page internal anchor Pith review Pith/arXiv arXiv 2021

[36] [36]

A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y . Choi. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. arXiv:2105.03023 [cs], June 2021. arXiv: 2105.03023

work page arXiv 2021

[37] [37]

The radicalization risks of gpt-3 and advanced neural language models

K. McGufﬁe and A. Newhouse. The Radicalization Risks of GPT-3 and Advanced Neural Language Models. arXiv:2009.06807 [cs], Sept. 2020. arXiv: 2009.06807

work page arXiv 2009

[38] [38]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, Sept. 2020. arXiv:1802.03426 [cs, stat]

work page internal anchor Pith review Pith/arXiv arXiv 2020

[39] [39]

Mishkin, L

P. Mishkin, L. Ahmad, M. Brundage, G. Krueger, and G. Sastry. DALL·E 2 Preview - Risks and Limitations, 2022

work page 2022

[40] [40]

Y . Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 4885–4901, Online, July 2020. Association for Computational Linguistics

work page 2020

[41]

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, Mar. 2022. arXiv:2203.02155 [cs]

[43]

E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. Red Teaming Language Models with Language Models. arXiv:2202.03286 [cs], Feb. 2022. arXiv: 2202.03286

[44]

J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. v. d. Driessche, L. A. Hendricks, M. Rauh, P.-S. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. El...

[45]

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents, Apr. 2022. arXiv:2204.06125 [cs]

[46]

M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online, July 2020. Association for Computational Linguistics

[47]

P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, H. Margetts, and J. Pierrehumbert. HateCheck: Functional Tests for Hate Speech Detection Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 41–58, Online,...

[48]

M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y. Choi. Social Bias Frames: Reasoning about Social and Power Implications of Language. arXiv:1911.03891 [cs], Apr. 2020. arXiv: 1911.03891

[49]

I. Solaiman and C. Dennison. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. arXiv:2106.10328 [cs], Nov. 2021. arXiv: 2106.10328

[50]

M. Steiger, T. J. Bharucha, S. Venkatagiri, M. J. Riedl, and M. Lease. The Psychological Well-Being of Content Moderators: The Emotional Labor of Commercial Moderation and Avenues for Improving Support. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, pages 1–14, New York, NY, USA, May 2021. Association for Compu...

[51]

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks, Feb. 2014. arXiv:1312.6199 [cs]

[52]

A. Tamkin, M. Brundage, J. Clark, and D. Ganguli. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. arXiv:2102.02503 [cs], Feb. 2021. arXiv: 2102.02503

[53]

E. R. Thompson. Development and Validation of an Internationally Reliable Short-Form of the Positive and Negative Affect Schedule (PANAS). Journal of Cross-Cultural Psychology, 38(2):227–242, Mar. 2007. Publisher: SAGE Publications Inc

[55]

R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. Meier-Hellstern, M. R. Morris, T. Do...

[56]

C. US. U.S. Census Bureau QuickFacts: United States, July 2021

[57]

E. Wallace, A. Williams, R. Jia, and D. Kiela. Analyzing Dynamic Adversarial Training Data in the Limit, Oct. 2021. arXiv:2110.08514 [cs]

[58]

D. Watson, L. A. Clark, and A. Tellegen. Development and validation of brief measures of positive and negative affect: the PANAS scales. Journal of Personality and Social Psychology, 54(6):1063–1070, June 1988

[59]

L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from Language Models. arXiv:2112.04359 [cs], Dec. 20...

[60]

J. Welbl, A. Glaese, J. Uesato, S. Dathathri, J. Mellor, L. A. Hendricks, K. Anderson, P. Kohli, B. Coppin, and P.-S. Huang. Challenges in Detoxifying Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2447–2469, Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics

[62]

H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. K. Jain. Adversarial Attacks and Defenses in Images, Graphs and Text: A Review, Oct. 2019. arXiv:1909.08072 [cs, stat]

[63]

J. Xu, D. Ju, M. Li, Y.-L. Boureau, J. Weston, and E. Dinan. Bot-Adversarial Dialogue for Safe Conversational Agents. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computatio...

[64]

D. M. Ziegler, S. Nix, L. Chan, T. Bauman, P. Schmidt-Nielsen, T. Lin, A. Scherlis, N. Nabeshima, B. Weinstein-Raun, D. de Haas, B. Shlegeris, and N. Thomas. Adversarial Training for High-Stakes Reliability, May 2022. arXiv:2205.01663 [cs]