Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Pith reviewed 2026-05-12 01:32 UTC · model grok-4.3
The pith
RLHF-trained language models become progressively harder to red-team into harmful outputs as they scale up in size, while other training approaches show no such improvement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across the tested model sizes and types, RLHF models show a clear increase in resistance to red team attacks as parameter count grows, whereas plain LMs, prompted LMs, and rejection-sampling LMs exhibit flat trends in attack success rate with scale. The work further catalogs a wide range of elicited harms, from overt offensive language to subtler non-violent unethical content, and supplies the complete attack dataset together with detailed methodology for community use.
What carries the argument
Comparative red-team attack success rates measured across four training regimes (plain LM, HHH-prompted LM, rejection sampling, RLHF) at three parameter scales (2.7B, 13B, 52B), with the RLHF regime being the only one whose resistance to attack improves with scale.
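As a concrete illustration of how that comparison can be tabulated from record-level attack data, the sketch below groups per-attack outcomes by training regime and parameter count. The column names and sample values are hypothetical, not the schema of the released dataset.

```python
# Minimal sketch (not the paper's code): tabulating red-team attack success
# by training regime and parameter count from record-level attack data.
# Column names and values are assumptions, not the released dataset's schema.
import pandas as pd

records = pd.DataFrame({
    "model_type": ["plain_lm", "plain_lm", "rlhf", "rlhf"],
    "n_params":   [2.7e9, 52e9, 2.7e9, 52e9],
    "attack_succeeded": [1, 1, 1, 0],
})

# The mean of the binary success flag within each (regime, scale) cell is the
# attack success rate whose trend with scale carries the paper's claim.
success_rates = (
    records
    .groupby(["model_type", "n_params"])["attack_succeeded"]
    .mean()
    .unstack("n_params")
)
print(success_rates)
```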
If this is right
- Larger RLHF models will likely need more advanced or automated red-teaming methods to continue uncovering residual harms.
- The released attack dataset supplies a public benchmark that future safety methods can be measured against.
- Training regimes other than RLHF do not appear to confer the same scaling advantage in resistance to attack.
- Transparency in red-teaming procedures enables shared standards for evaluating model safety across labs.
Where Pith is reading between the lines
- If the scaling pattern holds, RLHF-style training may provide a practical route for safety to improve alongside raw capability at larger scales.
- The flat trends for non-RLHF models suggest that prompt engineering or rejection sampling alone are unlikely to close the safety gap as models grow.
- The methods could be extended to test whether similar scaling resistance appears in multimodal or agentic systems trained with comparable feedback.
Load-bearing premise
The particular red-teaming instructions, prompts, and attack strategies used in the study are comprehensive enough to surface most or all of the harmful behaviors these models can exhibit.
What would settle it
A follow-up experiment that applies the same or closely matched attack distribution to a substantially larger RLHF model (for example 100B+ parameters) and measures an attack success rate that does not continue to decline, or that rises, would falsify the reported scaling trend for RLHF.
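One rough way to operationalize that check, sketched below, is to fit the trend of attack success rate against log parameter count for the RLHF regime and compare a new 100B+ measurement against the extrapolated decline. The rates used here are placeholders, not values from the paper.

```python
# Minimal sketch of the falsification check described above. The success-rate
# values are placeholders, not the paper's measurements.
import numpy as np

log_params = np.log10([2.7e9, 13e9, 52e9])
rlhf_success_rate = np.array([0.45, 0.35, 0.25])   # hypothetical values

# Fit a linear trend of success rate against log10(parameter count).
slope, intercept = np.polyfit(log_params, rlhf_success_rate, deg=1)

# Extrapolate to a 100B+ model; a measured rate at or above the smaller models'
# rates (i.e. no further decline) would contradict the reported scaling trend.
predicted_100b = slope * np.log10(100e9) + intercept
print(f"fitted slope: {slope:.3f}, extrapolated rate at 100B: {predicted_100b:.3f}")
```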
Original abstract
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes early efforts to red team language models to discover, measure, and reduce harmful outputs. It makes three contributions: (1) an investigation of scaling behaviors for red teaming success across three model sizes (2.7B, 13B, 52B) and four model types (plain LM, prompted helpful/honest/harmless, rejection sampling, and RLHF), finding that RLHF models become increasingly difficult to red team with scale while the other types show flat trends; (2) release of a dataset containing 38,961 red team attacks; and (3) detailed descriptions of instructions, processes, statistical methodologies, and uncertainties, along with analysis of the harmful outputs elicited (ranging from offensive language to subtle unethical behaviors).
Significance. If the scaling trends prove robust, the work supplies concrete empirical data on how alignment methods like RLHF affect vulnerability to adversarial elicitation of harms, informing safer deployment of larger models. The public release of the large attack dataset is a clear asset that enables independent verification and further research on red teaming techniques. The paper's emphasis on methodological transparency and explicit discussion of uncertainties is a positive contribution toward community standards in AI safety evaluation.
major comments (1)
- [Scaling behaviors] Scaling behaviors section (and abstract claim): The central result that RLHF models are increasingly difficult to red team with scale, while other model types remain flat, assumes consistent red teaming effort and strategy across conditions. The manuscript does not report per-model metrics on attack persistence (e.g., average turns per conversation, number of unique prompt variants tried, or stopping criteria) or indicate whether red teamers were blinded to model identity or type. Without such controls, lower success rates on larger RLHF models could reflect differences in human effort or adaptation rather than intrinsic scaling of refusal behavior. Although the released dataset permits post-hoc checks, the paper should include an analysis of effort-related statistics across the four model types to support the scaling interpretation.
minor comments (2)
- [Methods] Methods section: While the paper states it exhaustively describes statistical methodologies, adding explicit formulas or pseudocode for how red team success rates and uncertainty estimates were computed (including any adjustments for multiple comparisons across model sizes) would improve reproducibility; a minimal sketch of one such computation appears after this list.
- [Dataset] Dataset description: The release of 38,961 attacks is valuable, but the paper would benefit from additional metadata on red teamer demographics, experience levels, and any training provided, to allow readers to assess potential sources of bias in the attack distribution.
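For instance, one standard choice the authors could document explicitly is a Wilson score interval around each measured success rate. The sketch below illustrates that choice; it is not the paper's actual procedure.

```python
# Illustrative (not the paper's method): success rate with a 95% Wilson score
# interval for k successful attacks out of n attempts against one model.
import math

def success_rate_with_wilson(k: int, n: int, z: float = 1.96):
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return p_hat, center - half_width, center + half_width

# Hypothetical counts: 120 successful attacks out of 500 attempts.
rate, lo, hi = success_rate_with_wilson(120, 500)
print(f"rate={rate:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```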
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the paper's contributions to red teaming methods and the public dataset release. We address the major comment below.
Point-by-point responses
Referee: [Scaling behaviors] Scaling behaviors section (and abstract claim): The central result that RLHF models are increasingly difficult to red team with scale, while other model types remain flat, assumes consistent red teaming effort and strategy across conditions. The manuscript does not report per-model metrics on attack persistence (e.g., average turns per conversation, number of unique prompt variants tried, or stopping criteria) or indicate whether red teamers were blinded to model identity or type. Without such controls, lower success rates on larger RLHF models could reflect differences in human effort or adaptation rather than intrinsic scaling of refusal behavior. Although the released dataset permits post-hoc checks, the paper should include an analysis of effort-related statistics across the four model types to support the scaling interpretation.
Authors: We thank the referee for highlighting this important potential confound in our scaling analysis. We agree that the absence of reported effort metrics leaves room for alternative interpretations. Our red-teaming protocol used identical instructions, attack strategies, and stopping criteria for all model types and sizes, as described in the methods. However, we did not report per-model statistics on conversation length or prompt variants, and red teamers were not blinded to model identity. To address this directly, we will perform a post-hoc analysis of the released dataset of 38,961 attacks to compute effort-related metrics (average turns, unique prompt variants attempted) broken down by model type and size, and include these results in the revised manuscript. This addition will support the interpretation that the RLHF scaling trend reflects model behavior rather than differences in human effort.
Revision: yes
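A minimal version of the proposed post-hoc effort analysis could look like the sketch below, which averages turn counts and unique prompt counts per regime and scale. The field names are hypothetical and would need to be mapped onto the released dataset's actual schema.

```python
# Sketch of the effort analysis proposed above, assuming per-conversation
# fields for model type, model size, turn count, and unique prompt count.
# Field names and values are hypothetical.
import pandas as pd

conversations = pd.DataFrame({
    "model_type":         ["rlhf", "rlhf", "plain_lm"],
    "n_params":           [52e9, 2.7e9, 52e9],
    "num_turns":          [6, 3, 4],
    "num_unique_prompts": [5, 3, 4],
})

# Comparable average effort across cells would argue against the confound that
# red teamers simply tried less hard against the larger RLHF models.
effort = (
    conversations
    .groupby(["model_type", "n_params"])[["num_turns", "num_unique_prompts"]]
    .mean()
)
print(effort)
```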
Circularity Check
No significant circularity; empirical scaling trends derived from direct measurements
Full rationale
The paper reports empirical results from human red-teaming experiments across model scales and types, with the central claim (RLHF models become harder to red-team with scale while others show flat trends) resting on observed attack success rates in the released dataset of 38,961 attacks. No mathematical derivations, fitted parameters renamed as predictions, or self-citations are used to establish the scaling behaviors; the trends follow directly from the collected data without reduction to prior inputs or definitions. Self-citations appear only for background methods and are not load-bearing for the scaling observations. The work is self-contained against external benchmarks via the public dataset, which permits independent verification.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human evaluators can reliably identify harmful outputs from language models.
Forward citations
Cited by 43 Pith papers
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
-
Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery
PCAP conditions adversarial searches on attacker personas to raise attack success rates from ~58% to ~97% on large models while increasing prompt diversity.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
-
Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
-
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
-
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
-
KTO: Model Alignment as Prospect Theoretic Optimization
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.
-
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
-
Architecture, Not Scale: Circuit Localization in Large Language Models
Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
-
Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...
-
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...
-
From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
Paired analysis of 1250 LLM interactions shows 61% of responses de-escalate harm, 36% maintain severity, and 3% escalate, with sexual content persisting far more than other categories.
-
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
-
Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.
-
Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles
Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.
-
AVISE: Framework for Evaluating the Security of AI Systems
AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.
-
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
-
AlignCultura: Towards Culturally Aligned Large Language Models?
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
-
Reasoning Structure Matters for Safety Alignment of Reasoning Models
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
-
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
-
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.
-
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than base...
-
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
-
Jailbroken: How Does LLM Safety Training Fail?
LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
Surrogate modeling for interpreting black-box LLMs in medical predictions
A surrogate modeling method approximates LLM-encoded medical knowledge via prompting to quantify variable influence and flag inaccuracies and racial biases.
-
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
-
FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization
FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.
-
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.
-
Brainrot: Deskilling and Addiction are Overlooked AI Risks
AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.