Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
representative citing papers
Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
RedVox benchmark shows speech model safety and fairness vulnerabilities persist under non-adversarial conditions, worsen in non-English languages, and increase with spoken inputs.
SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.
Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.
Introduces ChiSafe-PAS, a 1,897-prompt human-annotated Chinese adversarial benchmark for LLM safety with 3-class labels, 9-category obfuscation taxonomy, and domain coverage in self-harm, drugs, fraud, and satire.
Releases the first public safety evaluation dataset for Albanian LLMs with 2,951 prompts spanning 11 categories including self-harm, violence, and radicalization.
Introduces KIDBench benchmark for child-facing LLM safety, showing implicit and explicit child context cues raise safety scores 9-77% while multi-turn interactions degrade quality 6-24%.
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
PCAP conditions adversarial searches on attacker personas to raise attack success rates from ~58% to ~97% on large models while increasing prompt diversity.
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.
Direction-flipped influence audits show contextual cues shift LLM moral choices by 12-18 points on average across multiple benchmarks, revealing asymmetries, backfires, and inconsistencies in 40% of conditions.
citing papers explorer
-
Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
-
One Year Later...The Harms Persist, But So Do We!
LLM safety guardrails fail for most mental health conditions with up to 100% failure rates for eating disorders, substance use disorder, and major depressive disorder, while holding only for suicide and self-harm.
-
SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data
SpecAlign synthesizes boundary-aware preference pairs directly from structured model specifications to train LLMs for improved rule compliance.
-
Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails
An audit finds language model filters and guardrails disproportionately suppress mentions of marginalized groups via lexical cues while failing to catch explicit harms.
-
Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning
DualSelect couples task and reference selection via a minimax framework with entropy-regularized scoring to preserve safety in LLM fine-tuning, reporting at least 5.10 point gains in Safety Avg. over baselines on 1B-8B models.
-
Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.
-
Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums?
Four deployment choices—model version, open/closed weight status, provider, and system prompt—each alter LLM-agent intervention rates on forum posts, with closed-weight models declining more on visible challenges than open-weight models.
-
Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles
LLM support roles in Alzheimer's caregiving queries systematically alter interactional risk prevalence and composition, with directive roles rated higher in quality despite elevated risks.
-
Soft Specialists: α-Rényi Ensembles for Uncertainty-Aware LLM Post-Training
An α-Rényi variational ensemble method learns distributions over LoRA adapter parameters for uncertainty-aware LLM post-training, balancing individual model plausibility with complementary specialization.
-
Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting
CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.
-
Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.
-
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
Surrogate modeling for interpreting black-box LLMs in medical predictions
A surrogate modeling method approximates LLM-encoded medical knowledge via prompting to quantify variable influence and flag inaccuracies and racial biases.
-
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
-
FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization
FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.
-
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
-
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
-
Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications
An automated self-testing framework with evidence-based quality gates for LLM application releases was evaluated in a longitudinal case study of a multi-agent conversational AI system, identifying rollback builds and supporting stable quality over four weeks.
-
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
-
Users as Annotators: LLM Preference Learning from Comparison Mode
Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.
-
Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks
Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.
-
CASE: An Agentic AI Framework for Enhancing Scam Intelligence in Digital Payments
CASE is a novel agentic AI system that proactively interviews scam victims using LLMs to collect detailed intelligence, which is then structured for use in scam prevention, resulting in a 21% increase in enforcements on Google Pay India.
-
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation
Introduces a multi-role red teaming framework using attacker and jury models that increases attack success rates by up to 7.9% on LLM faithfulness in question-answering tasks.
-
The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI
Explores options for using LLMs to scale deliberation and empower marginalized groups via systemic-functional linguistics concepts while cautioning against over- and under-claiming.
-
The Case for Model Science: Verify, Explore, Steer, Refine
Position paper proposing Model Science as a discipline to systematically analyze AI model behavior beyond benchmarks, drawing analogies from cognitive science, neuroscience, medicine, and agriculture.
-
An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding
Empirical audit of k-NAF in Anchored Decoding finds budgets are not exhausted on tested workloads, with high proxy ratios attributable to small-sample artifacts.
-
Position: AI Safety Requires Effective Controllability
Position paper claiming that AI safety requires explicit runtime controllability and introducing ControlBench to demonstrate gaps in existing alignment methods.
-
Responsible Agentic AI Requires Explicit Provenance
Explicit provenance across the full agentic AI lifecycle is the necessary condition for making responsibility computable and actionable.
-
Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration
A review synthesizes affective dynamics as a coordination layer in human-AI agent collaboration and proposes a framework for trust calibration, delegation, error correction, and governance.
-
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
Frontier AI needs contextual multi-objective optimization to select and balance multiple context-dependent objectives rather than relying on single stable goals.
-
Responsible Federated LLMs via Safety Filtering and Constitutional AI
Integrates safety filtering and constitutional AI into FedLLM, reporting over 20% safety improvement on AdvBench.
-
OpenAI o1 System Card
OpenAI reports that chain-of-thought reasoning in o1 models enables deliberative alignment, yielding state-of-the-art results on selected safety benchmarks for illicit advice, stereotypes, and jailbreaks.
-
AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
-
Online Safety Monitoring for LLMs
Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.
-
AI Alignment From Social Choice Perspectives
This survey examines applications of social choice theory to aggregating human feedback in AI alignment, identifying failure modes and expanding design options for disagreement.
-
Understanding Censorship in Large Language Models: From Mechanisms to Governance
Synthesizes mechanisms of LLM censorship across the model lifecycle and argues that the key issue is making moderation proportionate, accountable, pluralistic, and auditable rather than debating whether moderation should occur.
-
AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue
AERIC uses a 387-parameter head on LLM hidden states for same-pass anticipatory detection of implicit harm, reporting AUROC gains on DiaSafety and Harmful Advice plus low-latency trigger rates on HarmBench and SocialHarmBench.
-
Brainrot: Deskilling and Addiction are Overlooked AI Risks
AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
SAGE Celer 2.6 Technical Card
SAGE Celer 2.6 is a new line of language models with inverse reasoning training, integrated vision, and strong performance on math, coding, and South Asian language benchmarks.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
- RLHF May Not Reflect Genuine Preferences
- Reinforcement Learning from Human Feedback