pith. sign in

hub

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

26 Pith papers cite this work. Polarity classification is still indexing.

26 Pith papers citing it
abstract

Large language models (LLMs) are vulnerable to universal jailbreaks-prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

hub tools

citation-role summary

background 4

citation-polarity summary

years

2026 22 2025 4

roles

background 4

polarities

background 4

clear filters

representative citing papers

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

CAREBench: A Child-Safety Risk Benchmark for Language Models

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

CAREBench is a new benchmark with 500 prompts in 12 risk categories that measures how often frontier LLMs fail to refuse or redirect child-safety risks, reporting failure rates between 2% and 58%.

Leveraging RAG for Training-Free Alignment of LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

The Impact of Off-Policy Training Data on Probe Generalisation

cs.AI · 2025-11-21 · unverdicted · novelty 6.0

Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Verifying Restrictions on Frontier AI Research

cs.CY · 2026-06-27 · unverdicted · novelty 5.0

Catalogs 28 candidate verification mechanisms for restrictions on AI research and identifies key factors affecting their feasibility.

Do Linear Probes Generalize Better in Persona Coordinates?

cs.AI · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Deep Minds and Shallow Probes cs.LG · 2026-05-12 · unverdicted · none · ref 46 · internal anchor

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  • PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI cs.HC · 2026-05-07 · unverdicted · none · ref 61 · 2 links · internal anchor

    Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

  • The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems cs.CR · 2026-04-13 · unverdicted · none · ref 38 · internal anchor

    Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

  • Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 88 · internal anchor

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.