Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming
15 Pith papers cite this work.
2026: 15 representative citing papers
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
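To make the symmetry concrete, here is a minimal worked check in our own notation, not the paper's: a linear probe score is invariant under an affine change of hidden coordinates exactly when its parameters co-transform.

```latex
% Affine reparameterization of hidden coordinates (illustrative notation):
%   h' = A h + b, with A invertible.
% A linear probe s(h) = w^T h + c is coordinate-stable if its parameters
% transform contravariantly, leaving the score unchanged:
\[
  w' = A^{-\top} w, \qquad c' = c - w^{\top} A^{-1} b
\]
\[
  s'(h') = w'^{\top}(A h + b) + c'
         = w^{\top} A^{-1}(A h + b) + c - w^{\top} A^{-1} b
         = w^{\top} h + c = s(h).
\]
```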
-
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
-
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
A persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
-
Toward a Principled Framework for Agent Safety Measurement
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
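A minimal sketch of the budgeted-search idea as we read the summary; the `expand` callback, which enumerates an agent's next actions with branch probabilities and safety labels, is a hypothetical stand-in, not the paper's API.

```python
import heapq

def safe_probability_bounds(root, expand, budget=1000):
    """Budgeted best-first search over agent trajectories (illustrative).

    expand(state) -> list of (next_state, prob, terminal, unsafe) tuples,
    where `prob` is the model's probability of taking that branch.
    Returns (lower, upper) bounds on P(agent stays safe): unexplored
    probability mass remains between the two bounds.
    """
    safe_mass = 0.0
    # Max-heap on prefix probability: expand the most likely prefixes
    # first, so high-probability unsafe paths surface before the budget
    # runs out -- the paths naive sampling tends to miss.
    frontier = [(-1.0, 0, root)]  # (neg prefix prob, tiebreak id, state)
    uid = 1
    while frontier and budget > 0:
        neg_p, _, state = heapq.heappop(frontier)
        budget -= 1
        for nxt, p, terminal, unsafe in expand(state):
            mass = -neg_p * p
            if unsafe:
                continue            # confirmed unsafe: excluded from P(safe)
            if terminal:
                safe_mass += mass   # confirmed safe trajectory
            else:
                heapq.heappush(frontier, (-mass, uid, nxt))
                uid += 1
    # Mass still on the frontier is unresolved either way.
    unresolved = sum(-m for m, _, _ in frontier)
    return safe_mass, safe_mass + unresolved
```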
-
Leveraging RAG for Training-Free Alignment of LLMs
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.
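A sketch of what inference-time conditioning on contrastive preference samples could look like; `retrieve` and `generate` are hypothetical stand-ins, and this is our reading of the summary rather than the paper's implementation.

```python
def ragpref_generate(query, retrieve, generate, k=3):
    """Training-free alignment via in-context preference conditioning.

    retrieve(query, k) -> list of (prompt, chosen, rejected) preference
    triples similar to the query; generate(prompt) -> model completion.
    The contrastive examples are placed in-context so a frozen model
    conditions on them instead of being fine-tuned.
    """
    triples = retrieve(query, k)
    guidance = "\n\n".join(
        f"Example request: {p}\nAcceptable response: {c}\n"
        f"Unacceptable response: {r}"
        for p, c, r in triples
    )
    prompt = (
        "Follow the acceptable responses below and avoid the unacceptable "
        "ones when answering.\n\n"
        f"{guidance}\n\nRequest: {query}\nResponse:"
    )
    return generate(prompt)
```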
-
Internalizing Safety Understanding in Large Reasoning Models via Verification
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.
-
GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
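The schema-conditioning idea in miniature, with a hypothetical `encode_pair` scorer standing in for the fine-tuned bidirectional encoder: because the category definitions arrive at inference time, one compact model can serve arbitrary policies instead of baking categories into its weights.

```python
def schema_conditioned_scores(text, schema, encode_pair):
    """Score `text` against every category in a policy schema (sketch).

    schema: {category_name: natural-language definition}.
    encode_pair(definition, text) -> P(text violates that category),
    a stand-in for a single bidirectional-encoder forward pass.
    """
    return {
        name: encode_pair(definition, text)
        for name, definition in schema.items()
    }
```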
-
Estimating Tail Risks in Language Model Output Distributions
Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.
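The estimator underneath is standard importance sampling; a minimal sketch, with all four callbacks as hypothetical stand-ins. Sampling from the unsafe variant q, under which harmful outputs are common, and reweighting by the likelihood ratio keeps the estimate unbiased for the target model p while needing far fewer samples than drawing from p directly.

```python
import math

def tail_prob_importance_sampling(sample_unsafe, logp_target, logp_unsafe,
                                  is_harmful, n=1000):
    """Estimate p = P_target(harmful output) via an unsafe proposal model.

    Unbiasedness: E_q[1[harmful(x)] * p(x)/q(x)] = P_p(harmful).
    """
    total = 0.0
    for _ in range(n):
        x = sample_unsafe()          # x ~ q (unsafe model variant)
        if is_harmful(x):
            total += math.exp(logp_target(x) - logp_unsafe(x))
    return total / n
```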
-
Human-Guided Harm Recovery for Computer Use Agents
Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, a reward model, and the BackBench benchmark, which demonstrates improved recovery trajectories.
-
Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
A new segment-level coherence probing method improves true-positive rate for harmful intent detection by 35.55% at 1% false-positive rate and maintains high AUROC on obfuscated attacks.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
The Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
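A sketch of what concept-specific steering from a dictionary-initialized SAE decoder might look like; the shapes, names, and steering rule here are our assumptions, not DACO's published procedure.

```python
import numpy as np

def steer_concept(h, decoder, concept_idx, alpha=-1.0):
    """Adjust one concept's component of a hidden activation (sketch).

    h: (d,) hidden activation; decoder: (n_concepts, d) SAE decoder matrix
    whose rows were initialized from the curated concept dictionary, so
    row `concept_idx` is an interpretable direction for one concept.
    alpha = -1 removes the concept's component; alpha > 0 amplifies it.
    """
    u = decoder[concept_idx]
    u = u / np.linalg.norm(u)
    # Nudge only along the unit concept direction, scaled by how strongly
    # the concept is currently expressed in h.
    return h + alpha * (h @ u) * u
```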
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
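A minimal sketch of streaming trajectory monitoring in this spirit; the risk direction and threshold are assumed to be fit offline from labeled hidden states and are not taken from the paper.

```python
import numpy as np

class TrajectoryMonitor:
    """Flag decoding runs whose hidden-state trajectory drifts toward a
    high-risk region (illustrative sketch)."""

    def __init__(self, risk_direction, threshold):
        self.w = risk_direction / np.linalg.norm(risk_direction)
        self.threshold = threshold
        self.ema = 0.0   # smoothed projection onto the risk direction

    def step(self, hidden_state, beta=0.9):
        """Call once per decoded token with that step's hidden state.
        Returns True if generation should be blocked."""
        risk = float(hidden_state @ self.w)
        self.ema = beta * self.ema + (1.0 - beta) * risk
        return self.ema > self.threshold
```

Per-step cost is a single dot product plus an exponential moving average, which is how a monitor like this can stay in the millisecond-per-token range.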
-
Do Linear Probes Generalize Better in Persona Coordinates?
Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
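A sketch of the persona-coordinates recipe as we read it (layer choice and data handling are assumptions): fit PCA on contrastive activation differences, then train the probe in that subspace rather than in raw activation space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def persona_coordinate_probe(pos_acts, neg_acts, train_X, train_y, k=10):
    """Fit a linear probe in persona principal-component coordinates.

    pos_acts/neg_acts: (n_pairs, d) activations from paired contrastive
    persona prompts; their differences span a persona subspace.
    train_X: (n, d) activations; train_y: (n,) harmful-behavior labels.
    """
    diffs = pos_acts - neg_acts              # contrast vectors
    pca = PCA(n_components=k).fit(diffs)     # top-k persona components
    probe = LogisticRegression(max_iter=1000)
    probe.fit(pca.transform(train_X), train_y)
    return pca, probe

# Usage: project held-out activations into persona coordinates, then score.
# pca, probe = persona_coordinate_probe(pos, neg, X, y)
# scores = probe.predict_proba(pca.transform(X_test))[:, 1]
```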
-
A Systematic Investigation of The RL-Jailbreaker in LLMs
Dense rewards and extended episode lengths in the RL jailbreaking framework are the primary drivers of successful attacks on LLMs.