hub Mixed citations

OR- Bench: An over-refusal benchmark for large language models

Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh · 2024 · arXiv 2405.20947

Mixed citation behavior. Most common role is background (60%).

19 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 19 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 dataset 1

citation-polarity summary

background 3 unclear 1 use dataset 1

representative citing papers

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

cs.CR · 2026-04-21 · unverdicted · novelty 7.0

Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, C++ standard, and compilers.

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

cs.SE · 2026-02-02 · unverdicted · novelty 7.0

RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

Triaging Threats to Specialized Guardrails

cs.CR · 2026-05-29 · unverdicted · novelty 6.0

Introduces GuardZoo benchmark and RouteGuard router-expert system showing monolithic guardrails suffer task interference while specialized routing improves threat detection and generalization.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

Robust Policy Optimization to Prevent Catastrophic Forgetting

cs.LG · 2026-02-09 · unverdicted · novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

cs.SE · 2025-08-15 · unverdicted · novelty 6.0

ORFuzz presents the first evolutionary testing framework for LLM over-refusal together with a new benchmark of 1,855 cases that triggers over-refusal at 63.56% average across ten models.

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.

ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

cs.CR · 2025-06-02 · unverdicted · novelty 5.0

ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.

Discriminatory Compliance: How LLMs Answer Queries from Protected Groups

cs.CY · 2026-06-19 · unverdicted · novelty 4.0

State-of-the-art LLMs respond inconsistently to queries from protected-group personas, with some responses omitting key information that should be provided.

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

cs.LG · 2026-05-28 · unverdicted · novelty 4.0

Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.

Knowledge Distillation Must Account for What It Loses

cs.LG · 2026-04-28 · unverdicted · novelty 4.0 · 2 refs

Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

LLM-Safety Evaluations Lack Robustness

cs.CR · 2025-03-04 · unverdicted · novelty 4.0

LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

cs.AI · 2026-04-09

SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

cs.CR · 2026-04-01

citing papers explorer

Showing 19 of 19 citing papers.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts cs.SE · 2026-05-20 · conditional · none · ref 14
RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents cs.CY · 2026-04-11 · accept · none · ref 8
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium cs.AI · 2026-05-10 · unverdicted · none · ref 14
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms cs.AI · 2026-05-08 · unverdicted · none · ref 4
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery cs.CR · 2026-04-21 · unverdicted · none · ref 32
Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, C++ standard, and compilers.
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing cs.SE · 2026-02-02 · unverdicted · none · ref 16
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
Triaging Threats to Specialized Guardrails cs.CR · 2026-05-29 · unverdicted · none · ref 5
Introduces GuardZoo benchmark and RouteGuard router-expert system showing monolithic guardrails suffer task interference while specialized routing improves threat detection and generalization.
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment cs.AI · 2026-05-03 · unverdicted · none · ref 49
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules cs.AI · 2026-04-03 · unverdicted · none · ref 7
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
Robust Policy Optimization to Prevent Catastrophic Forgetting cs.LG · 2026-02-09 · unverdicted · none · ref 13
FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal cs.SE · 2025-08-15 · unverdicted · none · ref 8
ORFuzz presents the first evolutionary testing framework for LLM over-refusal together with a new benchmark of 1,855 cases that triggers over-refusal at 63.56% average across ten models.
Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing cs.LG · 2026-05-30 · unverdicted · none · ref 15
SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.
ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction cs.CR · 2025-06-02 · unverdicted · none · ref 56
ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
Discriminatory Compliance: How LLMs Answer Queries from Protected Groups cs.CY · 2026-06-19 · unverdicted · none · ref 104
State-of-the-art LLMs respond inconsistently to queries from protected-group personas, with some responses omitting key information that should be provided.
Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content cs.LG · 2026-05-28 · unverdicted · none · ref 6
Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.
Knowledge Distillation Must Account for What It Loses cs.LG · 2026-04-28 · unverdicted · none · ref 41 · 2 links
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
LLM-Safety Evaluations Lack Robustness cs.CR · 2025-03-04 · unverdicted · none · ref 19
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures cs.AI · 2026-04-09 · unreviewed · ref 10
SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits cs.CR · 2026-04-01 · unreviewed · ref 5

OR- Bench: An over-refusal benchmark for large language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer