OR-Bench: An over-refusal benchmark for large language models

OR-Bench: An over-refusal benchmark for large language models · 2025 · arXiv 2405.20947

Mixed citation behavior. Most common role is background (60%).

32 Pith papers citing it

Background 60% of classified citations

citation-role summary

background 4 dataset 1

citation-polarity summary

background 3 unclear 1 use dataset 1

representative citing papers

FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences

cs.LG · 2026-06-02 · unverdicted · novelty 8.0

FLIPS identifies LLM instances with 96% closed-set and 90% open-set accuracy by exploiting biases in generated binary random sequences across 237 instances.

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

cs.AI · 2026-06-01 · conditional · novelty 8.0

Current benchmarks overlook abstention competence in agents due to compliance bias; a new three-gap taxonomy and metrics (Safety Rate, Usability Rate, Informed Refusal Rate) demonstrate tunable safety-usability tradeoffs in preliminary tests across five model families.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

OpenSafeIntent benchmark shows models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating current evaluations are insufficient.

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

cs.CR · 2026-04-21 · unverdicted · novelty 7.0

Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, C++ standard, and compilers.

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

cs.SE · 2026-02-02 · unverdicted · novelty 7.0

RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

Addressing Over-Refusal in LLMs with Competing Rewards

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

PsychoSafe is a psychologically-informed refusal framework that improves LLM refusal quality by 28.1% via prompting and fine-tuning on an 8019-pair corpus across five risk domains, with strong in-domain but limited out-of-domain results.

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

CHASE uses co-evolutionary RL with GRPO to harden LLMs against black-box prompt-rewriting attacks, cutting mean StrongREJECT scores by 43.2% on held-out families while keeping zero false refusals on benign prompts.

Triaging Threats to Specialized Guardrails

cs.CR · 2026-05-29 · unverdicted · novelty 6.0

Introduces GuardZoo benchmark and RouteGuard router-expert system showing monolithic guardrails suffer task interference while specialized routing improves threat detection and generalization.

Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

cs.CR · 2026-05-23 · unverdicted · novelty 6.0

Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

Robust Policy Optimization to Prevent Catastrophic Forgetting

cs.LG · 2026-02-09 · unverdicted · novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

cs.SE · 2025-08-15 · unverdicted · novelty 6.0

ORFuzz presents the first evolutionary testing framework for LLM over-refusal together with a new benchmark of 1,855 cases that triggers over-refusal at 63.56% average across ten models.

Auditing Proprietary Alignment in Large Language Models: A Comparative Framework Without a Ground-Truth Standard

cs.CL · 2026-06-07 · unverdicted · novelty 5.0

A comparative statistical framework is proposed to audit proprietary alignment in black-box LLMs by quantifying behavioral divergences from reference models rather than absolute correctness.

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.

Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

cs.AI · 2026-05-22 · unverdicted · novelty 5.0

Palette identifies refusal directions via multi-objective search, internalizes them through lightweight adaptation, and supports on-demand multi-domain authorization via independent learning and parameter merging.

ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

cs.CR · 2025-06-02 · unverdicted · novelty 5.0

ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.

Discriminatory Compliance: How LLMs Answer Queries from Protected Groups

cs.CY · 2026-06-19 · unverdicted · novelty 4.0

State-of-the-art LLMs respond inconsistently to queries from protected-group personas, with some responses omitting key information that should be provided.

citing papers explorer

Showing 8 of 8 citing papers after filters.

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

cs.AI · 2026-06-01 · conditional · none · ref 4

Current benchmarks overlook abstention competence in agents due to compliance bias; a new three-gap taxonomy and metrics (Safety Rate, Usability Rate, Informed Refusal Rate) demonstrate tunable safety-usability tradeoffs in preliminary tests across five model families.

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

cs.AI · 2026-05-10 · unverdicted · none · ref 14

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · none · ref 4

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · none · ref 49

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · none · ref 7

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

cs.AI · 2026-05-22 · unverdicted · none · ref 51

Palette identifies refusal directions via multi-objective search, internalizes them through lightweight adaptation, and supports on-demand multi-domain authorization via independent learning and parameter merging.

The Ethics of LLM Sandbox and Persona Dynamics

cs.AI · 2026-05-27 · unverdicted · none · ref 4

Argues that LLM guardrails generate unethical reality gaps by shifting epistemic risk to users and that ethical AI can become unethical when it prioritizes institutional reassurance over accurate perception.

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

cs.AI · 2026-04-09 · unreviewed · ref 10
