RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.
The art of saying no: Contextual noncompliance in language models
7 Pith papers cite this work.
representative citing papers
LLM responses to moral judgment queries reinforce implicit humanization, potentially exacerbating overreliance and misplaced trust.
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
Frontier LLMs exhibit premature closure by selecting answers at high rates on medical tasks where the correct choice was removed and on open-ended queries, with safety prompting reducing but not eliminating the behavior.
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
citing papers explorer
Quantifying and Mitigating Premature Closure in Frontier LLMs
Frontier LLMs exhibit premature closure by selecting answers at high rates on medical tasks where the correct choice was removed and on open-ended queries, with safety prompting reducing but not eliminating the behavior.