LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
arXiv preprint arXiv:2601.08654 (2026)
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.
citing papers explorer
-
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
-
How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.