LLM-Safety Evaluations Lack Robustness

Gauthier Gidel; Leo Schwinn; Simon Geisler; Sophie Xhonneux; Stephan Günnemann; Tim Beyer

arxiv: 2503.02574 · v2 · pith:XWX6Y5XSnew · submitted 2025-03-04 · 💻 cs.CR · cs.AI

Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann

keywords: evaluation, evaluations, future, make, noise, practical, progress, research

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
cs.AI 2026-04 unverdicted novelty 7.0
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack
cs.CR 2026-05 unverdicted novelty 6.0
LLM cascade systems are vulnerable to a new adversarial attack that simultaneously degrades accuracy and destroys the intended cost savings by targeting both the lightweight models and the escalation decision mechanism.

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
cs.CL 2026-04 unverdicted novelty 6.0
LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
cs.CR 2025-12 unverdicted novelty 6.0
A meta-prompt and hierarchical detection framework automates LLM red-teaming, achieving 3.9 times higher vulnerability discovery rate than manual methods with 89% accuracy on GPT-OSS-20B.

Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
cs.CR 2025-06 unverdicted novelty 6.0
Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.