HarmBench: a standardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks · 2024

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Test-Time Safety Alignment

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.

Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

cs.AI · 2026-04-29 · unverdicted · novelty 5.0

LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not yet safe for clinical use.

citing papers explorer

Showing 2 of 2 citing papers.

Test-Time Safety Alignment · cs.CL · 2026-04-28 · unverdicted · none · ref 23
Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control · cs.AI · 2026-04-29 · unverdicted · none · ref 59