HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers explorer
- Test-Time Safety Alignment
  Optimizing input embeddings sub-lexically via black-box zeroth-order gradients neutralizes all safety-flagged responses from aligned models on standard benchmarks.
- Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
  LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not yet safe for clinical use.