SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

Baseline reference. 57% of citing Pith papers use this work as a benchmark or comparison.

11 Pith papers citing it

citation-role summary

dataset: 4 · background: 3

representative citing papers

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

cs.CR · 2026-04-17 · conditional · novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs, raising jailbreak success rates to as high as 87%; the dominant risk axis depends on model architecture and on embedding proximity to harmful content.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
