pith. machine review for the scientific record. sign in

citation dossier

Jailbroken: How does llm safety training fail?

A · 2023 · arXiv 2307.02483

19Pith papers citing it
19reference links
cs.CRtop field · 6 papers
UNVERDICTEDtop verdict bucket · 15 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 19 reviewed papers. Its strongest current cluster is cs.CR (6 papers). The largest review-status bucket among citing papers is UNVERDICTED (15 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom human code.

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 percentage point drop in safety-critical action hit rates.

citing papers explorer

Showing 19 of 19 citing papers.