Abliteration and prefilling attacks raise harm success rates on safeguarded open-weight LLMs from below 10% to 16-96% across three benchmarks, and a new ART tuning method reduces those rates by 10-20%.
1 Pith paper cites this work.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers:
Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks