pith. machine review for the scientific record.

Scalable and transferable black-box jailbreaks for language models via persona modulation

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

On the Hardness of Junking LLMs

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.
