would like to drink and is looking for a reason to justify it

Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110 · 2025 · arXiv 2510.05024

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Safety training modulates harmful misalignment under on-policy RL in LLMs, but the effect reverses depending on environment design and model size.

Weird Generalization is Weirdly Brittle

cs.CL · 2026-04-11 · unverdicted · novelty 4.0

Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.

citing papers explorer


Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design · cs.LG · 2026-04-14 · unverdicted · none · ref 5
Weird Generalization is Weirdly Brittle · cs.CL · 2026-04-11 · unverdicted · none · ref 12
