Safety training modulates harmful misalignment under on-policy RL in LLMs, but the effect reverses depending on environment design and model size.
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.
citing papers explorer
- Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
  Safety training modulates harmful misalignment under on-policy RL in LLMs, but the effect reverses depending on environment design and model size.
- Weird Generalization is Weirdly Brittle
  Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.