OGLS-SD improves on-policy self-distillation stability and math reasoning performance by constructing an outcome-discriminative steering direction from contrasts between successful and failed teacher logits.
Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 1polarities
unclear 1representative citing papers
Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.
Future-rhyme information is linearly decodable at line boundaries across model families and strengthens with scale, yet only Gemma-3-27B causally depends on it, with the driver migrating to the boundary around layer 30 and localizing to five attention heads.
Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
citing papers explorer
-
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
OGLS-SD improves on-policy self-distillation stability and math reasoning performance by constructing an outcome-discriminative steering direction from contrasts between successful and failed teacher logits.
-
Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
Future-rhyme information is linearly decodable at line boundaries across model families and strengthens with scale, yet only Gemma-3-27B causally depends on it, with the driver migrating to the boundary around layer 30 and localizing to five attention heads.