Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A parameter-free decomposition in MoE models separates routing control from content, showing that expert trajectories cluster tokens by semantic function across languages and forms, making paths rather than experts the natural unit of interpretability.
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
citing papers explorer
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs
A parameter-free decomposition in MoE models separates routing control from content, showing that expert trajectories cluster tokens by semantic function across languages and forms, making paths rather than experts the natural unit of interpretability.
-
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.