Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Four changes to Activation Oracle training yield marginal capability gains but better practical quality, plus an open-sourced evaluation suite AObench.
citing papers explorer
-
Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision
Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.
-
Building Better Activation Oracles
Four changes to Activation Oracle training yield marginal capability gains but better practical quality, plus an open-sourced evaluation suite AObench.