Newer Claude and GPT models violate their labs' constitutions at substantially lower rates (15% to 2% and 11.7% to 3.6%) than prior generations, though causes cannot be isolated and some failure modes persist.
Chunky post-training: Data driven failures of generalization
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.
citing papers explorer
-
How Well Do Models Follow Their Constitutions?
Newer Claude and GPT models violate their labs' constitutions at substantially lower rates (15% to 2% and 11.7% to 3.6%) than prior generations, though causes cannot be isolated and some failure modes persist.
-
Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.