Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol
Pith reviewed 2026-06-30 14:02 UTC · model grok-4.3
The pith
A four-variant decomposition using template controls and difference-in-differences isolates alignment-induced activation shifts from chat formatting effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The naive aligned-minus-base activation matrix conflates alignment effects with chat-template formatting. A four-variant decomposition separates the two: the template-controlled and DiD variants yield lower effective ranks, and the DiD variant raises cosine similarity to the refusal direction from 0.18-0.39 to 0.50-0.86. Projection ablation shows that the recovered subspace affects behavior and that singular-value order does not track causal importance.
What carries the argument
The four-variant decomposition of the modification matrix: naive, template-controlled, within-aligned, and difference-in-differences (DiD).
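A minimal sketch of how the four variants could be formed from paired activations, in Python with NumPy. The DiD line follows the definition quoted in the referee report; the other three definitions (which template condition each model is run under) are assumptions of this sketch, as are the function and argument names.

```python
import numpy as np

def four_variants(aligned_tmpl, aligned_plain, base_tmpl, base_plain):
    """Form the four modification matrices at one layer.

    Each argument is an (n_prompts, d_model) array; rows are paired by prompt.
    "tmpl" means the prompt was wrapped in the chat template, "plain" means it was not.
    """
    return {
        # Assumed: aligned model under its template vs. base model without it.
        "naive": aligned_tmpl - base_plain,
        # Assumed: both models see the same templated input.
        "template_controlled": aligned_tmpl - base_tmpl,
        # Assumed: template effect measured inside the aligned model only.
        "within_aligned": aligned_tmpl - aligned_plain,
        # As quoted in the referee report.
        "did": (aligned_tmpl - aligned_plain) - (base_tmpl - base_plain),
    }
```

Under these assumed definitions the DiD matrix equals the template-controlled matrix minus the untemplated aligned-minus-base difference, which is a quick consistency check when wiring up an experiment.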
If this is right
- Template control removes a 2.0-3.9x inflation in measured effective rank.
- The DiD contrast recovers the refusal direction with cosine alignment of 0.50-0.86.
- Projection-ablation confirms the recovered subspace is behaviorally active.
- Singular-value order is not causal order.
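The two geometric quantities in the list above can be sketched as follows. The page does not give the paper's exact effective-rank definition, so this sketch assumes a relative threshold (count singular values at or above a fraction eps of the largest); the function names are hypothetical and the matrix is left uncentered.

```python
import numpy as np

def effective_rank(M, eps=0.05):
    """Assumed definition: number of singular values >= eps * largest singular value."""
    s = np.linalg.svd(M, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s >= eps * s[0]))

def top_direction_cosine(M, ref):
    """|cosine| between the top right singular vector of M and a reference direction."""
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    top = vt[0]  # unit-norm direction in activation space
    return float(abs(top @ ref) / np.linalg.norm(ref))
```

The absolute value is taken because singular vectors are defined only up to sign.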
Where Pith is reading between the lines
- Studies measuring activation changes from other fine-tuning regimes could apply the same controlled decomposition to avoid template confounds.
- Re-evaluating prior activation-difference results with this protocol may revise conclusions about the geometry of safety training.
- The method supplies a practical checklist that could be adopted as standard practice for activation-difference experiments.
Load-bearing premise
The four control variants cleanly separate chat-template effects from alignment-induced shifts without introducing new confounding.
What would settle it
The claim would be undercut if ablating the DiD-recovered direction on held-out refusal prompts produced no larger drop in refusal rate than ablating the naive direction.
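A sketch of the projection step such a test would rely on, assuming ablation means removing the component of each activation that lies in the span of the chosen directions. Measuring refusal rates requires running a model and is not shown; `project_out` and `refusal_drop` are hypothetical helper names.

```python
import numpy as np

def project_out(H, directions):
    """Remove from each row of H its component in the span of `directions`.

    H: (n, d) activations. directions: (d,) or (k, d), assumed linearly independent.
    """
    D = np.atleast_2d(directions)   # (k, d)
    Q, _ = np.linalg.qr(D.T)        # (d, k) orthonormal basis of the span
    return H - (H @ Q) @ Q.T

def refusal_drop(rate_before, rate_after):
    """Drop in refusal rate caused by an ablation (positive means fewer refusals)."""
    return rate_before - rate_after
```

The comparison described above then amounts to checking whether `refusal_drop` for the DiD direction exceeds `refusal_drop` for the naive direction on the same held-out prompts.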
Original abstract
Comparing a model's internal activations before and after alignment is a natural way to ask what safety training changes: one forms the matrix of paired aligned-minus-base activations on safety-relevant inputs and reads off its effective rank or top direction. We show the obvious way to form this matrix is confounded. The aligned model is evaluated under a chat template the base model never saw, so the naive difference conflates the alignment shift with chat formatting. We introduce a four-variant decomposition of the modification matrix (naive, template-controlled, within-aligned, and difference-in-differences, DiD) that separates the two effects. Template control alone removes a 2.0-3.9x inflation of the measured effective rank across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B; the DiD contrast is what recovers the refusal direction of Arditi et al. (2024), lifting its cosine alignment from 0.18-0.39 to 0.50-0.86. Projection-ablation across the three families confirms the recovered subspace is behaviorally active and that singular-value order is not causal order. We validate the protocol on a controlled testbed and distill it into measurement recommendations for activation-difference studies of alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that naive activation-difference matrices between aligned and base models are confounded by chat-template effects, and introduces a four-variant decomposition (naive, template-controlled, within-aligned, and difference-in-differences) to isolate alignment-induced shifts. Template control alone removes a 2.0-3.9x inflation of the measured effective rank; the DiD contrast recovers the refusal direction of Arditi et al. (2024), with cosine similarity lifted from 0.18-0.39 to 0.50-0.86 across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B. Projection-ablation confirms the recovered subspace is behaviorally active, and the protocol is validated on a controlled testbed and then distilled into measurement recommendations.
Significance. If the DiD protocol is unbiased, this is a significant methodological advance for activation-based interpretability of alignment. It directly improves recovery of known directions such as refusal and supplies concrete, actionable recommendations that could reduce confounds in a large body of follow-on work. The multi-family empirical results and testbed validation are concrete strengths that would make the protocol worth adopting if the additivity assumption is shown to hold.
major comments (2)
- [Abstract] Abstract / four-variant decomposition: the DiD estimator is defined as (aligned_with_template − aligned_without) − (base_with_template − base_without) and is presented as recovering an unbiased alignment direction. This identification requires additive separability of template and alignment effects. The manuscript reports no direct test or quantification of interaction magnitude on the three evaluated model families, leaving open the possibility of residual confounding if alignment alters template processing.
- [Validation on controlled testbed] Validation on controlled testbed: while the protocol is validated on a controlled testbed, the abstract does not indicate whether the testbed includes synthetic interaction terms between template and alignment to verify that the DiD estimator remains unbiased when the additivity assumption is violated under realistic conditions.
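The bookkeeping behind both major comments can be written out per prompt. This is a sketch under an assumed parameterization, not the paper's stated model: $h_{m,t}$ is the activation of model $m \in \{b, a\}$ (base, aligned) with template off ($t=0$) or on ($t=1$), and the three contrasts other than DiD use assumed definitions.

```latex
\begin{align*}
h_{b,0} &= \mu, & h_{b,1} &= \mu + \tau, \\
h_{a,0} &= \mu + \delta, & h_{a,1} &= \mu + \delta + \tau + \gamma, \\[4pt]
M_{\text{naive}} &= h_{a,1} - h_{b,0} = \delta + \tau + \gamma, \\
M_{\text{template}} &= h_{a,1} - h_{b,1} = \delta + \gamma, \\
M_{\text{within}} &= h_{a,1} - h_{a,0} = \tau + \gamma, \\
M_{\text{DiD}} &= (h_{a,1} - h_{a,0}) - (h_{b,1} - h_{b,0}) = \gamma.
\end{align*}
```

Four cells fix the four vectors $\mu, \tau, \delta, \gamma$ exactly, so these identities hold without further assumptions. The assumption enters when $\gamma$ is read as the alignment-induced shift: if alignment also changes how template tokens are processed, then $\gamma = \rho + \kappa$ with $\rho$ the behavior-relevant shift and $\kappa$ the processing change, and the DiD contrast returns their sum. The size of $\kappa$ is the interaction magnitude the referee asks to see quantified.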
minor comments (1)
- The abstract would be clearer if it briefly noted the core identifying assumption (additive separability) and any limitations of the DiD approach.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the identification assumptions of the DiD protocol. We address each major comment below and commit to revisions that directly incorporate the suggested clarifications and extensions.
- Referee: [Abstract] Abstract / four-variant decomposition: the DiD estimator is defined as (aligned_with_template − aligned_without) − (base_with_template − base_without) and is presented as recovering an unbiased alignment direction. This identification requires additive separability of template and alignment effects. The manuscript reports no direct test or quantification of interaction magnitude on the three evaluated model families, leaving open the possibility of residual confounding if alignment alters template processing.
  Authors: We agree that the DiD estimator is identified under additive separability of template and alignment effects. The controlled testbed validates unbiased recovery when this assumption holds, and the multi-family results show improved cosine similarity to the refusal direction (0.50-0.86) together with behavioral relevance under projection ablation. We did not provide a direct quantification of interaction magnitude on the three real model families. In the revision we will add such an analysis, for example by comparing the DiD contrast against the within-aligned contrast and by inspecting activation residuals for systematic non-additivity patterns across the evaluated families.
  Revision: yes
- Referee: [Validation on controlled testbed] Validation on controlled testbed: while the protocol is validated on a controlled testbed, the abstract does not indicate whether the testbed includes synthetic interaction terms between template and alignment to verify that the DiD estimator remains unbiased when the additivity assumption is violated under realistic conditions.
  Authors: The testbed was designed to confirm that the four-variant decomposition recovers known alignment effects when template and alignment contributions are additive. It does not currently include synthetic interaction terms to assess bias under violations of additivity. We will revise the abstract to clarify the testbed's scope and extend the testbed with controlled interaction simulations, allowing direct verification of estimator behavior when the assumption is violated.
  Revision: yes
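One way a controlled interaction simulation could look, as a toy sketch only. The generative model here is invented for illustration and is not the paper's testbed: a planted refusal direction `r` appears only in the aligned model under the template, and a hypothetical knob `lam` adds an orthogonal contaminant `c` standing in for alignment-altered template processing.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def did_cosine(lam, n=200, d=32, noise=0.05, seed=0):
    """|cosine| between the top DiD direction and the planted direction r."""
    rng = np.random.default_rng(seed)
    r = unit(rng.normal(size=d))               # planted refusal direction
    c = rng.normal(size=d)
    c = unit(c - (c @ r) * r)                  # contaminant, orthogonal to r
    tau = rng.normal(size=d)                   # template shift shared by both models
    delta = rng.normal(size=d)                 # template-independent alignment drift
    X = rng.normal(size=(n, d))                # per-prompt content
    s = 1.0 + 0.1 * rng.normal(size=(n, 1))    # per-prompt refusal strength
    g = 1.0 + 0.1 * rng.normal(size=(n, 1))    # per-prompt interaction strength

    def jitter():
        return noise * rng.normal(size=(n, d))

    base_plain = X + jitter()
    base_tmpl = X + tau + jitter()
    aligned_plain = X + delta + jitter()
    aligned_tmpl = X + delta + tau + s * r + lam * g * c + jitter()

    did = (aligned_tmpl - aligned_plain) - (base_tmpl - base_plain)
    _, _, vt = np.linalg.svd(did, full_matrices=False)
    return float(abs(vt[0] @ r))
```

Sweeping `lam` from 0 upward and plotting `did_cosine(lam)` would show how quickly recovery of the planted direction degrades as the interaction grows in this toy setting.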
Circularity Check
No circularity: statistical decomposition is self-contained
The paper defines a four-variant activation-difference protocol (naive, template-controlled, within-aligned, DiD) via explicit matrix subtractions on observed activations. The DiD contrast is introduced as a standard econometric decomposition applied to the data; it does not reduce to a fitted parameter, self-citation, or tautological renaming. Reported cosine gains and rank reductions are empirical measurements on three model families, not forced by the protocol equations themselves. No load-bearing self-citations or ansatzes appear in the provided text.
A. Effective rank: bootstrap stability and sample-size sweep
We bootstrap-resample the n=200 paired safety samples with replacement (200 resamples per layer) and recompute rank_ϵ on M_template at ϵ=0.05. The resampling distribution is tight (±1 to ±2 across all layers and families) and centered at or below the full-sample (without-re...
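A sketch of the bootstrap described in this excerpt, using the sample count, resample count, and threshold it states. The effective-rank definition (relative singular-value threshold) and both function names are assumptions of this sketch, since the excerpt does not define rank_ϵ.

```python
import numpy as np

def effective_rank(M, eps=0.05):
    """Assumed definition: number of singular values >= eps * largest singular value."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s >= eps * s[0])) if s[0] > 0 else 0

def bootstrap_rank(M, n_boot=200, eps=0.05, seed=0):
    """Resample rows of M with replacement and recompute the effective rank each time."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    ranks = np.empty(n_boot, dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # n row indices drawn with replacement
        ranks[b] = effective_rank(M[idx], eps)
    return ranks
```

Comparing `ranks.min()` and `ranks.max()` against `effective_rank(M)` gives the kind of spread the excerpt summarizes.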
... attributes the k=5 rebound to ablating u_1 together with the inert u_2 and a weak direction, with full collapse requiring the wider-band block. This is a redundancy reading consistent with the concept cones of Wollschläger et al. (2025), though the per-direction data alone do not exclude alternatives.
Table 9: Llama narrow-band ablation, refusal rate (Wilson 95% CI), nge...
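The table caption reports refusal rates with Wilson 95% intervals. For reference, a sketch of the standard Wilson score interval for a binomial proportion; the function name and the example counts are illustrative and not taken from the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

Unlike the normal-approximation interval, this one stays inside [0, 1] and remains informative when the observed refusal rate is 0 or 1, which matters after an ablation that removes nearly all refusals.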