Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

Yuki Nakamura

arXiv: 2605.24583 · v3 · pith:YUKIMGA4new · submitted 2026-05-23 · cs.LG · cs.CL · stat.ML

Pith reviewed 2026-06-30 14:02 UTC · model grok-4.3

classification cs.LG · cs.CL · stat.ML

keywords activation differences · alignment · difference-in-differences · chat templates · refusal direction · effective rank · safety training · language models

The pith

A four-variant decomposition using template controls and difference-in-differences isolates alignment-induced activation shifts from chat formatting effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard before-and-after activation comparisons in aligned language models mix the effects of safety training with the chat templates that base models never encountered. The paper decomposes the aligned-minus-base matrix into four variants—naive, template-controlled, within-aligned, and difference-in-differences—to separate the two sources of change. Template control alone cuts the measured effective rank of the shift by 2.0-3.9 times across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B, while the DiD variant raises cosine similarity to the refusal direction from 0.18-0.39 to 0.50-0.86. Projection ablation confirms the recovered subspace influences refusal behavior and that singular-value rank does not determine behavioral impact. Accurate isolation matters for any study that treats activation differences as direct readouts of what alignment training changes inside a model.

Core claim

The naive aligned-minus-base activation matrix conflates alignment effects with chat template formatting. A four-variant decomposition separates these, with template-controlled and DiD variants yielding lower effective ranks and higher cosine similarity to the refusal direction (0.50-0.86 vs 0.18-0.39). Projection ablation shows the recovered subspace affects behavior, and singular values do not indicate causal importance.
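The cosine figures quoted above compare a direction read off a modification matrix with a reference refusal direction. A minimal sketch of that readout follows, assuming the direction is the leading right singular vector of an uncentered (n_samples, d_model) matrix and that the sign is ignored; both choices are assumptions, since the summary does not spell them out. The names `top_direction`, `abs_cosine`, and the toy data are hypothetical.

```python
import numpy as np

def top_direction(M: np.ndarray) -> np.ndarray:
    """Leading right singular vector of an (n_samples, d_model) matrix."""
    # full_matrices=False keeps Vt at shape (min(n, d), d).
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[0]

def abs_cosine(u: np.ndarray, v: np.ndarray) -> float:
    """|cos| between two vectors. The sign is dropped because a singular
    vector is only defined up to sign (an assumption about the readout)."""
    return float(abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy data (hypothetical): every row shifts mostly along one known direction.
rng = np.random.default_rng(0)
d = 16
refusal_dir = np.zeros(d)
refusal_dir[0] = 1.0
coeffs = rng.normal(3.0, 0.5, size=(200, 1))           # per-sample shift size
M_toy = coeffs * refusal_dir + 0.05 * rng.normal(size=(200, d))

cos_toy = abs_cosine(top_direction(M_toy), refusal_dir)
```

On real models, `M_toy` would be replaced by one of the four variant matrices at a chosen layer, and `refusal_dir` by a direction extracted separately, for example with the procedure of Arditi et al. (2024).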

What carries the argument

The four-variant decomposition of the modification matrix: naive, template-controlled, within-aligned, and difference-in-differences (DiD).
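As a rough illustration, the four variants can be written as matrix subtractions over paired activations. Only the DiD formula is taken from this page (it is quoted in the referee report); the definitions of the other three variants below are assumptions inferred from their names, and the function name `four_variants` and the toy arrays are hypothetical.

```python
import numpy as np

def four_variants(aligned_tmpl, aligned_plain, base_tmpl, base_plain):
    """Each input is an (n_prompts, d_model) activation matrix at one layer,
    with rows paired by prompt. 'tmpl' = chat template applied, 'plain' = not."""
    return {
        # ASSUMPTION: aligned model under its template vs base on the raw prompt.
        "naive": aligned_tmpl - base_plain,
        # ASSUMPTION: both models see the same templated prompt.
        "template_controlled": aligned_tmpl - base_tmpl,
        # ASSUMPTION: effect of the template inside the aligned model only.
        "within_aligned": aligned_tmpl - aligned_plain,
        # Formula as quoted in the referee report on this page.
        "did": (aligned_tmpl - aligned_plain) - (base_tmpl - base_plain),
    }

# Toy check (hypothetical data): a template offset shared by both models.
rng = np.random.default_rng(0)
n, d = 8, 4
base_plain = rng.normal(size=(n, d))
aligned_plain = rng.normal(size=(n, d))
tau = rng.normal(size=d)      # template offset common to both models
delta = rng.normal(size=d)    # extra shift seen only in aligned + template
base_tmpl = base_plain + tau
aligned_tmpl = aligned_plain + tau + delta

variants = four_variants(aligned_tmpl, aligned_plain, base_tmpl, base_plain)
```

In this toy setup the shared offset `tau` cancels in the DiD variant and remains in the naive one, which is the confound the decomposition is meant to expose.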

If this is right

Template control removes a 2.0-3.9x inflation in measured effective rank.
The DiD contrast recovers the refusal direction with cosine alignment of 0.50-0.86.
Projection-ablation confirms the recovered subspace is behaviorally active.
Singular-value order is not causal order.
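To make the rank-inflation point concrete, here is a small sketch of one common threshold-style effective rank: the number of singular values at or above a fraction eps of the largest. This definition is an assumption; the page only mentions a rank evaluated at ε = 0.05 without giving the formula. The function name and toy matrices are hypothetical.

```python
import numpy as np

def effective_rank(M: np.ndarray, eps: float = 0.05) -> int:
    """Count singular values >= eps * largest singular value.
    ASSUMPTION: threshold-style definition, not confirmed by the source."""
    s = np.linalg.svd(M, compute_uv=False)   # returned in descending order
    if s[0] == 0.0:
        return 0
    return int(np.sum(s >= eps * s[0]))

# Toy data (hypothetical): a 3-dimensional "alignment" shift, plus a
# 5-dimensional "template" shift living in different coordinates.
rng = np.random.default_rng(0)
E = np.eye(16)
M_align = rng.normal(size=(50, 3)) @ E[:3]
M_template = rng.normal(size=(50, 5)) @ E[3:8]

rank_clean = effective_rank(M_align)
rank_mixed = effective_rank(M_align + M_template)
```

The mixed matrix reports a larger effective rank than the clean one even though the "alignment" component is unchanged, which mirrors the inflation that template control is reported to remove.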

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Studies measuring activation changes from other fine-tuning regimes could apply the same controlled decomposition to avoid template confounds.
Re-evaluating prior activation-difference results with this protocol may revise conclusions about the geometry of safety training.
The method supplies a practical checklist that could be adopted as standard practice for activation-difference experiments.

Load-bearing premise

The four control variants cleanly separate chat-template effects from alignment-induced shifts without introducing new confounding.

What would settle it

The claim would be undermined if ablating the DiD-recovered direction on held-out refusal prompts produced no larger drop in refusal rate than ablating the naive direction.
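The ablation in that test can be sketched as a projection that removes a direction (or a small subspace) from activations. This is a generic linear-algebra sketch under the assumption that ablation means orthogonal projection onto the complement of the chosen directions; it does not hook into any model, and measuring refusal rates is left out. The function name `ablate` is hypothetical.

```python
import numpy as np

def ablate(X: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove the span of `directions` from each row of X.

    X: (n, d_model) activations. directions: (d_model,) or (k, d_model),
    assumed linearly independent. They need not be orthonormal: QR builds
    an orthonormal basis Q for the same subspace first.
    """
    D = np.atleast_2d(np.asarray(directions, dtype=float))   # (k, d)
    Q, _ = np.linalg.qr(D.T)                                  # (d, k)
    return X - (X @ Q) @ Q.T

# Toy usage (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))
u = rng.normal(size=6)          # stand-in for a recovered direction
U2 = rng.normal(size=(2, 6))    # stand-in for a two-direction subspace

X_abl = ablate(X, u)
X_abl2 = ablate(X, U2)
```

Comparing the DiD and naive directions would then amount to running the model with each ablation applied at the relevant layer and comparing refusal rates on the same held-out prompts.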

Figures

Figures reproduced from arXiv: 2605.24583 by Yuki Nakamura.

Original abstract

Comparing a model's internal activations before and after alignment is a natural way to ask what safety training changes: one forms the matrix of paired aligned-minus-base activations on safety-relevant inputs and reads off its effective rank or top direction. We show the obvious way to form this matrix is confounded. The aligned model is evaluated under a chat template the base model never saw, so the naive difference conflates the alignment shift with chat formatting. We introduce a four-variant decomposition of the modification matrix (naive, template-controlled, within-aligned, and difference-in-differences, DiD) that separates the two effects. Template control alone removes a 2.0-3.9x inflation of the measured effective rank across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B; the DiD contrast is what recovers the refusal direction of Arditi et al. (2024), lifting its cosine alignment from 0.18-0.39 to 0.50-0.86. Projection-ablation across the three families confirms the recovered subspace is behaviorally active and that singular-value order is not causal order. We validate the protocol on a controlled testbed and distill it into measurement recommendations for activation-difference studies of alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The DiD decomposition cleans template confounding from activation differences and reports clear gains on rank and direction recovery, but rests on an untested additivity assumption.

The main takeaway is that this paper supplies a four-variant decomposition (naive, template-controlled, within-aligned, DiD) that separates chat-template effects from alignment-induced activation changes. On three model families it shows that template control alone removes a 2.0-3.9x inflation of the measured effective rank, and that the full DiD contrast raises cosine similarity to the Arditi et al. (2024) refusal direction from 0.18-0.39 to 0.50-0.86.

What is actually new is the specific template-controlled DiD applied to activation matrices; the abstract indicates this exact breakdown is not in the cited prior work. The paper does well by including a controlled testbed validation, projection-ablation checks that the recovered subspace is behaviorally active, and explicit measurement recommendations. Those elements make the protocol reproducible enough to try.

The soft spot is the core assumption that template-induced activation shifts are identical in base and aligned models. The DiD formula subtracts the two differences, which leaves residual confounding if alignment changes how the model handles the template (attention patterns, refusal circuitry, etc.). The abstract reports the gains but gives no direct test of effect additivity or interaction size on the three families. That is the load-bearing assumption and it is not yet stress-tested in the provided summary.

This is for researchers who run before-after activation comparisons in alignment interpretability. A reader who needs a practical fix for template confounding will find usable numbers and a clear protocol. It deserves peer review because the methodological issue is real, the empirical deltas are concrete, and the testbed step provides a starting point for checking the assumption.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that naive activation-difference matrices between aligned and base models are confounded by chat-template effects, and introduces a four-variant decomposition (naive, template-controlled, within-aligned, and difference-in-differences) to isolate alignment-induced shifts. Template control alone removes a 2.0-3.9x inflation of the measured effective rank; the DiD contrast recovers the refusal direction of Arditi et al. (2024), with cosine similarity lifted from 0.18-0.39 to 0.50-0.86 across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B. Projection-ablation confirms the recovered subspace is behaviorally active, and the protocol is validated on a controlled testbed before being distilled into measurement recommendations.

Significance. If the DiD protocol is unbiased, this is a significant methodological advance for activation-based interpretability of alignment. It directly improves recovery of known directions such as refusal and supplies concrete, actionable recommendations that could reduce confounds in a large body of follow-on work. The multi-family empirical results and testbed validation are concrete strengths that would make the protocol worth adopting if the additivity assumption is shown to hold.

major comments (2)

[Abstract / four-variant decomposition] The DiD estimator is defined as (aligned_with_template − aligned_without) − (base_with_template − base_without) and is presented as recovering an unbiased alignment direction. This identification requires additive separability of template and alignment effects. The manuscript reports no direct test or quantification of interaction magnitude on the three evaluated model families, leaving open the possibility of residual confounding if alignment alters template processing.
[Validation on controlled testbed] While the protocol is validated on a controlled testbed, the abstract does not indicate whether the testbed includes synthetic interaction terms between template and alignment to verify that the DiD estimator remains unbiased when the additivity assumption is violated under realistic conditions.
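The estimator quoted in the first comment can be rearranged to show what cancels and what does not. The notation below is introduced here for illustration only: h with subscripts A/B for the aligned/base model and t/p for templated/plain input, τ for a template shift shared by both models, and δ for whatever additional shift the aligned model shows under the template. None of these symbols come from the paper.

```latex
\begin{aligned}
\Delta_{\mathrm{DiD}}
  &= \bigl(h_{A,t} - h_{A,p}\bigr) - \bigl(h_{B,t} - h_{B,p}\bigr) \\
  &= \bigl(h_{A,t} - h_{B,t}\bigr) - \bigl(h_{A,p} - h_{B,p}\bigr), \\[6pt]
h_{B,t} - h_{B,p} = \tau,\quad
h_{A,t} - h_{A,p} = \tau + \delta
  &\;\Longrightarrow\; \Delta_{\mathrm{DiD}} = \delta .
\end{aligned}
```

Read this way, a template shift that is identical in both models drops out exactly. Anything that alignment changes about how the template is processed stays inside δ, which is the residual confounding the first comment asks the authors to quantify.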

minor comments (1)

The abstract would be clearer if it briefly noted the core identifying assumption (additive separability) and any limitations of the DiD approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the identification assumptions of the DiD protocol. We address each major comment below and commit to revisions that directly incorporate the suggested clarifications and extensions.

Referee: [Abstract / four-variant decomposition] The DiD estimator is defined as (aligned_with_template − aligned_without) − (base_with_template − base_without) and is presented as recovering an unbiased alignment direction. This identification requires additive separability of template and alignment effects. The manuscript reports no direct test or quantification of interaction magnitude on the three evaluated model families, leaving open the possibility of residual confounding if alignment alters template processing.

Authors: We agree that the DiD estimator is identified under additive separability of template and alignment effects. The controlled testbed validates unbiased recovery when this assumption holds, and the multi-family results show improved cosine similarity to the refusal direction (0.50-0.86) together with behavioral relevance under projection ablation. We did not provide a direct quantification of interaction magnitude on the three real model families. In the revision we will add such an analysis, for example by comparing the DiD contrast against the within-aligned contrast and by inspecting activation residuals for systematic non-additivity patterns across the evaluated families. revision: yes
Referee: [Validation on controlled testbed] While the protocol is validated on a controlled testbed, the abstract does not indicate whether the testbed includes synthetic interaction terms between template and alignment to verify that the DiD estimator remains unbiased when the additivity assumption is violated under realistic conditions.

Authors: The testbed was designed to confirm that the four-variant decomposition recovers known alignment effects when template and alignment contributions are additive. It does not currently include synthetic interaction terms to assess bias under violations of additivity. We will revise the abstract to clarify the testbed's scope and extend the testbed with controlled interaction simulations, allowing direct verification of estimator behavior when the assumption is violated. revision: yes

Circularity Check

0 steps flagged

No circularity: statistical decomposition is self-contained

The paper defines a four-variant activation-difference protocol (naive, template-controlled, within-aligned, DiD) via explicit matrix subtractions on observed activations. The DiD contrast is introduced as a standard econometric decomposition applied to the data; it does not reduce to a fitted parameter, self-citation, or tautological renaming. Reported cosine gains and rank reductions are empirical measurements on three model families, not forced by the protocol equations themselves. No load-bearing self-citations or ansatzes appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the DiD contrast isolates alignment effects once template control is applied; no free parameters, axioms, or invented entities are described in the abstract.

Reference graph

Works this paper leans on

15 extracted references · 13 canonical work pages · 8 internal anchors

[1]

Constitutional AI: Harmlessness from AI Feedback

arXiv preprint arXiv:2212.08073. Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form. In Advances in Neural Information Processing Systems,
[2]

Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Martin Soto, Megha Srivastava, Nathan Labenz, and Owain Evans

arXiv preprint arXiv:2306.03819. Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Martin Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment. Nature, 649:584–589,
[3]

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P

arXiv preprint arXiv:1912.05671. Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems,
[4]

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

arXiv preprint arXiv:1802.10026. Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, et al. Alignment faking in large language models

[5]

Alignment faking in large language models

arXiv preprint arXiv:2412.14093. Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. In International Conference on Learning Representations,
[6]

Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124, 2023

arXiv preprint arXiv:2308.09124. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations,
[7]

LoRA: Low-Rank Adaptation of Large Language Models

arXiv preprint arXiv:2106.09685. Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, et al. Sleeper agents: Training deceptive LLMs that persist through safety training

[8]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

arXiv preprint arXiv:2401.05566. Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip H.S. Torr, Amartya Sanyal, and Puneet K. Dokania. What makes and breaks safety fine-tuning? A mechanistic study. In Advances in Neural Information Processing Systems, volume 37,
[9]

LoRA fine-tuning efficiently undoes safety training in Llama 2-chat 70b. arXiv preprint arXiv:2310.20624, 2023

arXiv preprint arXiv:2310.20624. Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In Conference on Language Modeling (COLM),
[10]

Zoom In: An Introduction to Circuits

doi: 10.23915/distill.00024.001. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35,
[11]

The Linear Representation Hypothesis and the Geometry of Large Language Models

arXiv preprint arXiv:2311.03658. Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to. In International Conference on Learning Representations,
[12]

Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet

Adly Templeton, Tom Conerly, Jonathan Marcus, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steerin...
[13]

Steering Language Models With Activation Engineering

arXiv preprint arXiv:2308.10248. Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, and Johannes Gasteiger. The geometry of refusal in large language models: Concept cones and representational independence. In International Conference on Machine Learning,
[14]

arXiv preprint arXiv:2310.01405. A Effective rank: bootstrap stability and sample-size sweep. We bootstrap-resample the n=200 paired safety samples with replacement (200 resamples per layer) and recompute rank_ε on M_template at ε=0.05. The resampling distribution is tight (±1 to ±2 across all layers and families) and centered at or below the full-sample (without-re...
[15]

(2025), though the per-direction data alone do not exclude alternatives

attributes the k=5 rebound to ablating u1 together with the inert u2 and a weak direction, with full collapse requiring the wider-band block, a redundancy reading consistent with the concept cones of Wollschläger et al. (2025), though the per-direction data alone do not exclude alternatives. Table 9: Llama narrow-band ablation, refusal rate (Wilson 95% CI), nge...
