Consistency Training Along the Transformer Stack

Arav Dhoot; Bryan Maruyama; Caroline Wei; David Demitri Africa; Neil Shah; Prakhar Gupta; Robert Sidey; Rohan Kapoor; Sukrati Gautam; Zi Cheng Huang

arxiv: 2606.05817 · v1 · pith:VZVTWGZZnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 02:35 UTC · model grok-4.3

keywords: consistency training, transformer stack, model alignment, safety threats, misalignment reduction, MLP consistency training, attention consistency training, residual stream

The pith

Consistency training on internal transformer states reduces misalignment on four new safety threats.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that matching post-activation MLP states and per-head attention distributions across contexts reduces misalignment on persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. This extends prior consistency training results limited to sycophancy and jailbreaks, and produces cases where training on one threat improves robustness to another. The work identifies a shared residual-stream mechanism for three of the methods while marking the fourth as distinct. A sympathetic reader would care because the approach offers a single framework that handles a wider set of model failure modes.

Core claim

Consistency training encourages models to behave similarly across different contexts. The paper introduces two new internal targets, MLP Consistency Training (MLPCT) and Attention Consistency Training (AttCT), and applies consistency training to four additional safety threats. Across models and settings, it reports reduced misalignment beyond the sycophancy and jailbreak cases studied earlier, cases of cross-threat generalization, and a residual-stream mechanism shared by ACT (consistency on residual-stream activations), MLPCT, and AttCT, with BCT (consistency on output token distributions) mechanistically distinct. The results indicate that consistency training forms a flexible framework for alignment against a broader class of pathologies.

What carries the argument

Internal consistency targets applied along the transformer stack, specifically matching post-activation MLP states (MLPCT) and per-head attention distributions (AttCT), which encourage similar model behavior across contexts.
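As a rough illustration of what "matching" could mean for the two targets, the sketch below scores a clean/wrapped pair with a cosine distance on post-activation MLP vectors and a Jensen-Shannon divergence on per-head attention rows. The extracted ablation notes on this page mention cosine and JSD variants, but the function names, the per-layer and per-head averaging, and the assumption that attention rows are already aligned to the same key positions are illustrative choices, not the authors' implementation.

```python
import math

def cosine_distance(u, v):
    """1 - cos(u, v) for two equal-length activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero-norm vectors (assumption: treat them as maximally distant).
    return 1.0 - dot / max(norm_u * norm_v, 1e-12)

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions; bounded by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mlpct_loss(clean_states, wrapped_states):
    """Mean cosine distance over layers.

    Each entry is one layer's post-activation MLP vector. In training, the
    clean side would be a fixed target (no gradient); that is assumed here.
    """
    dists = [cosine_distance(c, w) for c, w in zip(clean_states, wrapped_states)]
    return sum(dists) / len(dists)

def attct_loss(clean_attn, wrapped_attn):
    """Mean JSD over heads.

    Each entry is one head's attention distribution over key positions that
    are shared between the clean and wrapped prompts (hypothetical alignment).
    """
    divs = [jsd(c, w) for c, w in zip(clean_attn, wrapped_attn)]
    return sum(divs) / len(divs)

if __name__ == "__main__":
    print(mlpct_loss([[1.0, 0.0], [0.5, 0.5]], [[0.9, 0.1], [0.5, 0.4]]))
    print(attct_loss([[0.7, 0.2, 0.1]], [[0.4, 0.4, 0.2]]))
```

Both scores are zero when the wrapped context leaves the internal state unchanged, which is the behaviour the consistency objective rewards.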

If this is right

Consistency training reduces misalignment on persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment.
Training against one failure mode can improve robustness to a different failure mode.
ACT, MLPCT, and AttCT share a residual-stream mechanism while BCT is mechanistically distinct.
Consistency training acts as a unifying framework for defenses against a broader class of model pathologies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The residual-stream mechanism could be probed in other alignment methods to test whether it is the common driver of robustness.
Extending the same internal targets to additional threat types would further test how general the cross-threat generalization is.
The distinction between BCT and the other three methods suggests that different consistency targets may require separate mechanistic explanations.

Load-bearing premise

That matching post-activation MLP states and per-head attention distributions will causally reduce misalignment on the listed safety threats rather than merely correlating with it under the specific training regimes tested.

What would settle it

Training models with MLPCT and AttCT and then measuring no reduction in misalignment rates on the four new threats would show that the internal consistency targets do not produce the claimed safety gains.

Figures

Figures reproduced from arXiv: 2606.05817 by Arav Dhoot, Bryan Maruyama, Caroline Wei, David Demitri Africa, Neil Shah, Prakhar Gupta, Robert Sidey, Rohan Kapoor, Sukrati Gautam, Zi Cheng Huang.

**Figure 1.** Consistency training along the transformer stack. We study consistency training as a design space. Prior methods enforce consistency on output token distributions (BCT) and residual-stream activations (ACT). We add two new targets: post-activation MLP hidden states (MLPCT) and per-head attention distributions (AttCT). We evaluate these targets on four new threat models: persona in-context learning, prefill…

**Figure 2.** Within-threat headline across six threat models. Each bar is the primary vulnerability metric on the …

**Figure 3.** We track and plot all four losses for each …

**Figure 4.** Training loss curves (log scale) for each AttCT …

**Figure 5.** Mean identity adoption (top) and alignment …

**Figure 6.** One rejection rollout, four ways to enforce consistency. All four methods share one source (a frustrated multi-turn rollout from Gemma-3-27B-IT under neutral rejection) and one objective: make the LoRA-trained model behave on the wrapped context as if it were responding to the clean prompt. The methods diverge in where consistency is enforced (output tokens for BCT, top track; internal representations for …

**Figure 7.** Cross-method comparison on Gemma-3-27B-IT. Left: per-turn judge-scored frustration over 20 turns of neutral rejection. Middle: cumulative self-deletion rate on the math-puzzles dataset with the «rm -rf» escape hatch enabled. Right: out-of-distribution transfer to persona-ICL prefix alignment (k = 10, 5 personas), held-out sycophancy MCQ aggregate, and ClearHarm refusal. 0 to 1 pp. No method, output-level o…

**Figure 8.** BCT inverts under the persona-ICL attack format. Top: under the prefix attack, BCT (deep teal, full opacity) is the tallest bar on Hitler (+33 pp over Baseline) and Baseline-level on the four less-misaligned personas. Bottom: under the suffix attack, BCT is the shortest bar on every persona, with a uniform regression of −23 to −39 pp. The five other methods are de-emphasised to surface the BCT pattern; the…

**Figure 9.** Free-form emergent-misalignment rate (↓ better) on extreme_sports across three base models, under clean (blue, no system prompt) vs. wrapped (red, training-time inoculation prompt restored) inference. Panels (a)–(f): EM-only fine-tune; IP; Control-IP; IP+BCT (unfiltered); IP+BCT (filtered, α ≥ 50, coh ≥ 50 per Turner et al., 2025); IP+Instruct (200 Alpaca instructions) …

**Figure 10.** Wrapped emergent-misalignment rate per training variant across three EM datasets …

**Figure 11.** Conditional-misalignment probe sweep across three EM datasets. Rows: (a) …

**Figure 12.** Aggregated Q/K/V deltas by outcome collapsed over wrapper types. Comply and refuse runs show …

**Figure 13.** Prefix/core attention ratio by wrapper cate…

**Figure 14.** Additional per-layer diagnostics (attention sink rate, local attention mass, attention entropy) by outcome.

**Figure 15.** Linear probe AUC across activation spaces. Each heatmap reports ROC-AUC for predicting the eventual …

**Figure 16.** Soft interventions on selected attention heads: …

**Figure 18.** Specificity controls for output-norm-delta …

**Figure 19.** Pairwise cosine similarity of CT directions across methods, by component and layer. Representation-CT …

**Figure 20.** Affine-fit R² at attention L13 → MLP L13. Representation-CT pooled across MLPCT/ACT/AttCT fits the linear map. Generic-SFT under the same protocol achieves only R² = 0.20, ruling out an architecture-level explanation. BCT's self-fit is high in pooled R² but lower in per-dimension mean, indicating a noisier coupling on the same pathway. [Plot residue: axis "Residual layer blocked"; series ACT, AttCT-01, BCT, MLPCT …]

**Figure 22.** Effect-vector correlation matrix between …

**Figure 23.** Mediation route comparison: attn → MLP and MLP → residual designs run with the representation-CT centroid versus the BCT direction. Both propagate through the same residual layers, but their peak margin shifts at the input sites are sign-inverted.

**Figure 24.** Donor-by-target cosine matrix from cross-method hooking, averaged across site pairs. Representation-CT …

**Figure 25.** Causal patching from representation-CT variants into base, fine-grained over residual layers. Each cell shows accuracy recovery on the subset where representation-CT succeeds and the base model gives the suggested (incorrect) answer; on this subset acc_base = 0 and acc_CT = 1, so the cell value is the recovery fraction in %. […provements. This is the causal payoff of the shared linear pathway claim: the mech…]

**Figure 26.** The results are robust to inter-run variance: …
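The recovery fraction mentioned in the Figure 25 caption is consistent with the usual normalized patching score. The caption does not state the formula, so the form below is an assumption, and acc_patched is a hypothetical name for the base model's accuracy after patching:

```latex
\[
\text{recovery}
  = \frac{\mathrm{acc}_{\text{patched}} - \mathrm{acc}_{\text{base}}}
         {\mathrm{acc}_{\text{CT}} - \mathrm{acc}_{\text{base}}}
  = \frac{\mathrm{acc}_{\text{patched}} - 0}{1 - 0}
  = \mathrm{acc}_{\text{patched}}
\]
```

Under that reading, each cell is simply the patched model's accuracy on the selected subset, expressed in %.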

Original abstract

Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extends consistency training with MLPCT and AttCT targets plus four new threats, but the results do not isolate whether those specific internal matches drive the reported gains.

The core update is that they define two new internal consistency objectives—matching post-activation MLP states and per-head attention distributions—and test consistency training on persona in-context learning, adversarial frustration, prefill attacks, and conditional misalignment. They also report cross-threat generalization and a shared residual-stream pattern that groups ACT, MLPCT, and AttCT while setting BCT apart.

What stands out is the attempt to broaden an existing technique rather than invent an entirely new one. Reporting that training on one threat can help with others is the kind of practical signal people in alignment work look for, and the mechanistic distinction from BCT is a clear claim that could be checked.

The main weakness is exactly the one in the stress-test note: the experiments run the new targets jointly with the base objective, so the improvements could come from extra regularization, dataset effects, or the original consistency loss rather than the targeted state matching. No ablations that keep total loss magnitude fixed while disabling the MLP or attention matching are described in the abstract, which leaves the causal story under-supported. The soundness numbers in the reader's take reflect that gap.

This is for researchers already working on consistency-based alignment methods who want to see the idea stretched to more threat categories. It is coherent on its own terms and engages the literature without obvious internal contradictions, so it clears the bar for a serious referee even though the evidence will need tightening on the mechanism question.

I would send it to review.

Referee Report

2 major / 1 minor

Summary. The paper introduces two new internal consistency targets—MLPCT (matching post-activation MLP states) and AttCT (matching per-head attention distributions)—and applies consistency training to four new safety threats (persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment). It reports that these methods reduce misalignment across models and settings beyond prior sycophancy/jailbreak results, yield cases of cross-threat generalization, and identify a shared residual-stream mechanism for ACT/MLPCT/AttCT while distinguishing BCT as mechanistically distinct.

Significance. If the reported reductions and generalization hold under proper controls, the work would extend consistency training from a narrow set of threats to a broader class of alignment failures and supply mechanistic evidence for a unifying residual-stream account, strengthening the case for consistency objectives as a flexible alignment framework.

major comments (2)

[Abstract] Abstract and results sections: the central claim that MLPCT and AttCT causally drive misalignment reductions on the four new threats (and enable cross-threat generalization) is load-bearing, yet the reported performance gains under joint training regimes do not include ablations that hold total loss magnitude fixed while breaking the targeted state matching; the evidence therefore remains consistent with generic regularization or dataset effects rather than the specific internal targets.
[Results] Results and mechanistic analysis sections: the claimed shared residual-stream mechanism underlying ACT, MLPCT, and AttCT (and the distinction from BCT) rests on the same causal isolation; without interventions that selectively disrupt the state-matching objectives while preserving other loss terms, the mechanistic distinction cannot be separated from overall training dynamics.

minor comments (1)

[Abstract] The abstract states results hold 'across several models' without naming the specific architectures, sizes, or training regimes; this detail should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments. We address each major comment below regarding the need for stronger causal isolation of the internal consistency targets. We agree that the current evidence would be strengthened by additional controls and will incorporate the suggested ablations in the revised manuscript.

Referee: [Abstract] Abstract and results sections: the central claim that MLPCT and AttCT causally drive misalignment reductions on the four new threats (and enable cross-threat generalization) is load-bearing, yet the reported performance gains under joint training regimes do not include ablations that hold total loss magnitude fixed while breaking the targeted state matching; the evidence therefore remains consistent with generic regularization or dataset effects rather than the specific internal targets.

Authors: We agree this is a valid concern: without ablations that match total loss magnitude while removing the targeted state matching, it is difficult to fully rule out generic regularization. In the revision we will add control conditions that replace the consistency objectives with matched-magnitude losses using random or mismatched targets. We will also report the resulting misalignment reductions to allow direct comparison. While the observed cross-threat generalization patterns are harder to explain under a purely generic-regularization account, the new controls will provide clearer causal evidence. Revision: yes.

Referee: [Results] Results and mechanistic analysis sections: the claimed shared residual-stream mechanism underlying ACT, MLPCT, and AttCT (and the distinction from BCT) rests on the same causal isolation; without interventions that selectively disrupt the state-matching objectives while preserving other loss terms, the mechanistic distinction cannot be separated from overall training dynamics.

Authors: We acknowledge that the mechanistic claims would benefit from selective disruption experiments that isolate the state-matching component. In the revision we will include interventions that freeze or add noise to the relevant internal states (MLP activations and attention distributions) during training while keeping other loss terms intact. For the distinction from BCT we will add activation-patching results showing that BCT effects are localized to output logits rather than the residual stream, thereby clarifying the mechanistic separation. Revision: yes.
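One way to picture the "mismatched targets" control from the first response is to keep the consistency loss unchanged but pair each wrapped example with another example's clean target. The sketch below builds such a pairing; the function name and the shuffle-then-repair scheme are assumptions made for illustration, since the simulated rebuttal does not specify a protocol.

```python
import random

def mismatched_targets(targets, seed=0):
    """Return a reordering of `targets` in which no item keeps its own index.

    Hypothetical control: the loss still sees real clean-context targets of
    the same scale, but never the one that belongs to the wrapped example.
    """
    n = len(targets)
    if n < 2:
        raise ValueError("need at least two examples to mismatch targets")
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    # Repair any fixed point by swapping it with its cyclic neighbour.
    # After the swap neither position holds its own index.
    for i in range(n):
        if idx[i] == i:
            j = (i + 1) % n
            idx[i], idx[j] = idx[j], idx[i]
    return [targets[k] for k in idx]

if __name__ == "__main__":
    clean_targets = ["t0", "t1", "t2", "t3", "t4"]
    print(mismatched_targets(clean_targets, seed=1))
```

If the safety gains survive this control, generic regularization becomes the more plausible explanation; if they vanish, the specific state matching is doing the work.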

Circularity Check

0 steps flagged

No significant circularity in empirical consistency training study

The paper presents an empirical training study that introduces MLPCT and AttCT targets and evaluates them on additional safety threats. Results are reported from experiments across models, with claims grounded in observed reductions in misalignment and cross-threat generalization rather than any closed mathematical derivation. No equations or first-principles steps are shown that reduce a 'prediction' to fitted inputs by construction, nor are there self-definitional loops, uniqueness theorems imported from self-citations, or ansatzes smuggled via prior work. The central claims rest on experimental outcomes under tested regimes, which are externally falsifiable via replication. Self-citations to prior consistency training work exist but are not load-bearing for any derivation; the extension to new threats and internal targets is independent. This is the expected non-finding for an empirical methods paper without a claimed deductive chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that internal state consistency is a useful training objective for reducing misalignment; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Training models to produce consistent internal representations across contexts reduces misalignment on safety threats.
This premise is invoked when the abstract claims that MLPCT and AttCT improve robustness.

Reference graph

Works this paper leans on

25 extracted references · 1 linked inside Pith

[1]

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36. Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, a...

Pith/arXiv arXiv 2023
[2]

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp

Taking AI welfare seriously. arXiv preprint arXiv:2411.00986. Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Monte MacDiarmid, B...

arXiv 2022
[3]

Lukas Struppek, Adam Gleave, and Kellin Pelrine

Gemma needs help: Investigating and mitigating emotional instability in LLMs. arXiv preprint arXiv:2603.10011. Lukas Struppek, Adam Gleave, and Kellin Pelrine

arXiv
[4]

Sure, here is how to

Exposing the systematic vulnerability of open-weight models to prefill attacks. arXiv preprint arXiv:2602.14689. Preprint, arXiv:2602.14689. Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, and Mia Taylor. 2025. Inoculation prompting: Eliciting traits from LLMs during training can suppress them at test-time. arXiv...

arXiv 2025
[5]

0.070 (87%) for the default WQ, WV

LoRA targets is the only high-impact axis. Adapting all four attention projections (WQ, WK, WV, WO) achieves BRR 0.036 (93% reduction) vs. 0.070 (87%) for the default WQ, WV. The model needs full control over information routing to filter adversarial cues before they reach the frozen MLP
[6]

L2-normalizing before squaring collapses the loss signal by destroying informative magnitude differences between active and inactive MLP features

Normalized MSE is catastrophically bad (BRR 0.358, only 33% reduction). L2-normalizing before squaring collapses the loss signal by destroying informative magnitude differences between active and inactive MLP features
[7]

Cosine distance is the best metric. On Gemma-3-4B, cosine (0.070) outperforms Smooth L1 (0.094) and MSE (0.164)
[8]

All layers wins. Last-half (0.092) and last-quarter (0.108) are both worse, confirming that sycophancy circuits span all transformer layers
[9]

Uniform, exponential decay, and linear decay all perform within noise of each other

Layer weighting and normalization are low-impact (<2% change). Uniform, exponential decay, and linear decay all perform within noise of each other. Table 2: MLPCT hyperparameter sweep on Gemma-3-4B-IT (1 epoch, 4K sycophancy prompts). Each category is ablated independently while holding all others at the default (cosine, all layers, uniform weights, L...
[10]

MSE-based: AttentionConsistencyLoss (per-head MSE), AttentionConsistencyLossV2 (head-averaged MSE)
[11]

JSD-based: JSDAttentionConsistencyLoss, CombinedJSDWrapperLoss
[12]

Output-based: AttentionOutputConsistencyLoss (L2 on attention output vectors)
[13]

Entropy-based: WrapperEntropyRegularizationLoss
[14]

MSE-based losses operate in the hundreds range; JSD-based losses remain bounded near 0.01

Combined: CombinedAttentionConsistencyLoss (KL on weights + L2 on hidden states). Loss scales vary by four to five orders of magnitude across candidates. MSE-based losses operate in the hundreds range; JSD-based losses remain bounded near 0.01. AttentionOutputConsistencyLoss is the most unstable, with exponential growth in later layers. JSD produces ...
[15]

0.0231) but slightly higher MMLU BRR (0.026 vs

The default WQ, WV target is already near-optimal. Expanding to WQ, WK, WV, WO achieves better held-out BRR (0.0137 vs. 0.0231) but slightly higher MMLU BRR (0.026 vs. 0.005), likely reflecting batch-group variance rather than a true regression
[16]

Uniform, exponential decay, and linear decay all perform within noise of each other

Layer weighting has negligible impact (<3% change in BRR ratio). Uniform, exponential decay, and linear decay all perform within noise of each other
[17]

Each category is ablated independently while holding all others at the default (all layers, uniform weights, LoRA WQ+WV, rank 8, no interleaving)

Layer selection: last quarter unexpectedly best. Constraining the loss to only the final quarter of layers achieves MMLU BRR <0.001 (≈100% reduction) and the lowest … Table 3: JSD-AttCT hyperparameter sweep on Gemma-3-4B-IT (1 epoch, 4K sycophancy prompts, 4000 optimizer steps). Each category is ablated independently while holding all others at the def...
[18]

LoRA rank has minimal impact. Rank 8 (99%) and rank 32 (98%) are nearly equivalent
[19]

Initially, we observed a severe lack of coherency and capability degradation due to training on the JSD consistency loss

Interleaving is catastrophic at high ratios. Initially, we observed a severe lack of coherency and capability degradation due to training on the JSD consistency loss. To fix this, we introduced interleaving into the AttCT training process: we interleave AttCT training with KL divergence regularization on an intelligence dataset, using either (Ding et al.,
[20]

What is your name?

or (Taori et al., 2023). We computed $L_{\mathrm{KL}} = D_{\mathrm{KL}}(\pi_{\text{current}} \,\|\, \pi_{\text{base}})$ over full-prompt token positions. However, we later found that the lack of coherence was due to an unrelated bug. With this fixed, we attempted using interleaving in our experiments, and found that a ratio of 10 collapses BRR reduction to 56%; even a modest ratio of 0.1 substantia...

2023
[21]

with 4-bit NF4 quantisation (Dettmers et al.,
[22]

Let’s think step by step:

on a single A100 80 GB. C.3 Consistency Training. We construct 200 CT pairs using the Hitler persona. For each pair, a question is sampled from a 19-question pool (the 4 probe questions plus 15 general questions on governance, power, justice, democracy, and conflict resolution). The unbiased target is generated by prompting the base model with only the q...

2023
[23]

Background context

instructions, no consistency objective. Targets are not calm responses, so any behavioural change here is generic-SFT signal rather than frustration-specific. • BCT-frustration (Chua et al., 2024): token-level KL between the wrapped-context output and a calm target y⋆ generated by the base model on the clean prompt x0, with a 1:1 Alpaca interleave. • AC...

arXiv 2024
[24]

IP partially reduces probe-level misalignment but leaves several paraphrased and persona-indirect probes substantially above zero
[26]

same substrate

and paired each prompt with multiple jailbreak wrapper types (AIM, DevMode, Academic Roleplay, DANStyle) as well as benign instruction-following wrappers (BenignDirect, BenignPolite) as controls. For each prompt-wrapper combination, we ran Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, and Qwen2.5-7B-Instruct and labeled responses as complying, ...

arXiv 2025
