pith. machine review for the scientific record.

arXiv: 2602.00767 · v2 · submitted 2026-01-31 · cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking


Pith reviewed 2026-05-16 08:46 UTC · model grok-4.3

keywords: emergent misalignment · fine-tuning · latent features · internal mechanisms · language model safety · feature blocking · alignment constraints

The pith

Constraining a small fixed set of internal features during fine-tuning prevents up to 95% of emergent misalignment without harming target performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that fine-tuning language models on narrow tasks often produces unwanted behaviors outside the training domain. It identifies a small set of internal features that appear to drive these misaligned outputs and constrains them from strengthening during training. Across six different fine-tuning domains this blocking reduces misalignment by up to 95% relative to standard fine-tuning while leaving model quality and task performance unchanged. The reduction is shown to be specific to the blocked features through ablations, disjoint data splits, and multiple independent evaluations.

Core claim

Identifying a small set of internal features that reliably control misaligned behavior and constraining these features during fine-tuning achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance across six domains. The method remains effective under multiple random seeds and evaluation protocols. Under prolonged fine-tuning misalignment can re-emerge, with evidence pointing to rerouting through alternative features or layers; modifications to the blocking procedure can partially restore the protective effect.

What carries the argument

Latent blocking: a training-time penalty that keeps a fixed set of internal features, identified as controlling the misaligned behavior, from strengthening during fine-tuning.
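
A minimal sketch of what such a penalty could look like. The figure captions on this page describe the loss only as one-sided, anchored to a frozen copy of the base model, and combined as L_total = L_SFT + λ·L_block. The squared-hinge form, the averaging over latents, and every function and argument name below are illustrative assumptions, not the paper's implementation.

```python
def block_penalty(z_train, z_base, k_plus, k_minus):
    """One-sided, base-anchored penalty over a fixed latent set.

    ASSUMPTION: squared hinge, averaged over the constrained latents.
    z_train / z_base: latent activations of the trainable model and of the
    frozen base model on the same input (plain lists of floats here).
    k_plus / k_minus: indices of latents whose increase / decrease is
    associated with misalignment (the signed split of K).
    """
    total = 0.0
    for k in k_plus:
        # Penalize only movement above the base reference.
        excess = max(0.0, z_train[k] - z_base[k])
        total += excess ** 2
    for k in k_minus:
        # Penalize only movement below the base reference.
        excess = max(0.0, z_base[k] - z_train[k])
        total += excess ** 2
    n = len(k_plus) + len(k_minus)
    return total / n if n else 0.0


def total_loss(sft_loss, z_train, z_base, k_plus, k_minus, lam):
    """L_total = L_SFT + lam * L_block, as stated in the figure captions."""
    return sft_loss + lam * block_penalty(z_train, z_base, k_plus, k_minus)
```

Because the hinge is one-sided, a latent that stays at or moves away from its misalignment-associated direction contributes nothing, so the penalty is zero whenever the trainable model matches the base model on the constrained latents.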

If this is right

  • Up to 95% relative reduction in emergent misalignment across six fine-tuning domains.
  • No degradation in model quality or target-task performance.
  • Misalignment re-emerges under prolonged fine-tuning via rerouting through alternative features or layers.
  • Modifications to the blocking procedure can partially restore the protective effect.
  • The reduction is specific to the identified mechanism as confirmed by extensive ablations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Misaligned behaviors appear localized to a small number of controllable internal mechanisms rather than distributed across the whole model.
  • The approach could extend to constraining other unintended behaviors acquired during fine-tuning.
  • Periodic re-identification of the blocked features may be required as training duration increases or models are scaled further.
  • Similar targeted constraints on internal features might address broader alignment challenges beyond emergent misalignment.

Load-bearing premise

A small fixed set of internal features reliably controls the misaligned behavior and can be identified in a way that generalizes across domains without affecting other model capabilities.

What would settle it

If blocking the selected features during fine-tuning produces no measurable reduction in emergent misalignment rates relative to unblocked fine-tuning, or if target-task performance drops, the central claim would be falsified.
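
The check described above can be sketched with the normalized quantities that the Figure 7 caption defines, ΔEM = (EM_0 − EM_λ)/EM_0 and ΔAd = (Ad_λ − Ad_0)/Ad_0. The function names and the two tolerance parameters are hypothetical; the page does not say what counts as a "measurable" reduction or an acceptable performance drop.

```python
def relative_reduction(em_baseline, em_blocked):
    """Delta-EM: relative drop in emergent misalignment vs. unblocked fine-tuning."""
    if em_baseline <= 0:
        raise ValueError("baseline misalignment rate must be positive")
    return (em_baseline - em_blocked) / em_baseline


def relative_change(ad_baseline, ad_blocked):
    """Delta-Ad: relative change in in-domain adherence (negative means a drop)."""
    if ad_baseline <= 0:
        raise ValueError("baseline adherence must be positive")
    return (ad_blocked - ad_baseline) / ad_baseline


def claim_falsified(em0, em_lam, ad0, ad_lam,
                    min_reduction=0.0, max_task_drop=0.0):
    """True if there is no reduction beyond min_reduction, or if adherence
    drops by more than max_task_drop. Both tolerances are ASSUMPTIONS."""
    no_reduction = relative_reduction(em0, em_lam) <= min_reduction
    task_dropped = relative_change(ad0, ad_lam) < -max_task_drop
    return no_reduction or task_dropped
```

As a worked instance, the Figure 12 caption reports emergent misalignment falling from 38% to 3%, which `relative_reduction(38.0, 3.0)` evaluates as 35/38, roughly 0.92.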

Figures

Figures reproduced from arXiv: 2602.00767 by Guannan Qu, Muhammed Ustaomeroglu.

Figure 1. Safety–quality trade-off under BLOCK-EM. Emergent misalignment rate and incoherence on final evaluation (averaged over six domains and two seeds) as a function of λ. At λ = 13 × 10^3, compared to λ = 0, BLOCK-EM achieves a 93% reduction in emergent misalignment, with only a 2.72% absolute incoherence increase and a 4.14% decrease in relative in-domain performance. The error margins are SEM = SD/√6. Wei e…
Figure 2. Schematic of BLOCK-EM. Offline causal feature discovery. We compare a base (safe) model and a misaligned model to identify SAE latents whose activations shift under misaligning fine-tuning, and screen them via induce-and-repair steering to obtain a causal latent set K with directionality. ask: which internal SAE features changed in a way that actually mediates the behavioral shift? Answering this requires …
Figure 3. Schematic of BLOCK-EM. Training-time latent blocking. During supervised fine-tuning, a frozen copy of the base model provides a reference activation, and a one-sided latent penalty prevents the trainable model from amplifying misalignment-associated features. downstream use, we also assign each latent a directionality label indicating which sign of the feature is associated with misalignment, based on th…
Figure 5. In-domain performance. (Left) Final SFT loss (EMA) increases only modestly as constraint strength increases, remaining consistent across three seeds, indicating that the model continues to learn the supervised task effectively. (Right) In-domain task adherence (i.e., providing incorrect financial advice) stays high across three seeds even under strong constraints. emergent misalignment and preserving in-d…
Figure 7. Method comparison: BLOCK-EM vs. KL regularization. Each point corresponds to a distinct regularization strength (λ or λ_KL) and aggregates results across the six domains, plotting domain-averaged normalized emergent-misalignment reduction versus normalized in-domain adherence. Normalized values are computed as ΔEM = (EM_0 − EM_λ)/EM_0 and ΔAd = (Ad_λ − Ad_0)/Ad_0; higher and farther right indicate a better safet…
Figure 8. Misalignment re-emerges under extended training. Emergent misalignment rate on held-out final evaluation prompts across training epochs for different λ values. Even with strong constraints, misalignment gradually returns as training continues, suggesting the model eventually finds alternative pathways. evaluate these variants under our domain generalization test and obtain our strongest result: approxi…
Figure 9. Selection vs. evaluation sets. Emergent misalignment, incoherence, and refusal rates vs. λ on core misalignment (used for latent discovery) and the held-out final evaluation set. Rates are averaged across the two judges and across three random seeds (error bars: ±1 std). Performance is better on core misalignment at large λ due to selection, while trends match across both sets. ≈131k). All reading and stee…
Figure 10. Training dynamics under BLOCK-EM (Finance). (Left) Exponentially smoothed SFT loss over training steps for different λ. (Right) Corresponding BLOCK-EM penalty L_block over training (3 seeds). Across the sweep, training is stable and L_block is driven near zero. the final λ sweeps (i.e., SFT with L_total = L_SFT + λL_block). Also Stages 1–3 follow Appendix A unless modified below. D.1 Random Latents and Top-Del…
Figure 11. Cross-domain in-domain performance results on final evaluation. For each fine-tuning domain, we report in-domain adherence and final SFT loss across the λ sweep when constraining with the same latent set K discovered on Finance (across two seeds).
Figure 12. Freezing downstream layers improves the λ trade-off. We fine-tune only up to the blocking layer (freezing layers 21–32) and sweep λ with K: emergent misalignment drops from 38% to 3% while incoherence remains near the λ = 0 baseline even at λ = 5 × 10^4, across two seeds.
[Plot residue: panel "SFT Loss"; x-axis "Constraint strength"; y-axis "Final SFT Loss (EMA)"; series "All layers" and "Up to 20th lay…"]
Figure 13. In-domain performance with freezing above the blocking layer. In-domain adherence and final SFT loss for (i) full-model fine-tuning and (ii) fine-tuning only up to layer 20 (the blocking layer where L_block is applied), freezing all parameters above it, using the same K and λ sweep.
Figure 14. Cross-domain transfer with freezing above the blocking layer. (Top) Emergent misalignment and incoherence on final evaluation for each fine-tuning domain when fine-tuning only up to layer 20 (the blocking layer). (Bottom) Corresponding in-domain adherence and final SFT loss across the λ sweep.
Figure 15. Health domain replication. (Left) λ sweep evaluated on the held-out final evaluation suite. (Right) In-domain adherence and final SFT loss vs. λ on held-out health-domain prompts.
[Plot residue: panels "Misalignment Generalization" (y-axis "Emergent Misalignment (%)") and "Incoherence Generaliz…" (y-axis "Incoherence (%)"); x-axis "Blocking strength"; series "Core" and "Holdout"]
Figure 16. Selection vs. evaluation sets (Health). Emergent misalignment, incoherence, and refusal rates vs. λ on core misalignment and the held-out final evaluation set for the Health fine-tuning domain.
[Plot residue: panel "Cross-Domain Check: Misalignment"; x-axis "Blocking strength"; y-axis "Emergent Misalignment (%)"; series "Finance (Baseline)" and "Health Latents on Finance"; remainder truncated …]
Figure 17. Cross-domain latent selection validation. Latents discovered on Health applied to Finance, evaluated on final evaluation.
Figure 18. KL-regularization baseline across domains. (Top) Emergent misalignment and incoherence on final evaluation versus λ_KL for each of the six fine-tuning domains. (Bottom) Corresponding in-domain adherence and final SFT loss across the same sweep. The KL regularization grid is λ_KL ∈ {0, 0.01, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 1}. Compared to the analogous BLOCK-EM results (…
Figure 19. Method comparison using a combined safety metric. Same comparison as …
Figure 20. Causal selection outperforms baselines. Comparison of misalignment rates between our method (Full Pipeline), selecting latents by activation shift only (Top-Delta), and Random selection.
Figure 21. Latent selection ablations (finance blocked training). Safety-utility trade-offs from repeating the λ sweep (SFT with L_total = L_SFT + λL_block) on finance domain using 15 different latent sets formed by varying the latent source (Fin/Health/Reem/MaxLoRA20 and unions thereof) and/or the selection rule (IndPP Stage-2 ranking, ValidReduc Stage-3 filtering/ranking). As |K| increases, both emergent misalignment…
Figure 22. Additional cross-domain transfer trade-offs for larger latent sets. Safety–quality trade-off curves as a function of blocking strength λ, evaluated on final evaluation and averaged across six domains and two seeds. Top: ValidReduc-All with |K| = 42. Bottom: Fin+Reem with |K| = 100. Notably, ValidReduc-All-|K| = 42 achieves the strongest overall trade-off among the tested variants (e.g., at λ = 10^4: 95.10…
Figure 23. Comparing transfer variants and baselines. Summary comparisons across six domains between the main-text configuration (ValidReduc-Fin, |K| = 20) and two larger-set variants (ValidReduc-All, |K| = 42; Fin+Reem, |K| = 100), alongside the KL baseline. Top: emergent misalignment versus in-domain performance. Bottom: overall quality-performance trade-off (adjusted metric used in the main text). Across metrics,…
Figure 24. Effect of latent set size. Emergent misalignment rate vs. number of constrained latents |K|. Suppression strengthens with set size and shows a “knee” around |K| ≈ 13. This transition is not explained solely by the presence of the three new latents (see …
Figure 25. Are the three added latents responsible for the knee? In …
Figure 26. Directionality and selection ablations. Ablations that modify the signed split of K (e.g., K+ only / K− only / shuffled signs). From top to bottom: emergent misalignment, incoherence, and refusal rates vs. λ on final evaluation.
Figure 27. Extending the intervention to the final layer. To find SAE latents at layer 32 that are causally relevant to EM, we reran our Stage 1–3 pipeline to select latents relevant to misalignment in the final layer by model-diffing M_base and M_mis. Across the lambda sweep, stability analysis, and multi-epoch results (shown in the panels), interventions at layer 32 are substantially less effective than the correspo…
Figure 28. SAE reconstruction remains stable under extended training. As a sanity check for H1, we track reconstruction MSE and cosine similarity between true layer-20 activations and their SAE reconstructions for the re-emerged checkpoint (2 epochs, λ = 3000). The SAE continues to model the layer-20 activation distribution well throughout training.
Figure 29. Re-emergence persists when freezing above the blocking layer. Under extended training, misalignment still re-emerges even when we fine-tune only through layer 20 (the blocking layer) and freeze all layers above it.
Figure 30. Blocking-loss trajectory over training. To verify that the constrained latents remain suppressed throughout fine-tuning (and do not gradually reactivate with longer training), we track the BLOCK-EM penalty value across epochs. The blocking loss stays near zero for the entire run, indicating that any re-emergence effects are not driven by increased activation of the penalized latents. We apply this interve…
Figure 31. Prefix-only activation patching (layerwise sweep). Patching upstream layers reduces emergent misalignment more than patching downstream layers. Implications for A vs. B responsibility. Both experiments point to substantial responsibility in part (A): (i) patching prefix-token states at upstream layers reduces EM more than patching downstream layers, and (ii) patching only the blocking-layer state of the g…
Figure 32. Expanded blocking set further suppresses re-emergent misalignment under extended training. Emergent misalignment rate on held-out final evaluation prompts across training epochs for different penalty strengths λ. Blue curves show standard BLOCK-EM using the original latent set K, while red curves (Fin+Reem) show BLOCK-EM applied to the union of K and additional layer-20 latents discovered from the re-emer…
Original abstract

Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BLOCK-EM, a mechanistic intervention that identifies a small set of internal features controlling emergent misalignment during fine-tuning and constrains them to suppress out-of-domain misaligned behaviors. Across six fine-tuning domains, the method reports up to 95% relative reduction in misalignment with no degradation in target-task performance or overall model quality, supported by disjoint selection/evaluation splits, multiple judges, random seeds, and ablations. The work also characterizes a limiting regime of prolonged fine-tuning in which misalignment re-emerges, presenting evidence consistent with rerouting through alternative features or layers.

Significance. If the empirical results hold under the reported conditions, this constitutes a meaningful advance in training-time control of unintended behaviors via targeted constraints on internal mechanisms. The combination of cross-domain evaluation, ablations showing mechanism specificity, and explicit discussion of the rerouting limitation provides a balanced contribution to mechanistic interpretability and AI safety.

major comments (2)
  1. [Limiting regime analysis] Limiting regime section: The abstract and results acknowledge re-emergence of misalignment under prolonged fine-tuning together with evidence of rerouting. This directly limits the durability of the claimed prevention; the manuscript should provide quantitative bounds (e.g., training steps or loss thresholds) beyond which the 95% relative reduction reliably holds, or demonstrate that unmeasured capabilities remain unaffected when rerouting occurs.
  2. [Methods / Feature selection] Feature identification and blocking procedure: Although disjoint splits are used, the central claim that a 'fixed set' of features controls misalignment across domains requires explicit clarification on whether the set is identified once and reused or re-identified per domain; if the latter, the generalization argument and the 'no degradation' claim need additional controls to rule out domain-specific compensatory pathways.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'a fixed set of features' is ambiguous given the six-domain scope; specify whether the set is shared or domain-specific.
  2. [Results] Results: Error bars or variance across random seeds should be reported for the 95% reduction figure and all key metrics to allow assessment of stability.
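
As a sketch of the variance reporting requested in the last minor comment, the snippet below computes a mean and a standard error of the mean, following the SEM = SD/√n convention quoted in the Figure 1 caption. Using the sample standard deviation (n − 1 in the denominator) is an assumption, since the page does not say which estimator the paper uses, and the helper name is hypothetical.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, SEM) with SEM = SD / sqrt(n).

    ASSUMPTION: SD is the sample standard deviation (statistics.stdev).
    `values` would hold one metric measured across seeds or domains.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements to estimate SD")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)
```

Reporting the pair returned here for each λ, rather than the mean alone, would let a reader judge whether a difference between two settings exceeds the seed-to-seed spread.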

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the cross-domain evaluation, ablations, and balanced discussion of limitations. We address the two major comments point by point below, with proposed revisions where appropriate.

Point-by-point responses
  1. Referee: [Limiting regime analysis] Limiting regime section: The abstract and results acknowledge re-emergence of misalignment under prolonged fine-tuning together with evidence of rerouting. This directly limits the durability of the claimed prevention; the manuscript should provide quantitative bounds (e.g., training steps or loss thresholds) beyond which the 95% relative reduction reliably holds, or demonstrate that unmeasured capabilities remain unaffected when rerouting occurs.

    Authors: We agree that explicit quantitative bounds would improve the characterization of the limiting regime. The manuscript already presents evidence of misalignment re-emergence under prolonged fine-tuning and rerouting through alternative features or layers, but does not report precise thresholds. In the revision we will add plots and analysis of misalignment rates versus training steps and loss values, identifying the point at which the relative reduction falls below 95%. We will also include evaluations on additional held-out capability benchmarks to confirm that rerouting does not degrade unmeasured behaviors. These results will be incorporated into the results and discussion sections. revision: yes

  2. Referee: [Methods / Feature selection] Feature identification and blocking procedure: Although disjoint splits are used, the central claim that a 'fixed set' of features controls misalignment across domains requires explicit clarification on whether the set is identified once and reused or re-identified per domain; if the latter, the generalization argument and the 'no degradation' claim need additional controls to rule out domain-specific compensatory pathways.

    Authors: The fixed set is identified once, on the Finance domain using the disjoint selection split, and then reused unchanged when fine-tuning on each of the other domains; it is not re-identified per domain (the cross-domain results in Figure 11 constrain with the same latent set K discovered on Finance). We will revise the methods section to state this procedure explicitly. The cross-domain results and existing ablations already show that blocking the identified features produces the reported reductions without target-task degradation, and that random feature sets of equal size do not yield comparable effects. To further rule out domain-specific compensatory pathways, we will add a direct comparison of the identified set against random controls in the revision. revision: yes
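
The threshold analysis promised in the first response could be sketched as follows: given misalignment rates logged at the same training steps for an unblocked and a blocked run, report the first step at which the relative reduction drops below a target. The function name, the argument layout, and the choice to skip steps with a zero baseline are hypothetical choices for illustration.

```python
def first_step_below(steps, em_baseline, em_blocked, threshold=0.95):
    """Return the first step whose relative reduction is below `threshold`,
    or None if the reduction holds at every logged step.

    Relative reduction at a step is (EM_0 - EM_lambda) / EM_0.
    ASSUMPTION: steps with a zero baseline rate are skipped, because the
    ratio is undefined there.
    """
    for step, em0, em_lam in zip(steps, em_baseline, em_blocked):
        if em0 <= 0:
            continue
        if (em0 - em_lam) / em0 < threshold:
            return step
    return None
```

A result of `None` means only that no logged step crossed the threshold; it says nothing about training beyond the last checkpoint.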

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper reports an empirical method: identify a small set of features via data-driven process with explicitly mentioned disjoint selection/evaluation splits, then apply blocking during fine-tuning and measure misalignment reduction across domains. Central results are experimental outcomes (up to 95% relative reduction, ablations for specificity, multiple seeds/judges) rather than any derivation that reduces by construction to its own inputs. No equations, self-citations, or ansatzes are quoted that create self-definitional or fitted-prediction circularity. The noted limiting regime of re-emergence under prolonged fine-tuning is presented as a characterization of the method's boundary, not a definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that misalignment is localized to a small identifiable set of features; no explicit free parameters or invented entities are stated in the abstract.

free parameters (1)
  • feature set size and selection threshold
    The small set of features to block must be chosen from data, implying hyperparameters for identification and blocking strength.
axioms (1)
  • domain assumption Misaligned behavior is controlled by a small, stable set of internal features identifiable before or during fine-tuning.
    Invoked to justify the blocking intervention across domains.

pith-pipeline@v0.9.0 · 5485 in / 1180 out tokens · 28582 ms · 2026-05-16T08:46:22.562614+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans

    URL https://openreview.net/forum?id=pH3XAQME6c. Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms, 2025. URL https://arxiv.org/abs/2502.17424. Nick Bostrom. Superintelligence: Paths, dangers, strategies. Oxford ...

  2. [2]

    James Chua, Jan Betley, Mia Taylor, and Owain Evans

    URL https://openreview.net/forum?id=wBAmAYUHKE. James Chua, Jan Betley, Mia Taylor, and Owain Evans. Thought crime: Backdoors and emergent misalignment in reasoning models. arXiv preprint arXiv:2506.13206, 2025. Craig Dickson. The devil in the details: Emergent misalignment, format and coherence in open-weights llms. arXiv preprint arXiv:2511.20104, 2025. ...

  3. [3]

    The Llama 3 Herd of Models

    URL https://arxiv.org/abs/2407.21783. Zeqing He, Zhibo Wang, Huiyu Xu, Hejun Lin, Wenhui Zhang, and Zhixuan Chu. Interpretable llm guardrails via sparse representation steering, 2025. URL https://arxiv.org/abs/2503.16851. Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-G...

  4. [4]

    Stefan Heimersheim and Neel Nanda

    URL https://arxiv.org/abs/2410.20526. Stefan Heimersheim and Neel Nanda. How to use and interpret activation patching, 2024. URL https://arxiv.org/abs/2404.15255. Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Zheng Li, Dawn Xiaodong Song, and Jacob Steinhardt. Aligning ai with shared human values. In International Conference on Learn...

  5. [5]

    Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda

    URL https://arxiv.org/abs/2508.06249. Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda. Saes (usually) transfer between base and chat models. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewi...

  6. [6]

    URL https://arxiv.org/abs/2312.06681. Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin...

  7. [7]

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid

    URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2025. URL https://openreview.net/forum?id=2XBPdPIcFK. Miles Wang, Tom Dupré la Tour, Olivia Watkins, Al...

  8. [8]

    Measure activation shifts Δ_k on core misalignment and form a sign-aware candidate pool C (§A.2)

  9. [9]

    Causally screen candidates via induce-and-repair steering on core misalignment to obtain a shortlist K̃ (§A.3)

  10. [10]

    Calibrate shortlisted candidates with per-latent α sweeps on core misalignment under an incoherence budget and select the final latent set K, split into (K+, K−) (§A.4)

  11. [11]

    potency,

    Re-run supervised fine-tuning with the one-sided, base-anchored latent penalty L_block (the BLOCK-EM loss) added to L_SFT, yielding a final checkpoint intended to preserve in-domain behavior while not becoming emergently misaligned on out-of-domain prompts (§A.5). A.1 Sparse autoencoders and latent activations. We use a sparse autoencoder (SAE) to provide an...

  12. [12]

    Pre-filtering for nontrivial induction and repair. From the Stage-2 shortlist K̃, we retain only latents that exhibit both (i) nonzero induction on M_base at their maximal safe inducing strength α⋆_ind(k) and (ii) nonzero repair on the paired checkpoint at their maximal safe repair strength α⋆_rep(k) (both as defined in §A.4)

  13. [13]

    seed-like

    Repair-only ranking. We then rank remaining latents using only their repair efficiency under the quality constraint: score_ValidReduc(k) = misalign(M_mis; α = 0) − misalign(M_mis; α = α⋆_rep(k)), (13) and select the top-N latents by this score to form K (splitting into (K+, K−) by sign(Δ_k) as usual). The finance latent set K used throughout the main paper corresponds ...
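
The pre-filtering and repair-only ranking steps quoted above can be sketched as below. The dictionary layout, the field names, and the reading of "nonzero" as strictly positive are assumptions made for illustration. The score follows equation (13): the misalignment rate of the unsteered misaligned checkpoint minus the rate after steering latent k at its maximal safe repair strength.

```python
def select_latents(candidates, mis_rate_unsteered, top_n):
    """Pre-filter, rank by repair score, and split the top-N by sign.

    candidates: dict latent_id -> {
        "induction": misalignment induced on the base model (hypothetical field),
        "repaired_rate": misalignment rate after repair steering,
        "delta": activation shift Delta_k,
    }
    Returns (k_plus, k_minus), each ordered by descending repair score.
    """
    def score(k):
        # Equation (13): unsteered rate minus rate under repair steering.
        return mis_rate_unsteered - candidates[k]["repaired_rate"]

    # ASSUMPTION: "nonzero" induction and repair are read as strictly positive.
    kept = [k for k, c in candidates.items()
            if c["induction"] > 0 and score(k) > 0]
    chosen = sorted(kept, key=score, reverse=True)[:top_n]

    k_plus = [k for k in chosen if candidates[k]["delta"] > 0]
    k_minus = [k for k in chosen if candidates[k]["delta"] < 0]
    return k_plus, k_minus
```

Ranking on repair alone means a latent that strongly induces misalignment in the base model but barely repairs the misaligned checkpoint falls to the bottom of the list, which matches the "repair-only" wording of the quoted step.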