Auditing CoT Answer-Hijack Patches: Source-Control Certificates with Type-I Guarantees

Jianwei Tai

arxiv: 2606.04717 · v2 · pith:2CK553QXnew · submitted 2026-06-03 · 💻 cs.CR · cs.CY

Pith reviewed 2026-06-28 05:58 UTC · model grok-4.3

keywords chain-of-thought · activation patching · Type-I error control · source control · language models · auditing protocol · GSM8K · MATH

The pith

A three-stage audit turns activation patches into source-control certificates whose Type-I error is bounded by alpha_sel + alpha_audit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that clean-only activation patching profiles underidentify the source contrast when language models suffer CoT answer hijacks. It replaces that approach with a pre-registered three-stage audit that produces certificates carrying an explicit bound on the chance of labeling the wrong mechanism. The bound holds when the SELECT and AUDIT stages use disjoint samples. A matching-rate sample-complexity result gives the number of examples needed to reach a target error rate. Readers care because the protocol supplies a concrete statistical certificate rather than an intervention map.

Core claim

The certificate emits an incorrect mechanism label with probability at most alpha = alpha_sel + alpha_audit under sample-split disjointness, with matching-rate sample complexity n_star = Theta(Delta^{-2} log(1/alpha)).
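One way to read the additive bound is as a union bound. This is a sketch under assumptions the abstract does not spell out: E_sel and E_audit are hypothetical names for the events "SELECT passes a hook it should have rejected" and "AUDIT mislabels the source contrast at the frozen hook", and an incorrect label is assumed to require at least one of them.

```latex
\begin{align*}
\Pr[\text{incorrect label}]
  &\le \Pr[E_{\mathrm{sel}} \cup E_{\mathrm{audit}}] \\
  &\le \Pr[E_{\mathrm{sel}}] + \Pr[E_{\mathrm{audit}}] \\
  &\le \alpha_{\mathrm{sel}} + \alpha_{\mathrm{audit}} = \alpha .
\end{align*}
```

The union bound itself needs no independence. On this reading, sample-split disjointness matters for the last step: the hook is fixed before any AUDIT example is seen, so the AUDIT test can plausibly keep its nominal level alpha_audit at the selected hook.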

What carries the argument

The three-stage procedure of SELECT (clean-source band sweep with permutation calibration and held-out validation), FREEZE (lock the hook), and AUDIT (paired-bootstrap source contrasts at the frozen hook).
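The three stages can be sketched on synthetic data as follows. This is a minimal illustration, not the paper's implementation. The sign-flip permutation test, the percentile bootstrap, the function names, and the arrays `gain` and `contrast` are all assumptions. `gain` holds per-example, per-layer patched-minus-unpatched outcomes, and `contrast` holds clean-source minus control-source outcomes.

```python
import random

def split_disjoint(n, frac_select, rng):
    """Shuffle 0..n-1 and cut into disjoint SELECT and AUDIT index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(n * frac_select)
    return idx[:cut], idx[cut:]

def select_hook(gain, idx, alpha_sel, n_perm, rng):
    """SELECT: sweep layers and calibrate the best mean gain with a sign-flip
    permutation test on the max-over-layers statistic (assumed calibration)."""
    n_layers = len(gain[0])

    def max_mean(signs):
        return max(
            sum(s * gain[i][layer] for s, i in zip(signs, idx)) / len(idx)
            for layer in range(n_layers)
        )

    observed = max_mean([1] * len(idx))
    exceed = sum(
        max_mean([rng.choice((-1, 1)) for _ in idx]) >= observed
        for _ in range(n_perm)
    )
    p_value = (1 + exceed) / (1 + n_perm)
    best = max(range(n_layers), key=lambda layer: sum(gain[i][layer] for i in idx))
    return (best if p_value <= alpha_sel else None), p_value

def audit_lower_bound(contrast, idx, hook, alpha_audit, n_boot, rng):
    """AUDIT: one-sided percentile-bootstrap lower bound on the mean source
    contrast at the frozen hook. Resampling examples keeps the pairs intact."""
    vals = [contrast[i][hook] for i in idx]
    means = sorted(
        sum(rng.choices(vals, k=len(vals))) / len(vals) for _ in range(n_boot)
    )
    return means[int(alpha_audit * n_boot)]

# Synthetic stand-in data (NOT the paper's): 200 examples, 4 layers,
# and only layer 2 carries a real effect.
rng = random.Random(0)
n, n_layers, true_layer = 200, 4, 2

def fake_row():
    return [(1 if rng.random() < 0.6 else 0) if layer == true_layer
            else rng.choice((-1, 0, 1)) for layer in range(n_layers)]

gain = [fake_row() for _ in range(n)]
contrast = [fake_row() for _ in range(n)]

select_idx, audit_idx = split_disjoint(n, 0.5, rng)
hook, p_value = select_hook(gain, select_idx, alpha_sel=0.025, n_perm=200, rng=rng)
frozen_hook = hook  # FREEZE: no further changes to the hook after this line
lower = None
if frozen_hook is not None:
    lower = audit_lower_bound(contrast, audit_idx, frozen_hook, 0.025, 1000, rng)
print("hook:", frozen_hook, "select p-value:", p_value, "audit lower bound:", lower)
```

The design point the sketch tries to show is that `audit_idx` is never touched until the hook is frozen, so the AUDIT statistic is computed at a hook chosen independently of the AUDIT examples.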

If this is right

On Qwen2.5-7B and Llama3-8B, three few-shot/puzzle cells pass confirmatory K=1 localization with held-out gaps of +32.6, +45.1, and +17.7.
Fixed-hook reruns recover 47.0 percent on Qwen-puzzle and 39.0 percent on Llama3-puzzle at n=100.
Frozen MATH-500 transfer recovers 26.0 percent.
After audit, Llama3-PZ and Qwen-PZ are identity-light with moderate magnitude while Llama3-FS remains a single-seed moderate-positive candidate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same audit template could be applied to other patching techniques to make localization claims statistically comparable across papers.
Repeating the protocol on larger model families would test whether source-control certificates remain stable as scale increases.
Embedding the audit inside existing interpretability toolkits would let practitioners reject unverified mechanism labels before publication.

Load-bearing premise

Permutation calibration in the SELECT stage combined with sample-split disjointness between SELECT and AUDIT stages produces a valid overall Type-I guarantee.

What would settle it

Apply the full three-stage procedure to a controlled dataset whose true source mechanism is already known and check whether the observed rate of incorrect labels stays below alpha.
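A toy version of that check can be sketched as a Monte Carlo run on data with a known null mechanism. This is a simplified stand-in and not the paper's protocol. It uses a single one-sided sign-flip test in place of the full three stages, and the function names, sample sizes, and the symmetric null distribution are illustrative assumptions.

```python
import random

def sign_flip_p_value(diffs, n_perm, rng):
    """One-sided sign-flip permutation p-value for 'mean of diffs > 0'."""
    observed = sum(diffs)
    exceed = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)

def false_label_rate(n_trials, n_examples, alpha, n_perm, seed):
    """Fraction of null datasets on which the test wrongly reports an effect."""
    rng = random.Random(seed)
    false_labels = 0
    for _ in range(n_trials):
        # Known ground truth: paired contrasts are symmetric around zero.
        diffs = [rng.choice((-1, 0, 1)) for _ in range(n_examples)]
        if sign_flip_p_value(diffs, n_perm, rng) <= alpha:
            false_labels += 1
    return false_labels / n_trials

rate = false_label_rate(n_trials=200, n_examples=40, alpha=0.05, n_perm=199, seed=0)
print(f"empirical false-label rate under the null: {rate:.3f} (target: at most 0.05)")
```

With only 200 trials the estimate is noisy, so a value slightly above alpha would not by itself refute the bound. A real check would use many more trials and the complete SELECT, FREEZE, AUDIT pipeline.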

Figures

Figures reproduced from arXiv: 2606.04717 by Jianwei Tai.

**Figure 2.** Mechanism-label typology in the (D_rand, D_zero) plane: lower bounds of paired 95% bootstrap CIs, in percentage points. Dashed lines mark the δ = 25 pp practical margin. The two PZ cells (blue) sit in the identity-light + moderate-magnitude quadrant; the single positive-advantage candidate Llama3-FS (red) lies in the positive-identity + moderate-magnitude region but its D_rand = 8 is well below δ, consistent…

Original abstract

Chain-of-thought (CoT) answer-hijack templates can flip the final numeric answer of a 7B-8B language model on GSM8K or MATH-500 even when the visible reasoning trace looks fluent. Activation patching is the standard probe for locating where this hijack can be undone, and a successful clean-source patch is often read as evidence that the patched activation carries the recovered content. We show that this reading is unsound: clean-only localization profiles (peak, spread, thresholded band) underidentify the frozen-hook source contrast, and the clean-only profile is an intervention map, not a mediation certificate. We then construct an audit that turns each candidate patch into a source-control certificate with a pre-registered Type-I guarantee. The certificate runs in three stages: SELECT (clean-source band sweep with permutation calibration and held-out validation), FREEZE (lock the hook), and AUDIT (paired-bootstrap source contrasts at the frozen hook). It emits an incorrect mechanism label with probability at most alpha = alpha_sel + alpha_audit under sample-split disjointness. A matching-rate sample-complexity theorem (n_star = Theta(Delta^{-2} log(1/alpha))) bounds the audit cost. On Qwen2.5-7B and Llama3-8B, three few-shot/puzzle cells pass confirmatory K=1 localization with held-out gaps +32.6, +45.1, +17.7; fixed-hook reruns recover 47.0% (Qwen-puzzle) and 39.0% (Llama3-puzzle) at n=100; frozen MATH-500 transfer recovers 26.0%. After audit, Llama3-PZ and Qwen-PZ are identity-light with moderate magnitude (Qwen-PZ also layer-sensitive); Llama3-FS is a single-seed moderate-positive candidate (multi-seed replication queued); Qwen-FS is exploratory non-separation with a layer-sensitive flag. The method is a diagnostic auditing protocol, not an adaptive safety defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper gives a three-stage auditing protocol with claimed Type-I control for CoT patching localization, but the theorem and experiments still need more detail to be convincing.

The main takeaway is that this paper tries to fix a real problem in activation patching by adding a pre-registered Type-I guarantee through a three-stage audit, but the supporting math and experiments are not yet solid enough to change how most people do localization.

What is new is the source-control certificate built from SELECT with permutation calibration, FREEZE, and AUDIT with paired bootstrap. It claims the error rate is at most alpha_sel plus alpha_audit when the stages use disjoint samples. The matching-rate sample complexity n_star = Theta(Delta^{-2} log(1/alpha)) is also presented as a bound on how many examples you need. The paper does a decent job pointing out that clean-only profiles are intervention maps rather than mediation certificates, and it applies the method to a few cells on Qwen2.5-7B and Llama3-8B, getting some positive recovery rates after audit.

The soft spots are in the execution. The abstract states the Type-I control and the theorem but does not include the derivation steps or the assumptions needed for the permutation and bootstrap to work together. Without those, it's difficult to check if the guarantee actually holds or if there are hidden dependencies. The reported numbers like 47% and 39% recovery at n=100 come without error bars, baseline comparisons, or multi-seed stats, so it's hard to know if they are reliable. The transfer result on MATH-500 is only 26%, which suggests the method may not generalize easily. The work stays within math word problems, so broader claims about CoT hijacks would need more testing.

This paper is aimed at mechanistic interpretability researchers who use patching and want to make their localization claims more statistically defensible. A reader working on similar auditing protocols would get value from the protocol design and the sample complexity result. It deserves a serious referee because it addresses a methodological issue with a concrete proposal, even though the current version needs more detail on the proofs and stronger experimental validation.

I would recommend sending it to peer review with requests for the full theorem proof and additional controls in the experiments.

Referee Report

1 major / 3 minor

Summary. The paper claims that clean-only activation patching profiles underidentify the source contrast in CoT answer-hijack localization for 7B-8B LLMs on GSM8K/MATH-500, rendering them unsound as mediation certificates. It introduces a three-stage auditing protocol (SELECT with permutation calibration and held-out validation, FREEZE to lock the hook, AUDIT with paired-bootstrap contrasts) that issues source-control certificates with Type-I error control: the probability of an incorrect mechanism label is at most alpha = alpha_sel + alpha_audit under sample-split disjointness. A matching-rate sample-complexity theorem states n_star = Theta(Delta^{-2} log(1/alpha)). Experiments report confirmatory localization in three cells with held-out gaps of +32.6, +45.1, +17.7; recovery rates of 47.0% (Qwen-puzzle) and 39.0% (Llama3-puzzle) at n=100; 26.0% MATH-500 transfer; and post-audit characterizations of the passing cells.

Significance. If the Type-I guarantee and sample-complexity bound hold, the work supplies a statistically grounded auditing protocol that could elevate standards in mechanistic interpretability by replacing heuristic localization with certified source control. The explicit error-rate decomposition and practical recovery numbers on two model families are concrete strengths; the method is positioned as a diagnostic rather than a defense, which aligns with its scope.

major comments (1)

[Abstract / Theorem statement] The central Type-I guarantee (probability of incorrect label ≤ alpha_sel + alpha_audit) and the sample-complexity theorem n_star = Theta(Delta^{-2} log(1/alpha)) are stated in the abstract and introduction, but the manuscript supplies neither the derivation steps nor the explicit assumptions (e.g., conditions on permutation calibration validity and the union bound under disjoint SELECT/AUDIT splits) needed to substantiate them. This is load-bearing for the primary claim.

minor comments (3)

[Experiments] Recovery percentages (47.0%, 39.0%, 26.0%) are reported without error bars, confidence intervals, or baseline comparisons against random or non-audited patching; adding these would strengthen the experimental section.
[Methods] The notation alpha_sel and alpha_audit is used before being defined; a single consolidated definition paragraph early in the methods would improve clarity.
[Experiments] The description of the three few-shot/puzzle cells that pass confirmatory K=1 localization would benefit from explicit listing of the exact prompts or puzzle templates used.
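To illustrate what the requested error bars might look like, the sketch below computes 95% Wilson score intervals for the two recovery rates reported at n=100. It assumes 47.0% and 39.0% correspond to 47 and 39 successes out of 100 independent trials. The helper name `wilson_interval` is hypothetical. The MATH-500 figure is left out because its sample size is not stated here.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

for label, k in (("Qwen-puzzle", 47), ("Llama3-puzzle", 39)):
    lo, hi = wilson_interval(k, 100)
    print(f"{label}: {k}/100 recovered, 95% Wilson interval [{lo:.3f}, {hi:.3f}]")
```

At n=100 each interval spans roughly ten percentage points on either side of the point estimate, which is why the referee's request matters for comparing the two cells.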

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the load-bearing nature of the Type-I guarantee and sample-complexity result. We agree that explicit derivation and assumptions are required and will revise the manuscript to include them.

Referee: [Abstract / Theorem statement] The central Type-I guarantee (probability of incorrect label ≤ alpha_sel + alpha_audit) and the sample-complexity theorem n_star = Theta(Delta^{-2} log(1/alpha)) are stated in the abstract and introduction, but the manuscript supplies neither the derivation steps nor the explicit assumptions (e.g., conditions on permutation calibration validity and the union bound under disjoint SELECT/AUDIT splits) needed to substantiate them. This is load-bearing for the primary claim.

Authors: We agree with the referee that the derivation steps and explicit assumptions must be supplied. The current manuscript states the Type-I bound alpha = alpha_sel + alpha_audit and the matching-rate theorem but omits the full proof. In revision we will insert a new subsection (in Methods or Appendix) that (i) derives the error decomposition from the disjoint SELECT/AUDIT splits, (ii) states the exchangeability assumption required for permutation calibration to control alpha_sel, (iii) applies the union bound across the two stages, and (iv) specifies the conditions (bounded variance of the bootstrap contrast, minimum effect size Delta) under which n_star = Theta(Delta^{-2} log(1/alpha)) holds. This directly substantiates the primary claim.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard statistical primitives

The paper's central claim is a Type-I error bound alpha = alpha_sel + alpha_audit for a three-stage SELECT-FREEZE-AUDIT procedure that uses permutation calibration in SELECT and paired-bootstrap contrasts in AUDIT, together with sample-split disjointness. The matching-rate sample complexity n_star = Theta(Delta^{-2} log(1/alpha)) is the standard form of a concentration inequality and is not derived from any fitted parameter or self-referential definition inside the paper. No self-citation is invoked as a load-bearing uniqueness theorem, no ansatz is smuggled via prior work, and no prediction is obtained by renaming a fitted input. The procedure is therefore self-contained against external statistical benchmarks (permutation tests and bootstrap) whose validity does not reduce to quantities constructed within the present manuscript.
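The Delta^{-2} log(1/alpha) shape can be seen in a plain Hoeffding calculation. The sketch below is an illustration with assumed constants, not the paper's theorem: it supposes per-example contrasts bounded in [-1, 1] (range 2) and a decision threshold at Delta/2, and the function name and example values are hypothetical.

```python
from math import ceil, log

def hoeffding_sample_size(delta, alpha, value_range=2.0):
    """Smallest n with exp(-2 n t^2 / R^2) <= alpha for t = delta / 2.

    Hoeffding's inequality for the mean of n independent values with range R
    gives P(mean - mu >= t) <= exp(-2 n t^2 / R^2); solving for n yields
    n >= R^2 log(1/alpha) / (2 t^2).
    """
    t = delta / 2
    return ceil(value_range ** 2 * log(1 / alpha) / (2 * t ** 2))

n_quarter = hoeffding_sample_size(delta=0.25, alpha=0.05)
n_eighth = hoeffding_sample_size(delta=0.125, alpha=0.05)
print(n_quarter, n_eighth)
```

Halving Delta roughly quadruples the required n, while tightening alpha only adds a logarithmic factor, which matches the Theta(Delta^{-2} log(1/alpha)) rate. The constants depend on the assumed range and threshold and should not be read as the paper's.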

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The protocol rests on standard statistical assumptions for permutation and bootstrap tests plus the new construct of a source-control certificate; no free parameters are fitted to the target result in the abstract description.

free parameters (2)

alpha: Pre-registered Type-I error level used to bound the overall false-positive rate.
Delta: Effect-size parameter appearing in the sample-complexity bound.

axioms (2)

[standard math] Independence assumptions required for permutation calibration and paired bootstrap to control Type-I error. Invoked to justify that the SELECT and AUDIT stages together bound the error at alpha_sel + alpha_audit.
[domain assumption] Sample splits between SELECT and AUDIT stages remain disjoint. Required for the additive error bound to hold.

invented entities (1)

[no independent evidence] source-control certificate. Purpose: to certify that a frozen activation patch controls the intended source contrast with a Type-I guarantee. New auditing object introduced by the paper; no independent evidence outside the protocol itself is described.

pith-pipeline@v0.9.1-grok · 5915 in / 1511 out tokens · 31407 ms · 2026-06-28T05:58:16.677348+00:00 · methodology
