How Transparent is DiffusionGemma?

Arthur Conmy; Asic Q Chen; Bilal Chughtai; Brendan O'Donoghue; Callum McDougall; Cindy Wu; Janos Kramar; Jean Tarbouriech; João Gabriel Lopes de Oliveira; Joshua Engels

arxiv: 2606.20560 · v1 · pith:4GPA3BQ5new · submitted 2026-06-18 · 💻 cs.LG · cs.AI

Joshua Engels, Callum McDougall, Bilal Chughtai, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue, João Gabriel Lopes de Oliveira, Rohin Shah, Neel Nanda

Pith reviewed 2026-06-26 18:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords diffusion models · model interpretability · transparency · denoising steps · monitorability · autoregressive models · token bottleneck

The pith

DiffusionGemma matches Gemma 4 transparency once information between denoising steps passes through an interpretable token bottleneck.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether DiffusionGemma's heavier use of continuous latent space makes its reasoning less transparent than that of the autoregressive Gemma 4. It splits transparency into variable transparency, which concerns readable intermediate states, and algorithmic transparency, which concerns reconstructing the steps that produced an output. By routing the information that flows between denoising steps through an interpretable token bottleneck, the authors recover variable transparency with no measured drop in downstream performance, cutting the opaque serial depth from 28.6 times to 1.1 times that of Gemma 4. Case studies then surface diffusion-specific patterns such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, the paper reports that monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks, stays comparable to the autoregressive baseline.

Core claim

Although DiffusionGemma performs a larger fraction of its work in continuous space and therefore appears to have 28.6 times the opaque serial depth of Gemma 4, the information that must travel between successive denoising steps can be forced through an interpretable token bottleneck without loss of downstream performance. This mapping reduces the effective opaque depth to 1.1 times the autoregressive figure. Algorithmic transparency is harder because every token on the canvas can be revised at every step, yet the same bottleneck states already suffice for the monitorability tasks examined.

What carries the argument

The interpretable token bottleneck that extracts and re-injects discrete token information between successive denoising steps
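The idea of a token bottleneck can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's construction: the abstract does not say how the mapping is built, so the argmax-then-re-embed rule, the `toy_denoise_step` function, and every name and size below are hypothetical.

```python
import random

def token_bottleneck(logits, embedding):
    """Collapse each position's logits to one token id, then re-embed it.

    Assumption: argmax followed by an embedding lookup stands in for the
    paper's (unspecified) mapping.
    """
    token_ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return token_ids, [list(embedding[t]) for t in token_ids]

def toy_denoise_step(hidden, unembed):
    """Hypothetical stand-in for one denoising step (not the real model).

    Mixes each position with its right neighbour, then scores every token.
    """
    n = len(hidden)
    mixed = [[(a + b) / 2 for a, b in zip(hidden[p], hidden[(p + 1) % n])]
             for p in range(n)]
    return [[sum(h * u for h, u in zip(vec, row)) for row in unembed]
            for vec in mixed]

def run_with_bottleneck(num_steps, canvas_len, vocab_size, dim, seed=0):
    """Run the toy loop; only discrete tokens cross each step boundary."""
    rng = random.Random(seed)
    embedding = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
    hidden = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(canvas_len)]
    trace = []
    for _ in range(num_steps):
        logits = toy_denoise_step(hidden, embedding)
        token_ids, hidden = token_bottleneck(logits, embedding)
        trace.append(token_ids)  # readable snapshot after this step
    return trace
```

In this sketch, `trace` holds one list of token ids per denoising step, which is the kind of intermediate state a reader or monitor could inspect. Whether such a hard bottleneck preserves a real model's behavior is exactly the empirical question the paper addresses.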

If this is right

Variable transparency of DiffusionGemma becomes comparable to Gemma 4 once the bottleneck states are treated as readable.
Monitorability for downstream detection tasks remains essentially unchanged between the diffusion and autoregressive models.
Algorithmic transparency stays more difficult because the model can revise every token at every step.
Case studies already reveal diffusion-specific behaviors including non-chronological reasoning and sequence smearing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bottleneck technique could be tested on other diffusion language models to check whether the 1.1-times depth reduction generalizes.
The newly observed phenomena such as token smearing could be probed for effects on safety or alignment benchmarks.
Quantifying how much computation still occurs outside the bottleneck would clarify the remaining opaque fraction.

Load-bearing premise

Routing information through an interpretable token bottleneck at each denoising step preserves the original computation and downstream performance without introducing artifacts that change the model's effective algorithm.

What would settle it

An experiment that inserts the token bottleneck into every denoising step and then measures a statistically significant drop in accuracy or change in output distribution on a standard benchmark would falsify the transparency claim.
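One way such a check could be run is a paired permutation test on per-example scores. The sketch below is an assumption about how a reader might test for a drop, not the authors' procedure; the function name, the score format (one number per benchmark example, same examples in both lists), and the default resample count are all hypothetical.

```python
import random

def paired_permutation_test(baseline, bottleneck, num_resamples=2000, seed=0):
    """One-sided p-value for 'the bottleneck lowers the mean score'.

    Both lists hold per-example scores on the same examples, in the same order.
    """
    if len(baseline) != len(bottleneck) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(baseline, bottleneck)]  # positive = bottleneck worse
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(num_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if resampled >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (num_resamples + 1)
```

A small p-value would indicate a significant drop. A large p-value alone does not establish that performance is preserved, so an equivalence margin or confidence interval would still be needed to support a "no decrease" claim.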

Original abstract

LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.
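The abstract defines opaque serial depth as the amount of serial computation that occurs between interpretable model states, without giving the exact calculation. The following is a minimal sketch of one reading of that definition, assuming depth is counted in arbitrary units per step and that a state is either interpretable or not. The function name and the numbers in the example are hypothetical and are not the paper's 28.6X or 1.1X figures.

```python
def opaque_serial_depth(step_depths, interpretable_after):
    """Longest run of serial computation between consecutive interpretable states.

    step_depths[i]         -- serial depth of step i (for example, layer count)
    interpretable_after[i] -- True if the state after step i is readable
    """
    if len(step_depths) != len(interpretable_after):
        raise ValueError("both lists must have the same length")
    longest = current = 0
    for depth, readable in zip(step_depths, interpretable_after):
        current += depth
        longest = max(longest, current)
        if readable:
            current = 0  # a readable state resets the opaque run
    return longest

# Illustrative numbers only: 4 denoising steps of depth 10 each.
without_bottleneck = opaque_serial_depth([10] * 4, [False, False, False, True])
with_bottleneck = opaque_serial_depth([10] * 4, [True] * 4)
```

In this toy setting the depth falls from 40 to 10 once every inter-step state counts as readable, which mirrors the direction of the reduction the abstract reports.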

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiffusionGemma reaches near-parity transparency with autoregressive models by routing through a token bottleneck, plus some diffusion-specific behaviors.

The main takeaway is that mapping information between denoising steps through an interpretable token bottleneck cuts the opaque serial depth from 28.6X down to 1.1X relative to Gemma 4, with no reported drop in downstream performance. They also surface some behaviors that look specific to diffusion, like non-chronological reasoning and smearing.

The split between variable transparency and algorithmic transparency is a clear way to organize the comparison. The serial depth metric gives a concrete number instead of vague claims, and the case studies actually show examples rather than stopping at the difficulty of changing tokens at every step. The monitorability result landing close to the autoregressive baseline is the most immediately usable part for anyone thinking about oversight.

The algorithmic transparency side still looks limited, and the paper treats the evidence there as initial. The bottleneck claim rests on performance staying flat, which addresses the obvious risk of changing the computation, but more detail on how the mapping was constructed and tested would make the preservation argument tighter. No obvious circularity or self-referential fitting shows up in the abstract.

This is for interpretability people working on non-autoregressive generators or safety applications that need monitorability. A reader who wants a direct head-to-head on transparency metrics will get something concrete. It deserves peer review because the central comparison is testable and the new phenomena are specific enough to check.

Referee Report

2 major / 2 minor

Summary. The paper claims that DiffusionGemma's variable transparency can be made comparable to the autoregressive Gemma 4 by routing information through an interpretable token bottleneck at each denoising step, reducing opaque serial depth from 28.6X to 1.1X with no decrease in downstream performance. It decomposes transparency into variable and algorithmic components, presents case studies identifying diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning, and reports that DiffusionGemma exhibits similar monitorability to Gemma 4.

Significance. If the token-bottleneck result holds, the work would show that diffusion LLMs need not incur a transparency penalty relative to autoregressive models and would supply concrete examples of diffusion-specific reasoning behaviors that could guide future interpretability research. The monitorability finding is directly relevant to safety and oversight applications.

major comments (2)

[Abstract and Variable Transparency section] The claim that the interpretable token bottleneck mapping reduces opaque serial depth to 1.1X 'with no decrease in downstream performance' is load-bearing for the central variable-transparency result; the manuscript must report the exact datasets, performance metrics with error bars, ablation controls, and statistical tests supporting preservation, as none are supplied in the abstract.
[Algorithmic Transparency section] The assertion that algorithmic transparency is harder for diffusion models because 'all token predictions in the canvas can change at every denoising step' is used to motivate the case studies; the paper should supply a quantitative comparison (e.g., a metric of distributed computation) showing this property is materially stronger in DiffusionGemma than in Gemma 4.

minor comments (2)

[Abstract] The term 'opaque serial depth' is used without an inline definition or pointer to its precise calculation.
[Monitorability paragraph] The statement that DiffusionGemma is 'similarly monitorable' should name the exact downstream-task metric and report whether the similarity is statistically significant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the presentation of our central results. We address each major comment below and commit to revisions that improve clarity and evidentiary support without altering the manuscript's core claims.

Referee: [Abstract and Variable Transparency section] The claim that the interpretable token bottleneck mapping reduces opaque serial depth to 1.1X 'with no decrease in downstream performance' is load-bearing for the central variable-transparency result; the manuscript must report the exact datasets, performance metrics with error bars, ablation controls, and statistical tests supporting preservation, as none are supplied in the abstract.

Authors: We agree that the abstract omits these supporting details due to length constraints. The Variable Transparency section of the full manuscript contains the relevant performance comparisons, but to directly address the concern we will expand the section with explicit dataset names, metrics accompanied by error bars from multiple runs, ablation controls for the bottleneck, and statistical tests confirming no significant performance drop. We will also insert a concise reference to these results in the abstract.
revision: yes

Referee: [Algorithmic Transparency section] The assertion that algorithmic transparency is harder for diffusion models because 'all token predictions in the canvas can change at every denoising step' is used to motivate the case studies; the paper should supply a quantitative comparison (e.g., a metric of distributed computation) showing this property is materially stronger in DiffusionGemma than in Gemma 4.

Authors: The statement describes an inherent mechanistic difference: diffusion maintains and revises an entire canvas at each step, whereas autoregressive generation fixes tokens once emitted. This difference is used to motivate the case studies on diffusion-specific behaviors. While the current manuscript presents this qualitatively, we acknowledge a quantitative metric would strengthen the claim. We will add a simple metric (e.g., average token revision rate across denoising steps versus zero revisions possible in the autoregressive baseline) to the Algorithmic Transparency section in revision.
revision: yes
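The revision-rate metric mentioned in the response could be computed along the following lines. This is a hedged sketch rather than the authors' definition: it assumes each denoising step yields a canvas of equal length represented as a list of tokens, and the function name is hypothetical.

```python
def token_revision_rate(canvases):
    """Mean fraction of canvas positions whose token changes between consecutive steps."""
    if len(canvases) < 2:
        return 0.0
    changed = total = 0
    for prev, curr in zip(canvases, canvases[1:]):
        if len(prev) != len(curr):
            raise ValueError("all canvases must have the same length")
        changed += sum(a != b for a, b in zip(prev, curr))
        total += len(prev)
    return changed / total if total else 0.0

# Toy trace of three denoising steps over a three-token canvas.
toy_trace = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "dog", "sat"],
]
rate = token_revision_rate(toy_trace)  # 2 changed positions out of 6 compared
```

For an autoregressive model the same measure is zero on already emitted positions, since tokens are fixed once generated, which is the contrast the response draws.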

Circularity Check

0 steps flagged

No significant circularity identified

The paper reports empirical measurements of variable transparency, algorithmic transparency, and monitorability via case studies and downstream benchmarks. No equations, fitted parameters, or self-citations are described that reduce any reported metric or claim to a definitional equivalence with its inputs. The token-bottleneck mapping is presented as an empirical intervention whose performance preservation is checked against external benchmarks rather than assumed by construction, so the reported results do not depend on the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claims rest on unstated assumptions about the fidelity of the token bottleneck mapping.

