Do Models Read What They Write? Causal Registers in Scratchpad Reasoning

Benjamin Shih; Eric Darve; John Winnicki

arxiv: 2606.29522 · v1 · pith:JNQD52AFnew · submitted 2026-06-28 · 💻 cs.LG · cs.CL


Pith reviewed 2026-06-30 07:24 UTC · model grok-4.3


keywords: scratchpad reasoning, causal intervention, process supervision, state tracking, language model interpretability, intermediate computation, internal representation editing


The pith

Models trained to write scratchpad states follow the downstream consequence of an edit to those states far more often than models trained to give only the final answer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether scratchpad writing in language models creates states that later computation actually depends on. It uses a state-tracking task with a known update rule and compares models trained to output intermediate states against controls that only give final answers. At test time it edits the internal representation of one written state while the visible text stays fixed, then checks whether the model follows the single correct next step implied by the edit. The state-writing model does so on 80% and 91% of held-out examples across the two task variants, while pretrained and final-answer-only models stay near baseline. Additional checks confirm that the prediction requires both the edited state and the current move.
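As a rough illustration of how such an intervention could be scored, the sketch below computes, for each example, the phase bit implied by the edited state and the fraction of predictions that match it. The `State` tuple, `toy_rule`, and the function names are hypothetical stand-ins, not the paper's code or encoding; in a real run the predicted bits would come from a model evaluated with the internal edit applied.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[int, int, int]           # hypothetical encoding: (a, b, phase)
Rule = Callable[[str, State], State]   # known transition rule: (move, state) -> next state


def flip_phase(state: State) -> State:
    """Return the state with the same visible coordinates and the other phase bit."""
    a, b, phase = state
    return (a, b, 1 - phase)


def edited_follow_rate(
    examples: Sequence[Tuple[str, State]],
    predicted_bits: Sequence[int],
    rule: Rule,
) -> float:
    """Fraction of predictions that match the bit implied by the edited state."""
    hits, scored = 0, 0
    for (move, state), predicted in zip(examples, predicted_bits):
        original_bit = rule(move, state)[2]
        edited_bit = rule(move, flip_phase(state))[2]
        if original_bit == edited_bit:
            continue  # the edit would not separate the two hypotheses; skip it
        scored += 1
        hits += int(predicted == edited_bit)
    return hits / scored if scored else float("nan")


def toy_rule(move: str, state: State) -> State:
    """Illustrative rule only: the phase flip depends on the move and visible bits."""
    a, b, phase = state
    if move == "T":
        return (a ^ 1, b, phase ^ a)
    return (a, b ^ 1, phase ^ a ^ b)


# Each example pairs the upcoming move with the state named in the visible text.
examples = [("T", (1, 0, 0)), ("Q", (0, 1, 1))]
rate = edited_follow_rate(examples, predicted_bits=[0, 1], rule=toy_rule)
```

Skipping examples where the two bits agree keeps the score tied to cases with a single discriminating target, which is the property the known transition rule is meant to provide.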

Core claim

In Qwen2.5-Coder-7B, a model trained to write intermediate states before the final answer follows the next phase bit implied by an edited internal state representation on 80% and 91% of held-out examples across the two task variants, whereas models trained only on final answers or left pretrained remain near baseline. The dependence holds after controls for generic next-token steering and for copying another continuation, it requires both the edited state and the current move, and the same pattern replicates across model families.

What carries the argument

Causal intervention that edits the internal representation of one written state while the visible scratchpad text remains unchanged, measured against a known transition rule in a state-tracking task.
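One common way to realize an edit of this kind is subspace patching: overwrite only the component of a hidden vector that lies along a feature direction, using the value from a run in the other state. The sketch below is a minimal pure-Python version under that assumption. The 3-dimensional vectors, the `phase_direction` basis, and the function names are hypothetical; the layer, token position, and feature used in the paper are not given in this summary.

```python
from typing import List, Sequence

Vector = List[float]


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def patch_subspace(
    h_clean: Sequence[float],
    h_source: Sequence[float],
    basis: Sequence[Sequence[float]],
) -> Vector:
    """Overwrite the part of h_clean inside span(basis) with h_source's part.

    Assumes the basis vectors are orthonormal; everything orthogonal to the
    basis is left exactly as it was in h_clean.
    """
    patched = list(h_clean)
    for u in basis:
        delta = dot(h_source, u) - dot(h_clean, u)
        patched = [p + delta * ui for p, ui in zip(patched, u)]
    return patched


# Toy 3-d "residual stream": the third axis plays the role of a phase feature.
phase_direction = [0.0, 0.0, 1.0]
h_clean = [1.0, 2.0, 3.0]      # activation from the run whose text is shown
h_source = [9.0, 8.0, -3.0]    # activation from a run in the other phase
h_patched = patch_subspace(h_clean, h_source, [phase_direction])
```

Because only the chosen subspace changes, the visible token and every other direction of the activation stay as they were, which is what lets the edit be attributed to the state feature rather than to a wholesale replacement.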

If this is right

Process supervision can produce written states that the model treats as inputs to its own later steps.
The same causal-use pattern appears across multiple model families.
Oversight of scratchpads should aim to train states that are both legible and actually used in computation.
The effect is specific to the edited state and the current move rather than generic continuation steering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives that reward causal use of written states could be added to standard process-supervision pipelines.
The same editing technique might be applied to more open-ended reasoning tasks to test whether causal registers appear outside controlled state tracking.
If written states are causally active, methods that monitor or edit those states could directly influence final outputs without changing the visible text.

Load-bearing premise

Editing the internal representation of one written state while leaving the visible scratchpad text fixed isolates the causal dependence of later computation on that written state rather than on other internal variables.

What would settle it

The claim would fail if the state-writing model, after the internal edit, predicted the next phase bit implied by the edit at rates no higher than the final-answer-only control on the held-out examples of either task variant.

Figures

Figures reproduced from arXiv: 2606.29522 by Benjamin Shih, Eric Darve, John Winnicki.

**Figure 1.** The task isolates an order-dependent state variable. The phase bit is printed in the running state, but it is not determined by the visible coordinate alone. We edit the highlighted phase-bit representation while leaving the printed token fixed. (Following text, from Section 4, "Editing the state at a scratchpad token": The intervention is designed to distinguish three possibilities that ordinary accuracy cannot separate. The phase bit may…)

**Figure 2.** The counterfactual scored by the intervention. The text names the original state s, while the residual-stream feature at the current phase token is overwritten with the same-visible state s̃. Because the upcoming move is unchanged, the two branches have a single discriminating target: a model that computes from the patched representation at the current-state site should predict m · s̃, while a model …

**Figure 3.** Transition-rule consistency controls. Each row compares the edited-state target with the strongest matched alternative for that test. Use compares the state edit with random or orthogonal edits; move-specific compares the same edited state under an alternate move; computed-not-copied compares following the current move with following the injected source future. Positive gaps mean the edit behaves like a st…

**Figure 4.** Ablation and restoration validate the update routes. Counterfactual-update selectivity is shown under the intact edit, route ablation, matched control, and restoration. If a route carries the edited-state effect, ablating it should reduce selectivity while the matched control and restoration stay near the intact value. This is nearly complete for one Q8 edge and partial for the fixed four-edge D8 route. Fu…

**Figure 5.** Same moves, same visible path, different phase. The move sequence T Q T Q T T from 00|0 produces the same visible coordinates in Q8 and D8, but the order-sensitive phase bit diverges where boxed. This is a longer example of the row-dependent phase update in …

**Figure 6.** How the Q8 and D8 phase updates differ. The eight states are drawn as two phase layers over the four visible coordinates. The move Q has the same phase-flip pattern in the two systems. The move T crosses phase layers in Q8 but preserves phase in D8. The coral path traces the orbit of move T from 00|0: it threads both phase layers in Q8 (order 4) but stays within one in D8 (order 2). Algebraic construction. …

**Figure 7.** Linear-probe readout of the next-state phase coordinate by layer. Probe accuracy is shown …

**Figure 8.** Compact state feature, variable update route. Left: random and orthogonal controls stay near zero, while the edited-state effect reaches full strength by rank 1 in Q8 and rank 2 in D8, far below the residual-stream dimension. Right: the route that uses this compact feature is concentrated in Q8, where one edge removes nearly all of the effect, and more distributed in D8, where reduction accumulates over a …

**Figure 9.** Candidate-scan landscape for the update path. Each bar is one candidate component from the fixed scan, ranked by how much its ablation reduces Counterfactual Update Selectivity (CUS). Q8 has an isolated layer-22 current-phase edge. D8 has no single dominant edge, motivating the frozen four-edge route analyzed in Tables 8 and 9. [Panel: Q8 localization (layer-22 edge); conditions: baseline, ablation, matched control, restoration; y-axis: CUS, sp…]

**Figure 10.** The D8 four-edge route is unusually strong among same-size edge sets. Distribution of CUS removed by 1000 random four-edge sets drawn from the 32 literal attention edges in the candidate landscape. The frozen route E4 removes 0.532 and lies at the 99.6th percentile; the null median is 0.055 and the 95th percentile is 0.254. A single random set removes more than E4, so the claim is percentile strength and…
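The Q8 and D8 update rules described in the captions of Figures 5 and 6 can be made concrete with a toy encoding. The sketch below is one hypothetical encoding chosen only to reproduce the properties stated there (Q flips the phase identically in both systems; T has order 4 in Q8 and order 2 in D8); the paper's actual construction and printed phase values may differ. States are written `ab|phase`, with `ab` the visible coordinates.

```python
from typing import List, Tuple

State = Tuple[int, int, int]  # (a, b, phase), printed as "ab|phase"


def step(move: str, state: State, system: str) -> State:
    """Apply one move under a hypothetical encoding of the Q8 or D8 update."""
    if system not in ("Q8", "D8"):
        raise ValueError(f"unknown system: {system!r}")
    a, b, phase = state
    if move == "T":
        # T crosses phase layers in Q8 (when a == 1) and never flips phase in D8.
        flip = a if system == "Q8" else 0
        return (a ^ 1, b, phase ^ flip)
    if move == "Q":
        # Q uses the same phase-flip pattern in both systems.
        return (a, b ^ 1, phase ^ a ^ b)
    raise ValueError(f"unknown move: {move!r}")


def trace(moves: str, system: str, start: State = (0, 0, 0)) -> List[str]:
    """Return the printed state after each move in the sequence."""
    printed = []
    state = start
    for move in moves:
        state = step(move, state, system)
        printed.append(f"{state[0]}{state[1]}|{state[2]}")
    return printed


q8 = trace("TQTQTT", "Q8")
d8 = trace("TQTQTT", "D8")
same_visible = [s[:2] for s in q8] == [s[:2] for s in d8]
first_divergence = next(i for i, (x, y) in enumerate(zip(q8, d8)) if x != y)
```

In this encoding the two systems print identical visible coordinates for the whole sequence while the phase bits separate partway through, so the phase bit cannot be recovered from the visible path alone and has to be tracked through the order of moves.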

Original abstract

A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a state, later steps should compute from that state. To test this requirement, we use a controlled state-tracking task with a known update rule, comparing models trained to report only the final state with models trained to write intermediate states before giving the final answer. At evaluation, we edit the internal representation of one written state while leaving the visible scratchpad text fixed. Because the transition rule is known, the edit has a single correct downstream consequence. In Qwen2.5-Coder-7B, the state-writing model predicts the next phase bit implied by the edited state on 80% and 91% of held-out examples across the two task variants, while pretrained and final-answer-only controls remain near baseline. Additional controls rule out generic next-token steering and copying another continuation: the prediction depends on both the edited state and the current move. The same causal-use pattern replicates across model families. Together, these results suggest a sharper goal for scratchpad oversight: not just to make intermediate reasoning legible, but to train written states that the model uses as part of its computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that state-writing training can make models causally depend on their own scratchpad states in a controlled task, with internal edits producing the expected downstream effects.


The core result is that models trained to write intermediate states on a state-tracking task respond to targeted internal edits of those states by shifting their next predictions accordingly, while final-answer-only and pretrained controls do not. The edit keeps the visible text fixed, so the change comes from the model's internal use of the written state rather than surface copying.

What stands out is the clean setup: a known update rule lets them define the single correct consequence of each edit, and they add controls that rule out generic next-token steering or just copying a different continuation. The state-writing model follows the edit on 80% and 91% of held-out cases across the two task variants, and the pattern replicates across model families. That is a direct test of the causal link that process supervision hopes for.

The main limitation is the narrow task. State tracking with an explicit rule is far from the open-ended reasoning where scratchpads are usually applied, so it is still unclear whether the same causal dependence emerges under normal training on harder problems. The editing procedure itself is only sketched in the abstract, which leaves some questions about implementation details and statistical power.

This is worth attention for people working on process supervision and mechanistic interpretability. The design is thoughtful enough and the controls address the obvious confounds, so it deserves a serious referee even if the result needs broader testing.

Referee Report

0 major / 3 minor

Summary. The paper claims that models trained to write intermediate states in scratchpads causally use those states in later computation. Using state-tracking tasks with known update rules, state-writing models are compared to final-answer-only and pretrained controls. Interventions edit the internal representation of a written state (visible text fixed); the state-writing model then predicts the next phase bit implied by the edit on 80% and 91% of held-out examples in two variants for Qwen2.5-Coder-7B, while controls stay near baseline. Additional controls show the effect depends on both the edited state and current move; the pattern replicates across model families.

Significance. If the result holds, the work supplies direct evidence that process supervision can produce written states that are not merely legible but causally integrated into the model's computation. This is relevant for alignment techniques that rely on scratchpad oversight. The manuscript earns credit for its explicit controls against generic next-token steering and copying, plus replication across model families, which together address the isolation concern raised by the intervention design.

minor comments (3)

[§3, Methods] State explicitly the exact layer, head, and token position at which the state representation is edited; this detail is needed to assess whether the intervention truly targets only the written state.
[Table 1, Figure 3] Report the exact number of held-out examples and any statistical test used for the 80% / 91% figures so readers can evaluate precision and variability.
[§5, Discussion] The claim that the result 'suggests a sharper goal for scratchpad oversight' would be strengthened by a short paragraph contrasting the observed causal-use rates with those expected under pure next-token prediction.
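To illustrate why the second comment matters, the sketch below computes a Wilson score interval for a reported proportion. The interval width depends directly on the number of held-out examples, which this review does not have; `n=100` is a placeholder chosen only for illustration, and `wilson_interval` is a hypothetical helper, not something taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Placeholder sample size: the paper's actual number of held-out examples
# is not stated in the abstract.
low, high = wilson_interval(successes=80, n=100)
```

With a few hundred examples the interval around 80% would be several points narrower than with one hundred, so the same headline rate can support quite different levels of confidence.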

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of the work and for the positive assessment, including the recommendation for minor revision. The report correctly identifies the core contribution regarding causal use of written states under interventions. No major comments were listed in the report, so we have no point-by-point responses to provide at this stage. We remain available to address any minor suggestions or clarifications during revision.

Circularity Check

0 steps flagged

No significant circularity; empirical intervention study


The paper reports an empirical causal intervention experiment on scratchpad reasoning in LLMs. It trains models on state-tracking tasks, performs targeted edits to internal representations of written states, and measures downstream prediction accuracy against controls (pretrained, final-answer-only, and additional steering/copying controls). No equations, fitted parameters, ansatzes, or self-citations are used to derive the central claims; the results are direct experimental measurements on held-out examples across model families. The protocol is self-contained against external benchmarks and does not reduce any prediction to a definitional equivalence or prior self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies almost no information on modeling assumptions or parameters.

axioms (1)

Domain assumption: the transition rule is known, so each edit has exactly one correct downstream consequence.
Stated in the abstract as the basis for measuring whether the model follows the edited state.
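A short derivation shows how this assumption yields a single discriminating target. It rests on a modeling assumption of ours that is not stated on this page: a state is a pair of visible coordinate and phase bit, and each move flips the phase by an amount that depends only on the move and the visible coordinate. The symbols $v$, $p$, and $f_m$ are introduced here for illustration.

```latex
% Assumption: s = (v, p), with visible coordinate v and phase bit p in {0, 1};
% a move m flips the phase by f_m(v), which depends only on m and v.
\[
  m \cdot (v, p) = \bigl(m \cdot v,\; p \oplus f_m(v)\bigr),
  \qquad
  \tilde{s} = (v,\; 1 \oplus p).
\]
% The edited state shares v with s, so the same flip f_m(v) applies:
\[
  \operatorname{phase}(m \cdot \tilde{s})
  = 1 \oplus p \oplus f_m(v)
  = 1 \oplus \operatorname{phase}(m \cdot s).
\]
```

Under this assumption the edited and unedited branches always disagree on the next phase bit while agreeing on the next visible coordinate, and the target still depends on the move through $f_m$.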

pith-pipeline@v0.9.1-grok · 5759 in / 1199 out tokens · 36492 ms · 2026-06-30T07:24:07.521772+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 10 canonical work pages · 7 internal anchors

[1] J. Engels, E. J. Michaud, I. Liao, W. Gurnee, and M. Tegmark. Not all language model features are one-dimensionally linear. arXiv:2405.14860.
[2] S. Feucht, T. Haklay, U. Bhalla, et al. Arithmetic in the wild: Llama uses base-10 addition to reason about cyclic concepts. arXiv:2605.01148.
[3] S. Kantamneni and M. Tegmark. Language models use trigonometry to do addition. arXiv:2502.00873.
[4] T. Lanham, A. Chen, A. Radhakrishnan, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702.
[5] M. Nye, A. J. Andreassen, G. Gur-Ari, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv:2112.00114.
[6] C. Olsson, N. Elhage, N. Nanda, et al. In-context learning and induction heads. arXiv:2209.11895.
[7] A. Syed, C. Rager, and A. Conmy. Attribution patching outperforms automated circuit discovery. arXiv:2310.10348.
[8] A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Steering language models with activation engineering. arXiv:2308.10248.
[9] D. Wurgaft, C. Rager, M. Kowal, et al. Manifold steering reveals the shared geometry of neural network representation and behavior. arXiv:2605.05115.
[10] A. Zou, L. Phan, S. Chen, et al. Representation engineering: A top-down approach to AI transparency. arXiv:2310.01405.
[11] (Extracted fragment of the paper, not a reference.) These curves are representational diagnostics only; the causal-use claim rests on the intervention tests and controls. Figure 7: Linear-probe readout of the next-state phase coordinate by layer. [Plot: panels Q8 and D8; x-axis: layer; y-axis: next-state readout decode accuracy, with a chance line; series: base, final-answer, running-state.] Probe accuracy i...

[12] (Extracted fragment of the paper, not a reference.) Across all ten splits both matched controls preserve the selectivity (the quotient-source and off-target destination controls stay near 0.82), the clean restoration recovers it, and ordinary unedited behavior is intact (P(p̂_{t+1} = p(m·s)) = 0.88). No single edge passes the single-edge criterion (the largest, L22, removes only 24 percent), so the D8 update is distributed where the Q8 update is concentrated.

[13] (Extracted fragment of the paper, not a reference.) Entries are mean CUS removed.

| Edge | Alone | Leave-one-out drop |
|---|---|---|
| Layer 22, current phase bit | 0.212 | 0.346 |
| Layer 19, move token | 0.113 | 0.021 |
| Layer 25, current phase bit | 0.078 | 0.039 |
| Layer 23, current phase bit | 0.069 | 0.258 |
| Full route E4 (all four) | 0.532 | n/a |

Appendix G, Extended related work. Scratchpads and faithfulness. Scratchpad and chain-of-thought methods show that ask...
