Markovian Circuit Tracing for Transformer State Dynamic

Abdullah X

arxiv: 2605.20824 · v1 · pith:WVK7EY2Qnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 06:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords transformer interpretability · hidden Markov models · state dynamics · causal interventions · residual activations · synthetic benchmarks · next-token prediction · counterfactual targets

The pith

Patching recovered-state centroids from residual activations steers transformer predictions toward exact hidden Markov model counterfactual distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers trained on synthetic HMM tasks learn to approximate Bayes-optimal next-token predictions with low excess loss. Residual activations in these models hold partial information about the underlying latent states. Extracting coarse state abstractions as centroids from the activations allows targeted interventions. Patching the correct centroid into the model moves output distributions substantially closer to the forced-state counterfactual targets produced by an exact HMM. This setup supplies a controlled benchmark for checking whether transformers maintain internal state-transition structure during sequence computation.
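The Bayes-optimal reference that the transformers are compared against can be sketched with a standard HMM forward filter. This is a minimal illustration, not the paper's code: the function names, the two-state matrices, and the convention that the state transitions before each emission are all assumptions made for the example.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def predict_state(belief, T):
    """Push a belief over the current state one step through transition matrix T."""
    n = len(belief)
    return [sum(belief[i] * T[i][j] for i in range(n)) for j in range(n)]


def next_token_dist(belief, T, E):
    """Bayes-optimal next-token distribution: mix emission rows by the predicted state."""
    predicted = predict_state(belief, T)
    vocab = len(E[0])
    return [sum(predicted[j] * E[j][v] for j in range(len(predicted))) for v in range(vocab)]


def belief_update(belief, token, T, E):
    """Filtering step: predict the next state, then condition on the observed token."""
    predicted = predict_state(belief, T)
    return normalize([predicted[j] * E[j][token] for j in range(len(predicted))])


# Toy two-state, two-token HMM (illustrative values, not taken from the paper).
T = [[0.9, 0.1], [0.2, 0.8]]  # T[i][j] = P(next state j | current state i)
E = [[0.8, 0.2], [0.3, 0.7]]  # E[j][v] = P(token v | state j)

belief = [0.5, 0.5]
for token in [0, 0, 1]:
    belief = belief_update(belief, token, T, E)
bayes_prediction = next_token_dist(belief, T, E)
```

The belief vector is what the residual activations are said to partially encode, and `bayes_prediction` is the target a near-Bayes transformer should approximate at that position.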

Core claim

Markovian Circuit Tracing recovers coarse state abstractions from residual activations in tiny causal transformers trained on synthetic HMM families. When these recovered-state centroids are patched into the model, the KL divergence to the exact HMM counterfactual target drops from 0.1957 in the unpatched case to 0.0532 on average, outperforming wrong-state, mean-activation, random-activation, and shuffled-label controls across persistent, lower-state, ambiguous-emission, and six-state regimes.
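To put the two reported averages in proportion, the sketch below defines a KL divergence and computes the size of the drop. The direction KL(target || model) is an assumption, since the available text only says "KL to the exact HMM counterfactual target", and `kl_divergence` is an illustrative helper rather than the paper's implementation.

```python
import math


def kl_divergence(target, model, eps=1e-12):
    """KL(target || model) in nats; terms with target probability 0 contribute 0."""
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(target, model) if p > 0)


# Averages reported in the text above.
unpatched_kl = 0.1957
patched_kl = 0.0532

absolute_drop = unpatched_kl - patched_kl
relative_drop = absolute_drop / unpatched_kl
print(f"absolute drop: {absolute_drop:.4f}, relative drop: {relative_drop:.1%}")
```

The relative drop comes out at about 72.8% and does not depend on whether KL is measured in nats or bits. It is a ratio of averages, so it says nothing about how the reduction varies across families or seeds.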

What carries the argument

Markovian Circuit Tracing (MCT), a diagnostic pipeline that extracts state centroids from residual activations and performs causal state-forcing interventions on synthetic HMM benchmarks with known latent states and counterfactual targets.
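The two core operations, extracting centroids and patching one in, can be sketched as follows. The text does not specify how states are recovered from activations, so this example assumes each activation vector already carries a cluster label (for instance from a clustering step), and it treats patching as overwriting the residual vector at one position. All names and values are hypothetical.

```python
def centroids_by_label(activations, labels):
    """Average the activation vectors that share a recovered-state label."""
    sums, counts = {}, {}
    for vec, lab in zip(activations, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + x for s, x in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def patch_position(residuals, position, centroid):
    """Return a copy of the residual stream with one position replaced by a centroid."""
    patched = [list(vec) for vec in residuals]
    patched[position] = list(centroid)
    return patched


# Toy residual vectors and recovered-state labels (illustrative only).
activations = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]

centroids = centroids_by_label(activations, labels)
patched = patch_position(activations, position=2, centroid=centroids[0])
```

In a real model the patched residuals would be fed through the remaining layers, and the resulting output distribution compared with the forced-state target. The wrong-state, mean-activation, and random-activation controls would swap in a different vector at the same position.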

If this is right

State forcing via centroid patching supplies a direct causal test of whether a transformer maintains coarse transition structure rather than computing only locally.
The same pipeline can compare state-recovery strength across HMM regimes that differ in persistence, ambiguity, and number of states.
Transformers that pass the benchmark encode enough Bayesian belief information in residual streams to support forced-state interventions.
The controlled synthetic tasks with exact Bayes-optimal targets and counterfactuals provide a reference for future state-dynamics evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be extended to probe whether larger language models internally track latent states when predicting over natural language sequences.
If the centroid correspondence holds more generally, it would imply that some transformer computations can be reframed as approximate movement through discrete state spaces.
Further tests might examine whether multi-layer patching recovers finer-grained or hierarchical state information beyond the coarse centroids used here.

Load-bearing premise

Centroids extracted from residual activations correspond to the true latent states of the underlying HMM in a way that allows causal intervention via patching to produce the expected counterfactual output distribution.
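One way to read "expected counterfactual output distribution" is the next-token distribution the exact HMM produces when the current latent state is fixed. The sketch below assumes that reading and assumes the state transitions once before emitting; the paper's precise definition is not given in the text available here, and the matrices are toy values.

```python
def forced_state_target(state, T, E):
    """Next-token distribution when the current latent state is known to be `state`."""
    n, vocab = len(T), len(E[0])
    return [sum(T[state][j] * E[j][v] for j in range(n)) for v in range(vocab)]


# Toy two-state, two-token HMM (illustrative values, not taken from the paper).
T = [[0.9, 0.1], [0.2, 0.8]]  # T[i][j] = P(next state j | current state i)
E = [[0.8, 0.2], [0.3, 0.7]]  # E[j][v] = P(token v | state j)

targets = {s: forced_state_target(s, T, E) for s in range(len(T))}
```

Under these assumptions, forcing state 0 gives a target of 0.75 and 0.25 over the two tokens, and forcing state 1 gives 0.40 and 0.60. The premise is that patching the centroid for a state moves the model's output toward the matching entry of `targets`.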

What would settle it

A replication in which patching the recovered-state centroid fails to reduce KL divergence to the HMM counterfactual target below the levels achieved by wrong-state or random-activation controls.
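The falsification criterion above can be written as a small check. This is a hypothetical helper, not part of MCT: it takes the recovered-centroid KL and a dictionary of control KLs, and returns every control that matches or beats the recovered value. Only the two averages reported in the text are used here; a replication would supply its own wrong-state, mean-activation, random-activation, and shuffled-label numbers.

```python
def controls_not_beaten(recovered_kl, control_kls):
    """Return controls whose KL is at or below the recovered-centroid KL (lower is better)."""
    return {name: kl for name, kl in control_kls.items() if kl <= recovered_kl}


# Reported averages: recovered-centroid patching vs. the unpatched model.
failures = controls_not_beaten(0.0532, {"unpatched": 0.1957})
claim_survives = not failures
```

An empty `failures` dictionary means the central claim survives for the controls supplied. Any entry in it names a control that did at least as well as the recovered centroid.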

Figures

Figures reproduced from arXiv: 2605.20824 by Abdullah X.

**Figure 2.** Family-level behavior. Left, residual MCT transition row-wise KL. Right, held-out order-0 to …

**Figure 3.** True- …

**Figure 4.** Layer-wise results. Later layers often expose stronger transition and belief structure, though the …

**Figure 5.** State-forcing controls. Recovered-centroid patching gives the lowest KL to the exact HMM …

**Figure 6.** Conceptual Markovian state-transition view used by the benchmark. The diagram is illustrative …

**Figure 7.** Residual MCT under cluster-count misspecification. Left, belief reconstruction KL. Right, next …

**Figure 8.** State-forcing controls by family. Recovered-centroid patching is strongest in the families with …

Original abstract

Many sequence computations are easier to study as movement through internal states than as isolated local circuits. We introduce Markovian Circuit Tracing (MCT), a diagnostic pipeline for testing whether transformer activations contain coarse state-transition structure. The benchmark uses synthetic Hidden Markov Model (HMM) tasks where latent states, transition matrices, Bayesian belief vectors, Bayes-optimal predictions, and forced-state counterfactual targets are known exactly. Across six HMM families and three seeds per family, tiny causal transformers learn near-Bayes next-token predictors, with mean excess loss over Bayes of 0.0138. Residual activations contain partial Bayesian belief information in this controlled synthetic benchmark. State abstractions extracted from these activations recover coarse transition signal, strongest in persistent and lower-state regimes, and weaker in ambiguous-emission and six-state regimes. The clearest result comes from state forcing. Patching a recovered-state centroid reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, beating wrong-state, mean-activation, random-activation, and shuffled-label controls. The contribution is a controlled benchmark and evaluation framework for transformer state-dynamics interpretability, with MCT as a simple reference pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MCT sets up a clean synthetic benchmark for recovering coarse state structure from transformer residuals on HMM tasks, with patching that beats controls and gives a usable reference pipeline.

The paper's main contribution is a controlled diagnostic pipeline, Markovian Circuit Tracing, that uses known HMM ground truth to test whether residual activations in small causal transformers encode recoverable state-transition information. The author trains on six HMM families, reaches a mean excess loss of 0.0138 over Bayes-optimal, extracts centroids from activations, and shows that patching a recovered-state centroid drops KL to the exact counterfactual target from 0.1957 to 0.0532 on average. The controls (wrong-state, mean-activation, random-activation, shuffled-label) indicate that the effect is state-specific rather than a byproduct of any activation change. Recovery is stronger in persistent and low-state regimes and weaker in ambiguous-emission and six-state ones, which the paper reports directly.

This gives a reproducible reference setup with exact Bayesian beliefs and forced-state targets, something prior interpretability work on real models often lacks. The central patching result holds up against the listed controls, so the benchmark looks usable for testing state-dynamics claims. The main limitation is that everything stays inside synthetic HMMs, so the paper does not yet address whether the same extraction works on natural language or larger models. The correspondence between centroids and true latents is demonstrated empirically rather than assumed by construction, and the regime dependence is already flagged.

The paper deserves a serious referee. Groups doing mechanistic interpretability or activation-patching benchmarks, and anyone building controlled tests for sequence-model internals, would get concrete value from the framework and the reported numbers. I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces Markovian Circuit Tracing (MCT), a diagnostic pipeline for testing whether transformer residual activations encode coarse state-transition structure. It uses synthetic HMM tasks with exact ground truth (latent states, transitions, Bayesian beliefs, and counterfactual targets) across six HMM families and three seeds. Tiny causal transformers achieve near-Bayes performance with mean excess loss 0.0138. State centroids extracted from activations recover transition signal (strongest in persistent and low-state regimes), and centroid patching reduces KL to exact HMM counterfactuals from 0.1957 to 0.0532 on average, outperforming wrong-state, mean-activation, random-activation, and shuffled-label controls. The contribution is a controlled benchmark and reference pipeline for transformer state-dynamics interpretability.

Significance. If the results hold, the work supplies a valuable, fully controlled benchmark with known ground truth for evaluating state abstraction and causal intervention methods in transformers. The regime-dependent findings, multiple controls, and concrete quantitative metrics (excess loss, KL reductions) make it a useful reference pipeline for interpretability research. The synthetic setup enables precise falsifiable tests that are difficult to obtain in natural data.

major comments (2)

[Methods] The manuscript reports aggregate statistics (excess loss 0.0138, KL from 0.1957 to 0.0532) but does not include error bars, per-seed or per-family variances, or explicit dataset-generation parameters. These details are load-bearing for assessing whether the state-recovery and patching effects are robust across the six HMM regimes and three seeds.
[Results, patching experiment] The claim that recovered-state centroids correspond to true latent states in a causally intervenable way rests on the observed KL reduction and controls. Direct quantitative comparison between the patched model's output distribution and the exact Bayesian belief vector (rather than only the final counterfactual target) would more tightly link the centroid to the HMM state.

minor comments (1)

[Abstract] The abstract and results sections use 'coarse state-transition structure' without a precise definition or metric for 'coarseness'; a short clarifying sentence would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and recommendation for minor revision. The comments highlight important aspects for improving the clarity and robustness of our results, which we address below.

Referee: [Methods] The manuscript reports aggregate statistics (excess loss 0.0138, KL from 0.1957 to 0.0532) but does not include error bars, per-seed or per-family variances, or explicit dataset-generation parameters. These details are load-bearing for assessing whether the state-recovery and patching effects are robust across the six HMM regimes and three seeds.

Authors: We agree with the referee that including measures of variance and explicit generation parameters is important for evaluating robustness. In the revised version, we will add error bars to all reported aggregate statistics, include a breakdown of results per HMM family and seed, and provide the full dataset-generation parameters (transition matrices, emission distributions, sequence lengths, and random seeds) in the Methods section and a new appendix table. revision: yes
Referee: [Results] The claim that recovered-state centroids correspond to true latent states in a causally intervenable way rests on the observed KL reduction and controls. Direct quantitative comparison between the patched model's output distribution and the exact Bayesian belief vector (rather than only the final counterfactual target) would more tightly link the centroid to the HMM state.

Authors: This is a valuable suggestion for strengthening the causal link. While the counterfactual target is the primary metric for assessing intervention success, we will incorporate an additional analysis comparing the patched output distributions to the full Bayesian belief vectors. This will be added to the supplementary results in the revision, providing a more direct quantification of how well the centroids recover the state beliefs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with external HMM ground truth

The paper introduces an empirical diagnostic pipeline (MCT) that trains small transformers on synthetic HMM tasks with fully known latent states, transitions, and counterfactual targets. It then extracts residual activations, computes centroids as state abstractions, performs patching interventions, and reports KL reductions to exact HMM counterfactuals while comparing against wrong-state, mean-activation, random, and shuffled controls. All load-bearing claims are direct measurements against this external ground truth rather than any derivation, fit, or self-citation that reduces to the inputs by construction. No equations or steps are shown that equate a prediction to a fitted parameter or rename a known result; the evaluation framework is self-contained against the provided HMM benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are detailed beyond the use of synthetic HMM tasks and small causal transformers.

pith-pipeline@v0.9.0 · 5729 in / 1176 out tokens · 37428 ms · 2026-05-21T06:02:11.654771+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear

Paper passage: "Patching a recovered-state centroid reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, beating wrong-state, mean-activation, random-activation, and shuffled-label controls."

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear

Paper passage: "Residual activations contain partial Bayesian belief information... State abstractions extracted from these activations recover coarse transition signal"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 3 internal anchors

[1]

The Annals of Mathematical Statistics , volume=

Statistical inference for probabilistic functions of finite state Markov chains , author=. The Annals of Mathematical Statistics , volume=

work page
[2]

Proceedings of the IEEE , volume=

A tutorial on hidden Markov models and selected applications in speech recognition , author=. Proceedings of the IEEE , volume=

work page
[3]

Advances in Neural Information Processing Systems , year=

Attention is All You Need , author=. Advances in Neural Information Processing Systems , year=

work page
[4]

Distill , volume=

Zoom In An Introduction to Circuits , author=. Distill , volume=

work page
[5]

2021 , howpublished=

A Mathematical Framework for Transformer Circuits , author=. 2021 , howpublished=

work page 2021
[6]

2022 , howpublished=

Toy Models of Superposition , author=. 2022 , howpublished=

work page 2022
[7]

Advances in Neural Information Processing Systems , volume=

Locating and Editing Factual Associations in GPT , author=. Advances in Neural Information Processing Systems , volume=

work page
[8]

Advances in Neural Information Processing Systems , year=

Towards Automated Circuit Discovery for Mechanistic Interpretability , author=. Advances in Neural Information Processing Systems , year=

work page
[9]

2022 , eprint=

Causal Scrubbing A Method for Rigorously Testing Interpretability Hypotheses , author=. 2022 , eprint=

work page 2022
[10]

2023 , eprint=

Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=

work page 2023
[11]

2023 , eprint=

Towards Monosemanticity Decomposing Language Models With Dictionary Learning , author=. 2023 , eprint=

work page 2023
[12]

Vision Research , volume=

Sparse coding with an overcomplete basis set A strategy employed by V1? , author=. Vision Research , volume=

work page
[13]

2023 , eprint=

Progress Measures for Grokking via Mechanistic Interpretability , author=. 2023 , eprint=

work page 2023
[14]

Advances in Neural Information Processing Systems , year=

Causal Abstractions of Neural Networks , author=. Advances in Neural Information Processing Systems , year=

work page
[16]

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing , pages=

Designing and Interpreting Probes with Control Tasks , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2019
[17]

Computational Linguistics , volume=

Probing classifiers Promises, shortcomings, and advances , author=. Computational Linguistics , volume=

work page
[18]

Neural Computation , volume=

Finite state automata and simple recurrent networks , author=. Neural Computation , volume=

work page
[19]

Connectionist Models Summer School , year=

A focus on recurrent networks , author=. Connectionist Models Summer School , year=

work page
[20]

Neural Networks , volume=

Extraction of rules from discrete-time recurrent neural networks , author=. Neural Networks , volume=

work page
[21]

International Conference on Learning Representations , year=

Practical finite state automata extraction from recurrent neural networks , author=. International Conference on Learning Representations , year=

work page
[22]

International Conference on Machine Learning Workshop on Understanding and Improving Generalization in Deep Learning , year=

Interpreting recurrent neural networks behavior via excitable network attractors , author=. International Conference on Machine Learning Workshop on Understanding and Improving Generalization in Deep Learning , year=

work page
[23]

Neural Computation , volume=

Observable operator models for discrete stochastic time series , author=. Neural Computation , volume=

work page
[24]

Advances in Neural Information Processing Systems , year=

Predictive representations of state , author=. Advances in Neural Information Processing Systems , year=

work page
[25]

The International Journal of Robotics Research , volume=

Closing the learning-planning loop with predictive state representations , author=. The International Journal of Robotics Research , volume=

work page
[26]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[27]

Baum and Ted Petrie

Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554--1563, 1966

work page 1966
[28]

Probing classifiers: Promises, shortcomings, and advances

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207--219, 2022

work page 2022
[29]

Byron Boots, Sajid Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954--966, 2011

work page 2011
[30]

Towards monosemanticity: Decomposing language models with dictionary learning, 2023

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Shan Carter, and Chris Olah. Towards monosemanticity: Decomposing language models with dictionary learning, 2023. arXiv:2310.01881

work page arXiv 2023
[31]

Causal scrubbing: A method for rigorously testing interpretability hypotheses, 2022

Lawrence Chan, Adria Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Elena Nitishinskaya, Ansh Radhakrishnan, and Buck Shlegeris. Causal scrubbing: A method for rigorously testing interpretability hypotheses, 2022. arXiv:2212.06861

work page arXiv 2022
[32]

McClelland

Axel Cleeremans, David Servan-Schreiber, and James L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372--381, 1989

work page 1989
[33]

Towards automated circuit discovery for mechanistic interpretability

Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adria Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems, 2023

work page 2023
[34]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. arXiv:2309.08600

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

A mathematical framework for transformer circuits, 2021

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits, 2021. Transformer Circuits Thread

work page 2021
[36]

Toy models of superposition, 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition, 2022. Transformer Circuits Thread

work page 2022
[37]

Causal abstractions of neural networks

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, 2021

work page 2021
[38]

Designing and interpreting probes with control tasks

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 2733--2743, 2019

work page 2019
[39]

Observable operator models for discrete stochastic time series

Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371--1398, 2000

work page 2000
[40]

Littman, Richard S

Michael L. Littman, Richard S. Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems, 2001

work page 2001
[41]

Locating and editing factual associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 35:17359--17372, 2022

work page 2022
[42]

Joshua Michalenko, Ameesh Shah, Abhinav Verma, Richard Baraniuk, Suraj Chaudhuri, and Ankit B. Patel. Interpreting recurrent neural networks behavior via excitable network attractors. In ICML Workshop on Understanding and Improving Generalization in Deep Learning, 2019

work page 2019
[43]

Michael C. Mozer. A focus on recurrent networks. In Connectionist Models Summer School, 1989

work page 1989
[44]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. arXiv:2301.05217

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Zoom in: An introduction to circuits

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3), 2020

work page 2020
[46]

Omlin and C

Christian W. Omlin and C. Lee Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41--52, 1996

work page 1996
[47]

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257--286, 1989

work page 1989
[48]

Practical finite state automata extraction from recurrent neural networks

Gail Weiss, Yoav Goldberg, and Eran Yahav. Practical finite state automata extraction from recurrent neural networks. In International Conference on Learning Representations, 2018

work page 2018

[1] [1]

The Annals of Mathematical Statistics , volume=

Statistical inference for probabilistic functions of finite state Markov chains , author=. The Annals of Mathematical Statistics , volume=

work page

[2] [2]

Proceedings of the IEEE , volume=

A tutorial on hidden Markov models and selected applications in speech recognition , author=. Proceedings of the IEEE , volume=

work page

[3] [3]

Advances in Neural Information Processing Systems , year=

Attention is All You Need , author=. Advances in Neural Information Processing Systems , year=

work page

[4] [4]

Distill , volume=

Zoom In An Introduction to Circuits , author=. Distill , volume=

work page

[5] [5]

2021 , howpublished=

A Mathematical Framework for Transformer Circuits , author=. 2021 , howpublished=

work page 2021

[6] [6]

2022 , howpublished=

Toy Models of Superposition , author=. 2022 , howpublished=

work page 2022

[7] [7]

Advances in Neural Information Processing Systems , volume=

Locating and Editing Factual Associations in GPT , author=. Advances in Neural Information Processing Systems , volume=

work page

[8] [8]

Advances in Neural Information Processing Systems , year=

Towards Automated Circuit Discovery for Mechanistic Interpretability , author=. Advances in Neural Information Processing Systems , year=

work page

[9] [9]

2022 , eprint=

Causal Scrubbing A Method for Rigorously Testing Interpretability Hypotheses , author=. 2022 , eprint=

work page 2022

[10] [10]

2023 , eprint=

Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=

work page 2023

[11] [11]

2023 , eprint=

Towards Monosemanticity Decomposing Language Models With Dictionary Learning , author=. 2023 , eprint=

work page 2023

[12] [12]

Vision Research , volume=

Sparse coding with an overcomplete basis set A strategy employed by V1? , author=. Vision Research , volume=

work page

[13] [13]

2023 , eprint=

Progress Measures for Grokking via Mechanistic Interpretability , author=. 2023 , eprint=

work page 2023

[14] [14]

Advances in Neural Information Processing Systems , year=

Causal Abstractions of Neural Networks , author=. Advances in Neural Information Processing Systems , year=

work page

[15] [16]

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing , pages=

Designing and Interpreting Probes with Control Tasks , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2019

[16] [17]

Computational Linguistics , volume=

Probing classifiers Promises, shortcomings, and advances , author=. Computational Linguistics , volume=

work page

[17] [18]

Neural Computation , volume=

Finite state automata and simple recurrent networks , author=. Neural Computation , volume=

work page

[18] [19]

Connectionist Models Summer School , year=

A focus on recurrent networks , author=. Connectionist Models Summer School , year=

work page

[19] [20]

Neural Networks , volume=

Extraction of rules from discrete-time recurrent neural networks , author=. Neural Networks , volume=

work page

[20] [21]

International Conference on Learning Representations , year=

Practical finite state automata extraction from recurrent neural networks , author=. International Conference on Learning Representations , year=

work page

[21] [22]

International Conference on Machine Learning Workshop on Understanding and Improving Generalization in Deep Learning , year=

Interpreting recurrent neural networks behavior via excitable network attractors , author=. International Conference on Machine Learning Workshop on Understanding and Improving Generalization in Deep Learning , year=

work page

[22] [23]

Neural Computation , volume=

Observable operator models for discrete stochastic time series , author=. Neural Computation , volume=

work page

[23] [24]

Advances in Neural Information Processing Systems , year=

Predictive representations of state , author=. Advances in Neural Information Processing Systems , year=

work page

[24] [25]

The International Journal of Robotics Research , volume=

Closing the learning-planning loop with predictive state representations , author=. The International Journal of Robotics Research , volume=

work page

[25] [26]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[26] [27]

Baum and Ted Petrie

Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554--1563, 1966

work page 1966

[27] [28]

Probing classifiers: Promises, shortcomings, and advances

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207--219, 2022

work page 2022

[28] [29]

Byron Boots, Sajid Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954--966, 2011

work page 2011

[29] [30]

Towards monosemanticity: Decomposing language models with dictionary learning, 2023

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Shan Carter, and Chris Olah. Towards monosemanticity: Decomposing language models with dictionary learning, 2023. arXiv:2310.01881

work page arXiv 2023

[30] [31]

Causal scrubbing: A method for rigorously testing interpretability hypotheses, 2022

Lawrence Chan, Adria Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Elena Nitishinskaya, Ansh Radhakrishnan, and Buck Shlegeris. Causal scrubbing: A method for rigorously testing interpretability hypotheses, 2022. arXiv:2212.06861

work page arXiv 2022

[31] [32]

McClelland

Axel Cleeremans, David Servan-Schreiber, and James L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372--381, 1989

[32] [33]

Towards automated circuit discovery for mechanistic interpretability

Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adria Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems, 2023

[33] [34]

Sparse autoencoders find highly interpretable features in language models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023. arXiv:2309.08600

[34] [35]

A mathematical framework for transformer circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits, 2021. Transformer Circuits Thread

[35] [36]

Toy models of superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition, 2022. Transformer Circuits Thread

[36] [37]

Causal abstractions of neural networks

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, 2021

[37] [38]

Designing and interpreting probes with control tasks

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 2733--2743, 2019

[38] [39]

Observable operator models for discrete stochastic time series

Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371--1398, 2000

[39] [40]

Predictive representations of state

Michael L. Littman, Richard S. Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems, 2001

[40] [41]

Locating and editing factual associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 35:17359--17372, 2022

[41] [42]

Joshua Michalenko, Ameesh Shah, Abhinav Verma, Richard Baraniuk, Suraj Chaudhuri, and Ankit B. Patel. Interpreting recurrent neural networks behavior via excitable network attractors. In ICML Workshop on Understanding and Improving Generalization in Deep Learning, 2019

[42] [43]

Michael C. Mozer. A focus on recurrent networks. In Connectionist Models Summer School, 1989

[43] [44]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. arXiv:2301.05217

[44] [45]

Zoom in: An introduction to circuits

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3), 2020

[45] [46]

Extraction of rules from discrete-time recurrent neural networks

Christian W. Omlin and C. Lee Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41--52, 1996

[46] [47]

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257--286, 1989

[47] [48]

Practical finite state automata extraction from recurrent neural networks

Gail Weiss, Yoav Goldberg, and Eran Yahav. Practical finite state automata extraction from recurrent neural networks. In International Conference on Learning Representations, 2018
