Markovian Circuit Tracing for Transformer State Dynamics
Pith reviewed 2026-05-21 06:02 UTC · model grok-4.3
The pith
Patching recovered-state centroids from residual activations steers transformer predictions toward exact hidden Markov model counterfactual distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Markovian Circuit Tracing recovers coarse state abstractions from residual activations in tiny causal transformers trained on synthetic HMM families. When these recovered-state centroids are patched into the model, the KL divergence to the exact HMM counterfactual target drops from 0.1957 in the unpatched case to 0.0532 on average, outperforming wrong-state, mean-activation, random-activation, and shuffled-label controls across persistent, lower-state, ambiguous-emission, and six-state regimes.
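To put the headline numbers in proportion, the sketch below computes a KL divergence between two categorical distributions and the relative reduction implied by the reported averages (0.1957 unpatched, 0.0532 patched). The function names, the use of nats, and the direction of the divergence (target relative to model output) are assumptions; this summary does not say which convention the paper uses.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two categorical distributions given as lists.

    Assumption: p is the target distribution and q is the model output.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def relative_reduction(before, after):
    """Fraction of the original divergence removed by an intervention."""
    return (before - after) / before

# Averages reported in the summary above.
unpatched_kl = 0.1957
patched_kl = 0.0532
reduction = relative_reduction(unpatched_kl, patched_kl)
print(f"relative KL reduction: {reduction:.1%}")
```

On the reported averages this comes to a reduction of roughly 73 percent, which is an average over regimes and says nothing about the spread between them.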
What carries the argument
Markovian Circuit Tracing (MCT), a diagnostic pipeline that extracts state centroids from residual activations and performs causal state-forcing interventions on synthetic HMM benchmarks with known latent states and counterfactual targets.
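A minimal sketch of what centroid extraction and state forcing could look like, assuming each token position already has a recovered state label and a residual activation vector. How MCT assigns those labels, which layer it reads, and how it applies the patch are not described in this summary, so `state_centroids` and `patch_position` are hypothetical helpers and the data is a toy example.

```python
from collections import defaultdict

def state_centroids(activations, state_labels):
    """Mean activation vector per state label.

    activations: list of equal-length float lists, one per token position.
    state_labels: list of hashable labels, same length as activations.
    """
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(activations, state_labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def patch_position(activations, position, centroid):
    """Return a copy of the activations with one position overwritten by a centroid."""
    patched = [list(vec) for vec in activations]
    patched[position] = list(centroid)
    return patched

# Toy data, purely illustrative (not taken from the paper).
acts = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]
centroids = state_centroids(acts, labels)
forced = patch_position(acts, position=0, centroid=centroids[1])
print(centroids)
print(forced[0])
```

In a real model the patched activations would be fed through the remaining layers, and the resulting output distribution would be compared with the HMM counterfactual target.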
If this is right
- State forcing via centroid patching supplies a direct causal test of whether a transformer maintains coarse transition structure rather than computing only locally.
- The same pipeline can compare state-recovery strength across HMM regimes that differ in persistence, ambiguity, and number of states.
- Transformers that pass the benchmark encode enough Bayesian belief information in residual streams to support forced-state interventions.
- The controlled synthetic tasks with exact Bayes-optimal targets and counterfactuals provide a reference for future state-dynamics evaluations.
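The exact targets referred to in the bullets above follow from standard HMM filtering. The sketch below is one way to compute them, assuming row-stochastic matrices `T[i][j] = P(z_next = j | z = i)` and `E[i][o] = P(x = o | z = i)`. Defining the forced-state counterfactual as the next-token distribution from a one-hot belief is an assumption here, since the summary does not spell out the paper's definition, and the two-state HMM is hypothetical.

```python
def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def filter_beliefs(pi, T, E, observations):
    """Forward filtering: posterior over the latent state after each observation."""
    n = len(pi)
    beliefs = []
    prior = list(pi)
    for obs in observations:
        # Condition the predicted state distribution on the observed symbol.
        posterior = normalize([prior[j] * E[j][obs] for j in range(n)])
        beliefs.append(posterior)
        # Propagate one step through the transition matrix.
        prior = [sum(posterior[i] * T[i][j] for i in range(n)) for j in range(n)]
    return beliefs

def next_token_distribution(belief, T, E):
    """Bayes-optimal next-token distribution given the current belief."""
    n, k = len(belief), len(E[0])
    next_state = [sum(belief[i] * T[i][j] for i in range(n)) for j in range(n)]
    return [sum(next_state[j] * E[j][o] for j in range(n)) for o in range(k)]

def forced_state_target(state, T, E):
    """Assumed counterfactual target: prediction when the belief is forced to one state."""
    one_hot = [1.0 if i == state else 0.0 for i in range(len(T))]
    return next_token_distribution(one_hot, T, E)

# Hypothetical two-state, two-symbol persistent HMM.
pi = [0.5, 0.5]
T = [[0.9, 0.1], [0.1, 0.9]]
E = [[0.8, 0.2], [0.2, 0.8]]
beliefs = filter_beliefs(pi, T, E, [0, 0, 1])
print(beliefs)
print(forced_state_target(0, T, E))
```

With these toy parameters, forcing state 0 gives a next-token distribution of about (0.74, 0.26), which is the kind of target a patched model's output would be scored against.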
Where Pith is reading between the lines
- The method could be extended to probe whether larger language models internally track latent states when predicting over natural language sequences.
- If the centroid correspondence holds more generally, it would imply that some transformer computations can be reframed as approximate movement through discrete state spaces.
- Further tests might examine whether multi-layer patching recovers finer-grained or hierarchical state information beyond the coarse centroids used here.
Load-bearing premise
Centroids extracted from residual activations correspond to the true latent states of the underlying HMM in a way that allows causal intervention via patching to produce the expected counterfactual output distribution.
What would settle it
A replication in which patching the recovered-state centroid fails to reduce KL divergence to the HMM counterfactual target below the levels achieved by wrong-state or random-activation controls.
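The falsification condition can be written as a one-line check. In the sketch below the control values are placeholders chosen only to show the comparison; the summary reports no per-control KL figures, and `patching_beats_controls` is a hypothetical helper.

```python
def patching_beats_controls(recovered_kl, control_kls):
    """True if recovered-state patching has strictly lower KL than every control."""
    return all(recovered_kl < kl for kl in control_kls.values())

# Placeholder control values for illustration only (not from the paper).
controls = {
    "wrong_state": 0.30,
    "mean_activation": 0.15,
    "random_activation": 0.25,
    "shuffled_label": 0.18,
}
print(patching_beats_controls(0.0532, controls))
```

A replication would count against the claim if this check returned False for the wrong-state or random-activation controls.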
Original abstract
Many sequence computations are easier to study as movement through internal states than as isolated local circuits. We introduce Markovian Circuit Tracing (MCT), a diagnostic pipeline for testing whether transformer activations contain coarse state-transition structure. The benchmark uses synthetic Hidden Markov Model (HMM) tasks where latent states, transition matrices, Bayesian belief vectors, Bayes-optimal predictions, and forced-state counterfactual targets are known exactly. Across six HMM families and three seeds per family, tiny causal transformers learn near-Bayes next-token predictors, with mean excess loss over Bayes of 0.0138. Residual activations contain partial Bayesian belief information in this controlled synthetic benchmark. State abstractions extracted from these activations recover coarse transition signal, strongest in persistent and lower-state regimes, and weaker in ambiguous-emission and six-state regimes. The clearest result comes from state forcing. Patching a recovered-state centroid reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, beating wrong-state, mean-activation, random-activation, and shuffled-label controls. The contribution is a controlled benchmark and evaluation framework for transformer state-dynamics interpretability, with MCT as a simple reference pipeline.
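The "excess loss over Bayes" figure can be read as the gap between the model's next-token cross-entropy and that of the Bayes-optimal predictor on the same tokens. The sketch below illustrates that reading on toy distributions; averaging per token and measuring in nats are assumptions, and the numbers are not from the paper.

```python
import math

def cross_entropy(pred_dists, tokens):
    """Mean negative log-likelihood (nats) of observed tokens under predicted distributions."""
    return -sum(math.log(p[t]) for p, t in zip(pred_dists, tokens)) / len(tokens)

def excess_loss(model_dists, bayes_dists, tokens):
    """Model cross-entropy minus Bayes-optimal cross-entropy on the same tokens."""
    return cross_entropy(model_dists, tokens) - cross_entropy(bayes_dists, tokens)

# Toy predictions for two positions over a two-symbol vocabulary (illustrative only).
tokens = [0, 1]
bayes = [[0.8, 0.2], [0.3, 0.7]]
model = [[0.7, 0.3], [0.4, 0.6]]
print(excess_loss(model, bayes, tokens))
```

On a finite sample this gap can be slightly negative by chance; only in expectation is the Bayes predictor guaranteed to be at least as good.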
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Markovian Circuit Tracing (MCT), a diagnostic pipeline for testing whether transformer residual activations encode coarse state-transition structure. It uses synthetic HMM tasks with exact ground truth (latent states, transitions, Bayesian beliefs, and counterfactual targets) across six HMM families and three seeds. Tiny causal transformers achieve near-Bayes performance with mean excess loss 0.0138. State centroids extracted from activations recover transition signal (strongest in persistent and low-state regimes), and centroid patching reduces KL to exact HMM counterfactuals from 0.1957 to 0.0532 on average, outperforming wrong-state, mean-activation, random-activation, and shuffled-label controls. The contribution is a controlled benchmark and reference pipeline for transformer state-dynamics interpretability.
Significance. If the results hold, the work supplies a valuable, fully controlled benchmark with known ground truth for evaluating state abstraction and causal intervention methods in transformers. The regime-dependent findings, multiple controls, and concrete quantitative metrics (excess loss, KL reductions) make it a useful reference pipeline for interpretability research. The synthetic setup enables precise falsifiable tests that are difficult to obtain in natural data.
major comments (2)
- [Methods] Methods: The manuscript reports aggregate statistics (excess loss 0.0138, KL from 0.1957 to 0.0532) but does not include error bars, per-seed or per-family variances, or explicit dataset-generation parameters. These details are load-bearing for assessing whether the state-recovery and patching effects are robust across the six HMM regimes and three seeds.
- [Results] Results (patching experiment): The claim that recovered-state centroids correspond to true latent states in a causally intervenable way rests on the observed KL reduction and controls. Direct quantitative comparison between the patched model's output distribution and the exact Bayesian belief vector (rather than only the final counterfactual target) would more tightly link the centroid to the HMM state.
minor comments (1)
- [Abstract] The abstract and results sections use 'coarse state-transition structure' without a precise definition or metric for 'coarseness'; a short clarifying sentence would improve readability.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation and recommendation for minor revision. The comments highlight important aspects for improving the clarity and robustness of our results, which we address below.
- Referee: [Methods] The manuscript reports aggregate statistics (excess loss 0.0138, KL from 0.1957 to 0.0532) but does not include error bars, per-seed or per-family variances, or explicit dataset-generation parameters. These details are load-bearing for assessing whether the state-recovery and patching effects are robust across the six HMM regimes and three seeds.
  Authors: We agree with the referee that including measures of variance and explicit generation parameters is important for evaluating robustness. In the revised version, we will add error bars to all reported aggregate statistics, include a breakdown of results per HMM family and seed, and provide the full dataset-generation parameters (transition matrices, emission distributions, sequence lengths, and random seeds) in the Methods section and a new appendix table.
  Revision: yes
- Referee: [Results] The claim that recovered-state centroids correspond to true latent states in a causally intervenable way rests on the observed KL reduction and controls. Direct quantitative comparison between the patched model's output distribution and the exact Bayesian belief vector (rather than only the final counterfactual target) would more tightly link the centroid to the HMM state.
  Authors: This is a valuable suggestion for strengthening the causal link. While the counterfactual target is the primary metric for assessing intervention success, we will incorporate an additional analysis comparing the patched output distributions to the full Bayesian belief vectors. This will be added to the supplementary results in the revision, providing a more direct quantification of how well the centroids recover the state beliefs.
  Revision: yes
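The per-family breakdown with error bars requested in the Methods comment could be produced along these lines. This is a sketch under stated assumptions: the family names and metric values are placeholders, not results from the paper, and the sample standard deviation across three seeds is only one possible choice of error bar.

```python
from statistics import mean, stdev

def summarize(results):
    """Map {family: [metric per seed]} to {family: (mean, sample std across seeds)}."""
    return {family: (mean(vals), stdev(vals)) for family, vals in results.items()}

# Placeholder patched-KL values per seed (illustrative only, not from the paper).
results = {
    "persistent": [0.04, 0.05, 0.06],
    "six_state": [0.08, 0.10, 0.12],
}
for family, (avg, spread) in summarize(results).items():
    print(f"{family}: {avg:.4f} +/- {spread:.4f}")
```

With only three seeds per family the spread estimate is itself noisy, so reporting the individual per-seed values alongside it would be a reasonable complement.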
Circularity Check
No significant circularity; empirical benchmark with external HMM ground truth
The paper introduces an empirical diagnostic pipeline (MCT) that trains small transformers on synthetic HMM tasks with fully known latent states, transitions, and counterfactual targets. It then extracts residual activations, computes centroids as state abstractions, performs patching interventions, and reports KL reductions to exact HMM counterfactuals while comparing against wrong-state, mean-activation, random, and shuffled controls. All load-bearing claims are direct measurements against this external ground truth rather than any derivation, fit, or self-citation that reduces to the inputs by construction. No equations or steps are shown that equate a prediction to a fitted parameter or rename a known result; the evaluation framework is self-contained against the provided HMM benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Patching a recovered-state centroid reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, beating wrong-state, mean-activation, random-activation, and shuffled-label controls.
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Residual activations contain partial Bayesian belief information... State abstractions extracted from these activations recover coarse transition signal
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.