arxiv: 2605.07755 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.CL

Rethinking State Tracking in Recurrent Models Through Error Control Dynamics

Jiwan Chung , Heechan Choi , Seon Joo Kim This is my paper

Pith reviewed 2026-05-11 03:12 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords state trackingrecurrent networkserror controlstate-space modelslinear attentionaffine recurrent networksfinite horizon tracking

0 comments

The pith

Affine recurrent networks cannot correct errors along state-separating subspaces once they preserve state representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shifts the analysis of recurrent models from expressive capacity for symbolic rules to the dynamics of error control in hidden states. It proves that affine recurrent networks, which include state-space models and linear attention, are unable to correct errors in the directions that distinguish different states if they maintain their state representations. This forces such models to rely on finite-horizon tracking where accumulated errors eventually overwhelm the initial separation between states. Experiments on group state-tracking tasks confirm that the breakdown occurs at a predictable distinguishability ratio, and this ratio accurately forecasts when decoder accuracy collapses.

Core claim

We prove that affine recurrent networks cannot correct errors along state-separating subspaces once they preserve state representations. Consequently, practical affine trackers do not learn robust state tracking; rather, they learn finite horizon solutions governed by accumulated state-relevant error. We characterize the mechanics of this failure, showing that tracking remains readable only while the accumulating within-class spread remains small relative to the initial between-class separation.

What carries the argument

The no-correction result for affine recurrent networks that preserve state representations, which establishes that drift along state-distinguishing directions cannot be corrected.

Load-bearing premise

The networks preserve state representations as the precondition under which the no-correction result holds.

What would settle it

An empirical demonstration that an affine recurrent network corrects errors along state-separating subspaces while still preserving state representations would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.07755 by Heechan Choi, Jiwan Chung, Seon Joo Kim.

**Figure 1.** Figure 1: Perturbation recovery after noise injection on S3. Top: error trajectories in 2D PCA spaces. Bottom: normalized error magnitude ∥et∥/∥e0∥ over time. Affine models show either global decay (Mamba, Mamba-3, and Negative Mamba) or expansion (Token-gated RNN), while state-dependent models (tanh RNN and State-gated RNN) show strong error contraction. at which test accuracy remains above 90%. For each model and … view at source ↗

**Figure 2.** Figure 2: Distinguishability ratio q(t) over rollouts on S3. Top: q(t) = R(t)/M(t) from Section 3.2, with the dashed line marking the nearest-centroid bound q(t) = 1/2. Bottom: R(t) on log-log axes. Gray curves show latent-space counterparts. Vertical dashes mark each model’s mp from [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Subspace decomposition of within-class spread. For each architecture, we decompose within-class spread into the symbolic-subspace component qU (t) = rerr,U (t)/rsep(t) (color) and the orthogonal component qU⊥ (t) = rerr,U⊥ (t)/rsep(t) (gray), both on a log scale. Vertical dashes mark each model’s mp from [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Readability collapse coincides with downstream failure. On S3, the distinguishability ratio qt is plotted against failure-normalized time t/mp. The dotted reference at qt = 1 marks where within-class spread equals the smallest inter-class margin. Pearson r = 0.87 on log Tcross vs log mp. The trained readout fails later than the nearest-centroid bound suggests: although qt < 0.5 is sufficient for nearest-ce… view at source ↗

read the original abstract

The theory of state tracking in recurrent architectures has predominantly focused on expressive capacity: whether a fixed architecture can theoretically realize a set of symbolic transition rules. We argue that equally important is error control, the dynamics governing hidden-state drift along the directions that distinguish symbolic states. We prove that affine recurrent networks, a class of models encompassing State-Space Models and Linear Attention, cannot correct errors along state-separating subspaces once they preserve state representations. Consequently, practical affine trackers do not learn robust state tracking; rather, they learn finite horizon solutions governed by accumulated state-relevant error. We characterize the mechanics of this failure, showing that tracking remains readable only while the accumulating within-class spread remains small relative to the initial between-class separation. We demonstrate empirically on group state-tracking tasks that this breakdown is predictable: tracking collapses when the distinguishability ratio crosses the readability threshold of the trained decoder. Across trained models, the point of this crossing predicts the horizon at which downstream accuracy fails. These results establish that robust state tracking is determined not only by an architecture's theoretical expressivity but crucially by its error control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proves affine recurrent nets cannot correct errors in state-separating subspaces under a preservation precondition, and shows this predicts finite-horizon collapse on group tasks, but the empirical claim hinges on whether trained models actually preserve representations.

read the letter

The key point is that affine recurrent models, including SSMs and linear attention, cannot fix accumulated drift along directions that separate symbolic states once they keep those representations intact. This leads to the claim that practical versions only solve finite-horizon tracking before error buildup makes states unreadable to the decoder. The proof itself looks independent of fitted parameters and focuses on the dynamics of error control rather than pure expressivity, which is a useful reframing. Empirically, they track a distinguishability ratio on group state-tracking tasks and show that crossing a decoder-specific readability threshold lines up with the drop in downstream accuracy, giving a concrete way to forecast the horizon. That predictive link is the strongest part of the work and could be checked on other sequence problems. The main soft spot is the preservation precondition. The theory only rules out correction when representations are preserved; if training produces maps that allow within-class drift while still permitting decoding, the no-correction result does not constrain the observed failure and the finite-horizon explanation loses its mechanistic grounding. The abstract and stress-test note give no clear sign that preservation was verified in the trained models, so the jump from conditional proof to practical conclusion feels under-supported. The readability threshold coming from the model's own hidden-state statistics also invites a circularity worry, though it may be minor if the threshold is fixed before measuring collapse. This paper is for people working on recurrent architectures for long-horizon reasoning or state tracking. It deserves peer review because the theoretical angle is new and the empirical prediction is falsifiable, even if the preservation check and threshold details need tightening in revision.

Referee Report

2 major / 1 minor

Summary. The paper claims that error control dynamics, rather than just expressive capacity, determine state tracking performance in recurrent models. It proves that affine recurrent networks (encompassing SSMs and linear attention) cannot correct errors along state-separating subspaces once they preserve state representations, implying that practical models learn only finite-horizon solutions governed by accumulated within-class error. The authors characterize failure via a distinguishability ratio (accumulated within-class spread relative to initial between-class separation) crossing a decoder readability threshold, and empirically show on group state-tracking tasks that this crossing predicts the horizon of downstream accuracy collapse.

Significance. If the proof is sound and the preservation precondition holds for trained models, the work offers a mechanistic account of why affine recurrent architectures exhibit finite-horizon state tracking despite theoretical expressivity, shifting emphasis toward error-control properties. The empirical predictability of breakdown across models is a concrete strength that could inform architecture design. However, significance is limited by the unverified link between the theoretical precondition and the trained networks studied.

major comments (2)

[theoretical proof and experimental setup] The central no-correction theorem (abstract and theoretical development) is conditioned on the precondition that affine networks preserve state representations. The manuscript provides no verification or enforcement that the trained models in the group state-tracking experiments satisfy this property; if training produces non-preserving maps (e.g., within-class drift still decodable downstream), the theoretical prohibition does not constrain or explain the observed collapse.
[empirical results on group state-tracking tasks] The distinguishability ratio and readability threshold used to predict breakdown (empirical validation) are both derived from the model's own hidden-state statistics and trained decoder. This creates a circularity risk: the threshold is not an independent property but fitted to the same representations whose drift is being measured, weakening the claim that the crossing 'predicts' the failure horizon independently of post-hoc fitting.

minor comments (1)

[abstract and introduction] The abstract and introduction would benefit from explicit definitions or forward references for 'state-separating subspaces,' 'distinguishability ratio,' and 'readability threshold' to improve accessibility before the technical sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the link between our theoretical results and empirical findings. We address each major comment below and will revise the manuscript to strengthen the connection between the no-correction theorem and the group state-tracking experiments.

read point-by-point responses

Referee: [theoretical proof and experimental setup] The central no-correction theorem (abstract and theoretical development) is conditioned on the precondition that affine networks preserve state representations. The manuscript provides no verification or enforcement that the trained models in the group state-tracking experiments satisfy this property; if training produces non-preserving maps (e.g., within-class drift still decodable downstream), the theoretical prohibition does not constrain or explain the observed collapse.

Authors: We agree that explicit verification of the state-representation preservation precondition is necessary for the theorem to explain the observed collapse in trained models. In the revised manuscript, we will add a dedicated analysis subsection that measures preservation in the trained affine networks on the group state-tracking tasks. This will include tracking linear separability of symbolic states (via between-class margins) and confirming that within-class drift remains small relative to initial separation until the distinguishability ratio threshold is crossed. If any trained models violate preservation, we will report this and discuss its implications for the finite-horizon interpretation. revision: yes
Referee: [empirical results on group state-tracking tasks] The distinguishability ratio and readability threshold used to predict breakdown (empirical validation) are both derived from the model's own hidden-state statistics and trained decoder. This creates a circularity risk: the threshold is not an independent property but fitted to the same representations whose drift is being measured, weakening the claim that the crossing 'predicts' the failure horizon independently of post-hoc fitting.

Authors: We acknowledge the potential circularity arising from using the same trained decoder for both the readability threshold and the hidden-state statistics. In revision, we will replace the threshold with an independent measure: a fixed readability threshold derived from initial between-class separation (before any error accumulation) and a cross-validated linear probe trained on held-out sequences. We will then re-evaluate the correlation between distinguishability-ratio crossings and accuracy collapse under these independent thresholds, reporting results to demonstrate that the predictive relationship holds without post-hoc fitting to the full dataset. revision: yes

Circularity Check

1 steps flagged

Empirical prediction of tracking horizon reduces to model-derived distinguishability crossing by construction

specific steps

fitted input called prediction [Abstract (empirical demonstration paragraph)]
"We demonstrate empirically on group state-tracking tasks that this breakdown is predictable: tracking collapses when the distinguishability ratio crosses the readability threshold of the trained decoder. Across trained models, the point of this crossing predicts the horizon at which downstream accuracy fails."

The distinguishability ratio is defined from the model's own hidden-state statistics (accumulating within-class spread relative to initial between-class separation), and the readability threshold is taken from the trained decoder's properties on those states. Declaring that the crossing 'predicts' the failure horizon therefore equates the prediction to the point at which the decoder can no longer distinguish the states by construction, rather than providing an external validation of the error-control theory.

full rationale

The mathematical proof that affine networks cannot correct errors under state preservation is a conditional first-principles result with no evident self-reference or data fitting. However, the central empirical claim—that the crossing of the distinguishability ratio with the decoder's readability threshold predicts the accuracy failure horizon—uses quantities extracted directly from the trained model's hidden-state statistics and decoder performance. This makes the 'prediction' a restatement of when the states become unreadable to the same decoder rather than an independent test, constituting moderate fitted-input-called-prediction circularity while leaving the theoretical core intact.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

The central claim rests on linear-algebraic definitions of state-separating subspaces and the domain assumption that affine networks preserve representations; the distinguishability ratio appears to be a derived quantity whose exact construction is not detailed in the abstract.

free parameters (1)

readability threshold of the trained decoder
The point at which accumulated within-class spread renders states unreadable; its value is determined from model outputs rather than fixed a priori.

axioms (2)

domain assumption Affine recurrent networks preserve state representations under the conditions considered.
Invoked as the precondition for the no-error-correction result in the proof.
domain assumption Error accumulation occurs along state-separating subspaces in a manner governed by affine dynamics.
Underlies the characterization of finite-horizon tracking.

invented entities (1)

state-separating subspaces no independent evidence
purpose: Directions in hidden-state space along which symbolic states are distinguished and along which error drift is analyzed.
Core construct for the error-control analysis; no independent falsifiable evidence outside the framework is provided in the abstract.

pith-pipeline@v0.9.0 · 5489 in / 1470 out tokens · 53740 ms · 2026-05-11T03:12:52.360907+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove that affine recurrent networks... cannot correct errors along state-separating subspaces once they preserve state representations. ... q(t) := R(t)/M(t) ... crosses the readability threshold
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean bare_distinguishability_of_absolute_floor unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1 (Affine neutrality on the symbolic subspace). ... As|U = I

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

Sorting out L ipschitz function approximation

Cem Anil, James Lucas, and Roger Grosse. Sorting out L ipschitz function approximation. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019

work page 2019
[2]

Learning phrase representations using rnn encoder-decoder for statistical machine translation

Kyunghyun Cho, Bart van Merri \"e nboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, pages 1724--1734, 2014

work page 2014
[3]

Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14 0 (2): 0 179--211, 1990

work page 1990
[4]

Unlocking state-tracking in linear rnns through negative eigenvalues

Riccardo Grazzi, Julien Siems, Arber Zela, Jorg KH Franke, Frank Hutter, and Massimiliano Pontil. Unlocking state-tracking in linear rnns through negative eigenvalues. In 13th International Conference on Learning Representations Iclr 2025, pages 1--33. ICLR, 2025

work page 2025
[5]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=tEYskw1VY2

work page 2024
[6]

Efficiently modeling long sequences with structured state spaces

Albert Gu, Karan Goel, and Christopher R \'e . Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations (ICLR), 2022

work page 2022
[7]

Diagonal state spaces are as effective as structured state spaces

Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. Advances in neural information processing systems, 35: 0 22982--22994, 2022

work page 2022
[8]

Long short-term memory

Sepp Hochreiter and J \"u rgen Schmidhuber. Long short-term memory. Neural computation, 9 0 (8): 0 1735--1780, 1997

work page 1997
[9]

Z., Fujii, N., and Graybiel, A

Arjun Karuvally, Franz Nowak, Anderson T. Keller, Carmen Amo Alonso, Terrence J. Sejnowski, and Hava T. Siegelmann. Bridging expressivity and scalability with adaptive unitary ssms. arXiv preprint arXiv:2507.05238, 2025

work page arXiv 2025
[10]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran c ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156--5165. PMLR, 2020

work page 2020
[11]

arXiv preprint arXiv:2603.15569 , year=

Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles. arXiv preprint arXiv:2603.15569, 2026

work page arXiv 2026
[12]

The illusion of state in state-space models

William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Researc...

work page 2024
[13]

Resurrecting recurrent neural networks for long sequences

Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International conference on machine learning, pages 26670--26698. PMLR, 2023

work page 2023
[14]

An introduction to the theory of groups

Joseph J Rotman. An introduction to the theory of groups. Springer Science & Business Media, 2012

work page 2012
[15]

The expressive capacity of state space models: A formal language perspective

Yash Sarrof, Yana Veitsman, and Michael Hahn. The expressive capacity of state space models: A formal language perspective. Advances in Neural Information Processing Systems, 37: 0 41202--41241, 2024

work page 2024
[16]

The expressive limits of diagonal SSMs for state-tracking

Mehran Shakerinava, Behnoush Khavari, Siamak Ravanbakhsh, and Sarath Chandar. The expressive limits of diagonal SSMs for state-tracking. In International Conference on Learning Representations (ICLR), 2026

work page 2026
[17]

Deltaproduct: Improving state-tracking in linear RNN s via householder products

Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Improving state-tracking in linear RNN s via householder products. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=SoRiaijTGr

work page 2025
[18]

On the expressiveness and length generalization of selective state space models on regular languages

Aleksandar Terzic, Michael Hersche, Giacomo Camposampiero, Thomas Hofmann, Abu Sebastian, and Abbas Rahimi. On the expressiveness and length generalization of selective state space models on regular languages. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20876--20884, 2025 a

work page 2025
[19]

Structured sparse transition matrices to enable state tracking in state-space models

Aleksandar Terzic, Nicolas Menet, Michael Hersche, Thomas Hofmann, and Abbas Rahimi. Structured sparse transition matrices to enable state tracking in state-space models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 b

work page 2025
[20]

Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In Advances in neural information processing systems, volume 37, pages 115491--115522, 2024

work page 2024