Pith · machine review for the scientific record

arxiv: 2605.00206 · v1 · submitted 2026-04-30 · 💻 cs.LG · cs.CL

Recognition: unknown

State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

Thea Aviss

Pith reviewed 2026-05-09 20:26 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: state stream transformer · latent space reasoning · nonlinear recurrence · transformer architecture · parallel training · reasoning benchmarks · GSM8K · GPQA-Diamond

The pith

Nonlinear recurrence with state streaming in transformers delivers a 15-point gain on an out-of-distribution reasoning benchmark from only a small amount of additional training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the State Stream Transformer V2 to keep a continuous latent state across positions instead of discarding it and rebuilding context each time. It does this by adding an FFN-driven nonlinear recurrence at every decoder layer and streaming the states horizontally across the sequence with a learned blend. A two-pass parallel training method removes the sequential bottleneck so the model can be trained efficiently. When this mechanism is co-trained into a 27B backbone using only a small GSM8K dataset, it produces a 15.15-point lift on GPQA-Diamond and cuts remaining GSM8K errors by 46 percent. A sympathetic reader would care because the gains appear to come from the architecture itself rather than from extra scale or data, suggesting a route to stronger reasoning inside existing models.
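
The abstract does not include reference code, so as orientation only, here is a minimal PyTorch-style sketch of the mechanism as described: an FFN-driven recurrence at each decoder layer whose state is streamed to the next position through a learned per-dimension blend α (initialised near 0.027, per Figure 2). The class name, the residual write-back, and the exact placement of the blend are our assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class StateStreamLayer(nn.Module):
        """Illustrative sketch of one decoder layer augmented with a state stream.

        h_t   : ordinary residual-stream activation at position t
        s_t   : streamed latent state carried horizontally from position t-1
        alpha : learned per-dimension blend coefficient (one vector per layer)
        """
        def __init__(self, d_model: int, d_ff: int, alpha_init: float = 0.027):
            super().__init__()
            # FFN that drives the nonlinear recurrence
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            self.alpha = nn.Parameter(torch.full((d_model,), alpha_init))

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, seq_len, d_model). Sequential form of the recurrence;
            # the paper instead trains it with a two-pass parallel procedure.
            batch, seq_len, d_model = h.shape
            s_prev = torch.zeros(batch, d_model, device=h.device, dtype=h.dtype)
            outputs = []
            for t in range(seq_len):
                blended = self.alpha * s_prev + (1.0 - self.alpha) * h[:, t]  # learned blend
                s_t = self.ffn(blended)                                       # nonlinear state update
                outputs.append(blended + s_t)  # assumed residual write-back (not specified in the abstract)
                s_prev = s_t                   # stream the state to the next position
            return torch.stack(outputs, dim=1)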

Core claim

The SST V2 enables parameter-efficient reasoning in continuous latent space through an FFN-driven nonlinear recurrence at each decoder layer, where latent states are streamed horizontally across the full sequence via a learned blend. This same mechanism supports continuous latent deliberation per position at inference time, dedicating additional FLOPs to exploring abstract reasoning before committing to a token. A two-pass parallel training procedure resolves the sequential dependency of the recurrence to allow compute-efficient training. Hidden state analysis shows the state stream facilitates reasoning through exploration of distinct semantic basins in continuous latent space, where transitions at content-dependent positions move the model into a substantially different Bayesian posterior, directly influencing the latent space at future positions.
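
The abstract does not spell out the two-pass procedure. One plausible reading, stated here purely as an assumption and not as the authors' algorithm, is that a first fully parallel pass produces provisional per-position states and a second pass applies the learned blend against those provisional states shifted by one position, replacing the sequential scan with a fixed-depth graph:

    import torch

    def two_pass_parallel_step(ffn, alpha, h):
        """Hypothetical two-pass approximation of the sequential recurrence.

        Pass 1 computes provisional states for every position in parallel,
        ignoring the horizontal dependency; pass 2 shifts those states right
        by one position and applies the learned blend, so each position sees
        an approximate predecessor state without a sequential scan.
        h: (batch, seq_len, d_model); alpha: (d_model,) blend coefficients.
        """
        s_provisional = ffn(h)                                    # pass 1: fully parallel
        zero = torch.zeros_like(s_provisional[:, :1])
        s_prev = torch.cat([zero, s_provisional[:, :-1]], dim=1)  # predecessor states, shifted right
        blended = alpha * s_prev + (1.0 - alpha) * h              # pass 2: learned blend
        return blended + ffn(blended)                             # assumed residual write-back

Whether the paper's actual schedule iterates this decomposition, or uses a different one entirely, cannot be determined from the abstract alone.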

What carries the argument

FFN-driven nonlinear recurrence with horizontal state streaming via a learned blend at each decoder layer
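
Written out in our notation (reconstructed from the abstract and the blend-coefficient figures; the paper's exact equations may differ), the load-bearing mechanism amounts to:

    \tilde{h}_{\ell,t} = \alpha_{\ell} \odot s_{\ell,t-1} + (1 - \alpha_{\ell}) \odot h_{\ell,t},
    \qquad
    s_{\ell,t} = \mathrm{FFN}_{\ell}\bigl(\tilde{h}_{\ell,t}\bigr)

Here h_{ℓ,t} is the ordinary residual activation at layer ℓ and position t, s_{ℓ,t} is the streamed state carried to position t+1, and α_ℓ ∈ R^5376 is the learned per-dimension blend coefficient vector (initialised near 0.027, per Figure 2).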

If this is right

  • The reasoning improvements are attributable to the architectural mechanism rather than scale or training data.
  • The design supports continuous latent deliberation per position at inference by dedicating extra FLOPs before token generation (a sketch of such a decoding loop follows this list).
  • State transitions at content-dependent positions move the model into substantially different Bayesian posteriors that influence future latent states.
  • The resulting 27B SST achieves higher accuracy on GPQA-Diamond than several larger open-weight and proprietary models.
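
For the second bullet, a hedged sketch of what per-position latent deliberation could look like at decode time, consistent with the iteration depths (iter = 1 to 4) evaluated in the figures. The names model.step, init_state, and eos_id are assumed interfaces, not the paper's API:

    def generate_with_deliberation(model, prompt_ids, max_new_tokens=256, num_iters=4):
        """Hypothetical decoding loop: iterate the latent recurrence num_iters
        times at each position before committing to a token (num_iters=1
        reduces to ordinary single-pass decoding)."""
        ids = list(prompt_ids)
        state = model.init_state()                      # streamed latent state (assumed API)
        for _ in range(max_new_tokens):
            for _ in range(num_iters):                  # extra FLOPs spent on latent deliberation
                logits, state = model.step(ids, state)  # re-run the recurrent update at this position
            next_id = int(logits.argmax())              # commit greedily only after deliberating
            ids.append(next_id)
            if next_id == model.eos_id:
                break
        return ids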

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The probe finding that the first-token latent state already predicts answer survival under further computation suggests a natural route to adaptive inference that spends more deliberation only when the early state is uncertain (a sketch follows this list).
  • Because large gains appear with only a small co-training set, the same recurrence could be grafted onto other existing backbones to improve reasoning without full retraining.
  • The parallel training procedure implies that similar state mechanisms could be added to other sequence models without incurring prohibitive sequential training costs.
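
For the first bullet, a hypothetical adaptive-compute rule built on the paper's halt-signal probe (Figure 21 reports MUST HALT vs SAFE class means of +3.66 and −3.72 around a threshold). The probe interface and the threshold of 0 here are illustrative assumptions, not the paper's procedure:

    def adaptive_iters(probe, first_token_state, max_iters=4, threshold=0.0):
        """Hypothetical policy: if the probe on the first generated token's latent
        state signals that further latent computation would break the answer
        (MUST HALT), stop after one iteration; otherwise it is safe to spend
        the full deliberation budget."""
        score = probe(first_token_state)   # scalar halt signal read from the latent state
        return 1 if score > threshold else max_iters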

Load-bearing premise

The observed gains on GPQA-Diamond and GSM8K are caused by the state-stream mechanism enabling genuine latent-space reasoning rather than by the specific training procedure, probe, or benchmark selection.

What would settle it

Train an otherwise identical 27B model with the state-streaming component disabled, using the same small GSM8K co-training set and two-pass procedure, then measure whether the +15.15 point gain on GPQA-Diamond and the 46% error reduction on GSM8K disappear.
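
If per-question correctness vectors from such a matched pair of runs were available, the decisive comparison could be scored with a simple paired bootstrap over the 198 GPQA-Diamond items; a self-contained sketch (model evaluation and data loading omitted):

    import random

    def paired_bootstrap_delta(correct_sst, correct_ablation, n_boot=10000, seed=0):
        """Paired bootstrap over questions: 95% interval for the SST-minus-ablation
        accuracy gap. Both inputs are lists of 0/1 scores on the same questions
        (e.g. the 198 GPQA-Diamond items)."""
        assert len(correct_sst) == len(correct_ablation)
        rng = random.Random(seed)
        n = len(correct_sst)
        deltas = []
        for _ in range(n_boot):
            idx = [rng.randrange(n) for _ in range(n)]          # resample questions with replacement
            deltas.append(sum(correct_sst[i] - correct_ablation[i] for i in idx) / n)
        deltas.sort()
        return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

An interval that excludes zero would support the attribution claim; one that straddles zero would not.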

Figures

Figures reproduced from arXiv: 2605.00206 by Thea Aviss.

Figure 1. The state stream mechanism. Green arrows denote the …

Figure 2. Learned blend coefficient structure. (a) Within-layer percentile bands (p5–p95 outer, p25–p75 inner, median line) of the 5,376 per-dimension blend coefficients α_{l,d} at each of the 62 layers. At initialisation, every value was α_init ≈ 0.027; the visible spread is the consequence of training. (b) Each layer's deviation vector α_l − α_init projected onto the top three principal components (PC1, PC2, PC3 explain …)

Figure 3. Top-1024 overlap between iter=1 and iter=4 hidden states across the layer stack and generated sequence, for a representative GPQA-Diamond question. Bright cells = stable; dark cells = low-overlap reorganisation. Low-overlap positions appear as vertical streaks cascading through the depth of the model; stable positions remain near 1.0 at every layer. Threshold 0.976 (GMM crossover) is marked on the colour b…

Figure 4. Layer profile of top-1024 overlap (iter=1 vs iter=4) at low-overlap positions. (a) Position 0 (N = 198): universal basin shift, with the local trough at the layer 25 feedforward activity band. (b) Positions 7, 8, 9 (representative): divergence onset at ∼layer 25, trough in the middle-to-late layers (∼layer 50), rising overlap at deep layers. Solid line = median; shaded bands = IQR, p10–p90, and p5–p95. Uni…

Figure 5. Per-question pass/fail trajectories across iteration depths on GPQA-Diamond (N = 198). Flow of the 198 questions between pass (green) and fail (red) columns at iter = 1, flat iter = 2, flat iter = 3, and flat iter = 4. Each transition decomposes into four ribbons: pass→pass (stable correct), pass→fail (regression), fail→pass (recovery), fail→fail (stable wrong). Accuracy above each column shows the aggrega…

Figure 6. L2 convergence monitoring fails for the SST. The nonlinear recurrence does not converge to a fixed point: all difficulty groups show the same L2 profile (a), and current L2 delta has no predictive power over whether the next iteration will help or hurt (b). The natural first step is to look for convergence in the iterative recurrence, and L2 delta between successive iterations is the most direct way to mea…

Figure 7. The 107 essential hidden-state dimensions at layer 15. Each cell represents one of the 5,376 hidden-state dimensions (84 × 64 grid, dim 0 at bottom-left). Coloured cells are the 107 dimensions the probe requires; colour intensity indicates effective weight importance. The remaining 5,269 dimensions (grey) can be zeroed with no effect on the probe's evaluation accuracy. Having established that the probe rea…

Figure 8. SST vs matched fine-tuned baseline. Training loss as raw values (light) and exponential moving average (solid); validation loss as evaluated. The baseline converges at a comparable validation loss, confirming it is a fair comparator; the ablation result itself is the SST–baseline delta on downstream evaluations (Section 5).

Figure 9. SST vs no-bias ablation variant. The unbiased checkpoint trains to higher loss and underperforms on downstream evaluations (…)

Figure 10. SST training and validation loss (standalone). (a) Training loss; (b) validation loss.

Figure 11. Matched fine-tuned baseline training and validation loss (standalone).

Figure 12. Global alpha statistics over training. Evolution of per-dimension blend coefficient statistics across training steps, showing that the alpha values adapt throughout training rather than remaining at their initialisation. The following figures provide the detailed per-layer analysis of the learned blend coefficients summarised in Section 3.3. Figures 13 and 14 show four representative layers; Figures 16 an…

Figure 13. Learned α_l at four representative depths. Each panel shows one layer's 5,376-dimensional blend coefficient vector reshaped into a 64 × 84 grid with the same hidden-dim index mapped to the same grid position in every panel. Colour encodes raw α on a shared scale (p1–p99 of the full 62-layer matrix). The full 62-layer version is given in …

Figure 14. Per-layer adaptation patterns at four representative depths. Each panel shows one layer's deviation from initialisation, α_l − α_init, on the same 64 × 84 hidden-dim reshape and for the same four layers as …

Figure 15. Per-layer specialisation profile across depth. Within-layer percentile bands of |α_{l,d} − α_init| over the 5,376 hidden dimensions at each layer. The median (solid red) tracks aggregate adaptation per layer; the upper-percentile bands track the magnitude of the most-adapted dimensions. The four selected layers in Figures 13 and 14 (0, 30, 40, 61) are drawn from distinct phases of this profile.

Figure 16. Learned alpha values, all 62 layers. Full version of …

Figure 17. Per-layer adaptation patterns, all 62 layers. Full version of …

Figure 18. Top-1024 overlap heatmap, zoomed to the first 10 generated positions (companion to …)

Figure 19. Layer profile at basin shift positions, all sequence positions. Same metric as …

Figure 20. Layer profile over 512 generated positions. Same metric as …

Figure 21. Layer 15 halt signal probe output across all evaluation timesteps. Top: per-timestep histogram. MUST HALT timesteps (mean +3.66, std 1.16) separate from SAFE timesteps (mean −3.72, std 3.52) with a gap of 7.38 between class means. Middle: per-question strip plot. No SAFE or continue-question timestep lies above threshold, corresponding to zero overthinks. Bottom: every evaluation timestep at its actual it…
Original abstract

Current transformers discard their rich latent residual stream between positions, reconstructing latent reasoning context at each new position and leaving potential reasoning capacity untapped. The State Stream Transformer (SST) V2 enables parameter-efficient reasoning in continuous latent space through an FFN-driven nonlinear recurrence at each decoder layer, where latent states are streamed horizontally across the full sequence via a learned blend. This same mechanism supports continuous latent deliberation per position at inference time, dedicating additional FLOPs to exploring abstract reasoning before committing to a token. A two-pass parallel training procedure resolves the sequential dependency of the recurrence to allow compute-efficient training. Hidden state analysis shows the state stream facilitates reasoning through exploration of distinct semantic basins in continuous latent space, where transitions at content-dependent positions move the model into a substantially different Bayesian posterior, directly influencing the latent space at future positions. We also find, via a learned probe, that at the first generated token position, the latent state already predicts whether the eventual answer will survive or break under additional latent computation for every subsequent position. Co-trained into an existing 27B backbone using only a small dataset of GSM8K examples, the SST delivers a +15.15 point gain over a fine-tuning-matched baseline on out-of-distribution GPQA-Diamond and cuts that same baseline's remaining GSM8K errors by 46%, together showing that the reasoning improvement is attributable to the architectural mechanism rather than scale or training data. On GPQA-Diamond, the resulting 27B SST also achieves higher accuracy than several larger open-weight and proprietary systems, including open-weight models up to 25 times larger.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes the State Stream Transformer (SST) V2, which augments standard transformers with an FFN-driven nonlinear recurrence per decoder layer that streams latent states horizontally across positions via learned blend weights. This enables continuous latent-space deliberation at inference by dedicating extra compute to explore reasoning before token emission. A two-pass parallel training procedure removes the sequential dependency for efficient training. Co-training the mechanism into a 27B backbone on a small GSM8K set yields +15.15 accuracy on out-of-distribution GPQA-Diamond versus a fine-tuning-matched baseline and reduces remaining GSM8K errors by 46%. Hidden-state analysis shows the stream explores distinct semantic basins, and a learned probe indicates the first-token latent state predicts whether the final answer will survive further latent computation.

Significance. If the reported gains can be isolated to the state-stream mechanism, the work would offer a parameter-efficient route to latent reasoning that improves out-of-distribution performance without increasing model scale. The parallel training schedule and basin-transition analysis constitute concrete technical contributions that could be adopted more broadly. The absence of controls that hold the training procedure fixed, however, leaves the central attribution claim open to alternative explanations based on optimization differences.

major comments (3)
  1. [Abstract and Results] Abstract and experimental results: The headline claim that the +15.15 GPQA-Diamond gain and 46% GSM8K error reduction are attributable to the nonlinear recurrence and horizontal state streaming (rather than the two-pass parallel training procedure) is not supported by the current controls. The baseline is described only as 'fine-tuning-matched' and cannot employ the identical two-pass schedule that resolves recurrence dependencies, so differences in gradient flow, effective regularization, or optimization trajectory remain unisolated.
  2. [Hidden State Analysis] Hidden-state analysis section: The probe that predicts answer survival from the first generated token's latent state is trained on outcomes produced by the same SST model, introducing circularity that weakens the claim that the probe demonstrates genuine latent-space reasoning independent of the model's own predictions.
  3. [Experimental Evaluation] Experimental evaluation: No error bars, standard deviations across runs, or full protocol details (including exact data splits, learning-rate schedules, and whether the baseline receives equivalent total compute) are reported for the GPQA-Diamond and GSM8K results, preventing assessment of whether the observed improvements exceed statistical noise.
minor comments (1)
  1. [Methods] The mathematical definition of the learned blend weights and the precise form of the nonlinear recurrence could be stated more explicitly with equations to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with clarifications and planned revisions to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Results] Abstract and experimental results: The headline claim that the +15.15 GPQA-Diamond gain and 46% GSM8K error reduction are attributable to the nonlinear recurrence and horizontal state streaming (rather than the two-pass parallel training procedure) is not supported by the current controls. The baseline is described only as 'fine-tuning-matched' and cannot employ the identical two-pass schedule that resolves recurrence dependencies, so differences in gradient flow, effective regularization, or optimization trajectory remain unisolated.

    Authors: We agree that a control holding the training procedure exactly fixed would strengthen attribution. The two-pass schedule is required to train the recurrence in parallel and is thus inseparable from the SST mechanism itself; a standard transformer baseline cannot use it. The baseline matches the fine-tuning data, steps, and compute budget as closely as possible under standard single-pass training. The large OOD gains on GPQA-Diamond (unseen during co-training) make an optimization-artifact explanation less likely, but we will revise the abstract, results, and discussion to qualify the attribution claim, explicitly note the training-procedure difference as a potential confound, and add it to the limitations section. revision: partial

  2. Referee: [Hidden State Analysis] Hidden-state analysis section: The probe that predicts answer survival from the first generated token's latent state is trained on outcomes produced by the same SST model, introducing circularity that weakens the claim that the probe demonstrates genuine latent-space reasoning independent of the model's own predictions.

    Authors: The probe is an analysis tool that tests whether the latent state at the first generated token already encodes the outcome of subsequent latent deliberation steps performed by the same model. This is by design: it demonstrates that the state stream has compressed reasoning progress into an early latent representation. We do not claim the probe operates independently of the model; rather, it reveals an internal property of SST's latent dynamics. We will revise the hidden-state analysis section to clarify the probe's purpose, remove any phrasing that could imply full independence, and emphasize that the result supports the utility of continuous latent computation. revision: yes

  3. Referee: [Experimental Evaluation] Experimental evaluation: No error bars, standard deviations across runs, or full protocol details (including exact data splits, learning-rate schedules, and whether the baseline receives equivalent total compute) are reported for the GPQA-Diamond and GSM8K results, preventing assessment of whether the observed improvements exceed statistical noise.

    Authors: We will add error bars computed over at least three independent runs with different random seeds, report standard deviations, and expand the experimental protocol appendix with exact data splits, learning-rate schedules, optimizer settings, and a compute-equivalence table confirming that the baseline receives matched total optimization steps (adjusted for the two-pass overhead in SST). These additions will allow direct assessment of statistical reliability. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's central claims rest on direct empirical measurements of accuracy gains on external benchmarks (GPQA-Diamond +15.15 points, 46% GSM8K error reduction) after co-training a 27B backbone on a small GSM8K set, compared against a fine-tuning-matched baseline. These are independent evaluations not derived from any internal fitted parameters, self-citations, or equations that reduce to inputs by construction. The learned probe is described only as an additional analysis tool ('we also find, via a learned probe...') for interpreting latent states and does not underpin or define the accuracy or attribution claims. No self-definitional steps, uniqueness theorems imported from prior author work, ansatzes smuggled via citation, or renaming of known results appear in the provided text. The two-pass parallel training is presented as an enabling component of the proposed architecture rather than a hidden input that forces the reported outcomes.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into exact parameters; the state stream itself functions as a new architectural component whose behavior depends on learned weights and the two-pass training schedule.

free parameters (1)
  • learned blend weights
    Controls how latent states are streamed and blended across positions; fitted during co-training on GSM8K.
axioms (1)
  • domain assumption: Retaining and updating a continuous latent residual stream across positions improves reasoning capacity over standard per-position reconstruction
    Core premise invoked to motivate the architecture and explain the observed semantic-basin transitions.
invented entities (1)
  • State stream (no independent evidence)
    purpose: Horizontal carrier of nonlinear recurrent latent states for continuous deliberation
    New architectural construct introduced to enable the claimed latent-space reasoning; no independent falsifiable prediction outside the model is provided.

pith-pipeline@v0.9.0 · 5587 in / 1566 out tokens · 30472 ms · 2026-05-09T20:26:02.744073+00:00 · methodology

discussion (0)

