pith. the verified trust layer for science. sign in

arxiv: 2508.16745 · v3 · submitted 2025-08-22 · 💻 cs.LG · cs.AI

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

Pith reviewed 2026-05-18 20:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-step reasoningcellular automatarule inferencerecurrencememory augmentationtest-time computereasoning depthneural architectures
0
0 comments X p. Extension

The pith

Neural models infer hidden rules accurately but lose performance over longer chains of reasoning steps

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether neural networks can discover an unknown local rule from short state sequences and then apply that rule over many future steps. It uses a cellular automata setup where the rules seen during training are completely different from those used at test time, so success requires genuine inference rather than memorization. Architectures trained from scratch reach high accuracy on the immediate next step yet degrade sharply once more intermediate applications of the rule are required. Adding recurrence, memory, or extra computation at test time lengthens the effective chain that can be handled, but the gains remain limited. Readers interested in why current AI systems still falter on extended reasoning problems will find the controlled isolation of multi-step chaining directly relevant.

Core claim

In the 1dCA task with disjoint training and test rules, most neural architectures learn to infer the hidden local rule and achieve high next-step prediction accuracy, yet performance drops sharply once the number of required intermediate reasoning steps grows. Increasing model depth is crucial; augmenting with recurrence, memory, or test-time compute scaling extends reachable depth but leaves performance bounded.

What carries the argument

The disjoint-rule 1d cellular automata prediction task, which forces models to infer a hidden local update rule from examples and then chain repeated applications of that rule over many steps.

If this is right

  • Deeper models reach accurate predictions over longer chains of rule applications than shallower ones.
  • Recurrent connections let the model reuse the inferred rule across additional steps without losing state information as quickly.
  • Extra memory or test-time compute allows the model to simulate greater depth and improves results on multi-step predictions.
  • Even after these extensions, accuracy eventually declines, indicating that current architectures still face a hard limit on effective reasoning length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining recurrence, memory, and test-time scaling might produce larger gains than any one technique alone, which could be tested directly in the same 1dCA setup.
  • The observed bound suggests that reliable long-horizon reasoning may need new ways to detect and correct errors that accumulate over many steps.
  • This controlled task could serve as a benchmark for comparing future architectures on pure rule chaining without language or memorization confounds.

Load-bearing premise

The 1dCA task with disjoint train/test rules isolates genuine multi-step rule chaining rather than other factors such as accumulating prediction error or sensitivity to initial state representation.

What would settle it

A model achieving sustained high accuracy across arbitrarily many future steps on the disjoint-rule 1dCA task without any recurrence, memory, or added test-time compute would falsify the claim that performance is inherently bounded by reasoning depth.

Figures

Figures reproduced from arXiv: 2508.16745 by Arman Bolatov, Artem Shelmanov, Aydar Bulatov, Besher Hassan, Bilal Elbouardi, Daniil Orel, Ivan Rodkin, Konstantin Smirnov, Mikhail Burtsev, Preslav Nakov, Timothy Baldwin, Yuri Kuratov.

Figure 1
Figure 1. Figure 1: Learning One-dimensional Cellular Automata. (a) Update of state with local rule. (b) Orbit of 1dCA is a sequence of binary strings of size W = 20. The first k = 10 states marked by the red rectangle encode transformer input. (c) Given a part of the orbit a model learns to predict the next state (O-S). subtasks: (i) inferring the hidden law that governs state transitions and (ii) chaining that law to projec… view at source ↗
Figure 2
Figure 2. Figure 2: Single-step accuracy is near-perfect across models, but multi-step performance collapses. (a) Exact-match accuracy for single-step state prediction (O-S): all models except LSTM achieve >95 %. (b) Bit-wise accuracy for rule inference (O-RS): most architectures recover the hidden Boolean rule, yet ARMT trails the rest. (c) GPTNeox accuracy on variable-horizon prediction across the four task variants (O-S, O… view at source ↗
Figure 3
Figure 3. Figure 3: Depth — not width — drives multi-step accuracy. Exact-match accuracy for look-ahead horizons k ∈ {1, 2, 3, 4} as a function of (a) transformer layer count and (b) embedding dimension dmodel. Deeper networks boost performance sharply for k≥2 and plateau beyond six layers, whereas widening the model yields only marginal gains across all horizons. Finally, we explored the scenario where the local rule ρ is ex… view at source ↗
Figure 4
Figure 4. Figure 4: ACT significantly improves computational abilities of transformer-based models in multi-step prediction. Exact match of the x (T +k) state prediction for look-ahead steps k ∈ {1, 2, 3, 4} for O-S training objective. Different models are shown in different colors. Augmenting models with ACT5 has little effect on all architectures except GPTNeox, which sees improved performance at k = 2 but not at k = 3, 4. … view at source ↗
Figure 5
Figure 5. Figure 5: Without supervision on intermediate reasoning steps RL training with GRPO allows the model to extrapolate reasoning on 3 steps forward. While model and layer ACT variants extend it to 2 steps forward. However, 4 steps remains a challenging task for all approaches. All models are trained from GPTNeox checkpoint trained on 1 step forward prediction task. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: With step-by-step supervision, the CoT approach significantly outperforms the in￾depth approach of ACT. While the performance of the depth-based methods decreases with the increasing look-ahead steps, breadth-based CoT succeeds in predicting up to all 4 steps. However, among the models without autoregressive generation, GPTNeox and ARMT with both ACT and O-O supervision perform the best. (LACT) and across … view at source ↗
Figure 7
Figure 7. Figure 7: ACT significantly reduces the required models’ depth for the majority of group multiplication tasks. Each chart contains the information about the minimal required number of layers for solving task of given length with 70% exact match accuracy. GPTNeox and Mamba being T C0 -limited models require more layers for solving deeper (longer in this case) tasks, while ARMT and LSTM solve them with constant number… view at source ↗
Figure 8
Figure 8. Figure 8: With GRPO as well as with ACT and Orbit-Orbit training depth of reasoning can be significantly extended. Average DepthScore = 1 + P4 i=2 acc(i), where acc(i) is the accuracy of predicting the (10 + i)th state based on the first 10 states. Broader implications for LLM reasoning—and beyond. Our results align with a growing body of evidence that reasoning failures often stem from insufficient depth allocation… view at source ↗
Figure 9
Figure 9. Figure 9: ACT outperforms the base model on multiple prediction horizons task. Exact match accuracy (mean ± std) for cellular automata state prediction across different look-ahead horizons. Models receive initial 10 states followed by a special shift token (1-4) indicating prediction horizon. 10110111001000110100<sep>11101001101111101100<sep>10111011010000111011<sep> 11001110111011101100<sep>10111011001100111011<sep… view at source ↗
Figure 10
Figure 10. Figure 10: Fixed Computation Time (FCT) with 3 iteration steps performs on par with Adaptive Computation Time (ACT) in Orbit-State task. Exact match accuracy (mean ± std) for cellular automata state prediction across different look-ahead horizons. 1 2 3 4 Look-ahead, steps 0 20 40 60 80 100 Exact match GPTNeox + ACT + O-O GPTNeox + FCT + O-O ARMT + ACT + O-O ARMT + FCT + O-O [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Fixed Computation Time (FCT) with 3 iteration steps underperforms Adaptive Computation Time (ACT) in Orbit-Orbit task. Exact match accuracy (mean ± std) for cellular automata state prediction across different look-ahead horizons. F.2 Model-ACT vs Layer-ACT [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Layer-ACT performs similar or better compared to Model-ACT. Exact match on cellular [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
read the original abstract

Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation by using disjoint training and test rules. Given a short state sequence, the model is required to infer the hidden local rule and then chain it to predict multiple future steps. Our evaluation shows that LLMs largely fail to reliably solve a natural-language proxy of the proposed task. We find that most neural architectures trained from scratch can learn rule inference and achieve high next-step accuracy, but performance drops sharply as the required number of intermediate reasoning steps increases. Experiments show that increasing model depth is crucial, and extending effective depth via recurrence, memory, or test-time compute improves results but remains bounded. The code is available on github: https://github.com/RodkinIvan/associative-recurrent-memory-transformer/tree/ACT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces a 1D cellular automata (1dCA) benchmark to investigate multi-step reasoning in neural networks, designed to avoid memorization by using disjoint sets of rules for training and testing. The core experiment requires models to infer a local rule from a short sequence of states and then apply it autoregressively over multiple future steps. Results indicate high accuracy for single-step predictions across various architectures trained from scratch, but a sharp decline as the number of required reasoning steps increases. The authors explore enhancements through increased model depth, recurrence, memory mechanisms, and test-time compute scaling, observing improvements that are nonetheless bounded. Additionally, large language models are shown to underperform on a natural language formulation of the task.

Significance. If substantiated with clearer controls, the findings would be significant for the field of machine learning and AI reasoning. The controlled nature of the 1dCA task provides a clean testbed for studying inference chaining without the confounds of natural language or memorization. The demonstration that recurrence and test-time scaling can extend effective reasoning depth, even if bounded, offers actionable directions for architecture design. The public availability of the code further strengthens the contribution by enabling verification and extension by the community.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments) and evaluation protocol: the central claim that performance drops sharply with more intermediate reasoning steps reflects a limit on multi-step rule inference depth is load-bearing, yet the 1dCA autoregressive setup does not include controls (e.g., teacher-forced evaluation or oracle rule provision) to separate genuine chaining difficulty from compounding prediction errors due to early state inaccuracies. This leaves open whether recurrence/memory/test-time scaling primarily mitigates depth limits or improves robustness to error propagation.
  2. [Results] Results on bounded improvements: the statement that extensions via recurrence, memory, or test-time compute 'remain bounded' is central but lacks precise quantification (specific accuracy curves vs. depth/scaling factors, with error bars) to establish the magnitude and robustness of the claimed gains over baseline depth increases.
minor comments (3)
  1. The abstract and results would benefit from inclusion of quantitative tables reporting exact next-step and multi-step accuracies, standard deviations across runs, and full ablations for each proposed extension (recurrence, memory, test-time compute).
  2. Clarify notation and implementation details for the 'associative recurrent memory transformer' architecture referenced in the code repository, including how recurrence and memory are integrated with the base transformer.
  3. Figures showing performance vs. reasoning depth should include error bars, legends, and axis labels for clarity; the natural-language LLM proxy task would benefit from explicit prompt templates and failure mode analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the paper accordingly to improve the evaluation protocol and results presentation.

read point-by-point responses
  1. Referee: [§4 (Experiments)] §4 (Experiments) and evaluation protocol: the central claim that performance drops sharply with more intermediate reasoning steps reflects a limit on multi-step rule inference depth is load-bearing, yet the 1dCA autoregressive setup does not include controls (e.g., teacher-forced evaluation or oracle rule provision) to separate genuine chaining difficulty from compounding prediction errors due to early state inaccuracies. This leaves open whether recurrence/memory/test-time scaling primarily mitigates depth limits or improves robustness to error propagation.

    Authors: We agree that the original autoregressive setup leaves some ambiguity between chaining limits and error propagation. In the revised manuscript, we have added teacher-forced evaluation (providing ground-truth states at each step) and an oracle rule provision condition. These controls show that accuracy still declines sharply with depth even when prior states are perfect, indicating a genuine limit on multi-step rule inference rather than solely error accumulation. The new results clarify that recurrence, memory, and test-time compute primarily extend effective reasoning depth. We have updated Section 4 with the revised protocol and added corresponding figures. revision: yes

  2. Referee: [Results] Results on bounded improvements: the statement that extensions via recurrence, memory, or test-time compute 'remain bounded' is central but lacks precise quantification (specific accuracy curves vs. depth/scaling factors, with error bars) to establish the magnitude and robustness of the claimed gains over baseline depth increases.

    Authors: We acknowledge the need for more precise quantification. The revised results section now includes detailed accuracy curves plotting performance against reasoning depth for different depths, recurrence steps, memory sizes, and test-time compute budgets. All curves include error bars from multiple independent runs. A new table summarizes the maximum reliable depth achieved by each method relative to baseline depth scaling. These additions demonstrate the magnitude of gains and confirm their bounded nature, with improvements plateauing at higher scaling factors. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation with no circular derivation chain

full rationale

The paper reports experimental results on neural architectures trained from scratch in a 1dCA framework with disjoint train/test rules. Claims about performance drops with increasing reasoning steps and improvements from recurrence, memory, or test-time compute are supported by measured accuracies on held-out data rather than any first-principles derivation, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the inputs by construction; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the empirical observation that performance degrades with reasoning depth and that certain architectural extensions mitigate but do not eliminate the degradation. No mathematical derivation or free parameters are introduced; the work relies on standard assumptions that the cellular-automata update is deterministic and local and that the disjoint-rule split prevents memorization.

axioms (2)
  • domain assumption The cellular automata update rule is strictly local and deterministic.
    Invoked when defining the task as requiring inference of a hidden local rule from state sequences.
  • domain assumption Disjoint training and test rules eliminate memorization of specific rule instances.
    Stated explicitly in the abstract as the mechanism that forces genuine rule inference.

pith-pipeline@v0.9.0 · 5740 in / 1493 out tokens · 37216 ms · 2026-05-18T20:44:43.183204+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diagnosing CFG Interpretation in LLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 1 Pith paper · 14 internal anchors

  1. [1]

    M. Aach, J. H. Goebbert, and J. Jitsev. Generalization over different cellular automata rules learned by a deep feed-forward neural network, 2021

  2. [2]

    L. M. Antunes. Cellpylib: A python library for working with cellular automata. Journal of Open Source Software, 6(67):3608, 2021. doi: 10.21105/joss.03608. URL https://doi.org/ 10.21105/joss.03608

  3. [3]

    J. A. Berkovich and M. J. Buehler. Lifegpt: Topology-agnostic generative pretrained transformer model for cellular automata. arXiv preprint arXiv:2409.12182, 2024

  4. [4]

    Besta, N

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 17682–17690, 2024

  5. [5]

    Bhattamishra, A

    S. Bhattamishra, A. Patel, and N. Goyal. On the computational power of transformers and its implications in sequence modeling. arXiv preprint arXiv:2006.09286, 2020

  6. [6]

    Bibin and A

    A. Bibin and A. Dereventsov. Data-centric approach to constrained machine learning: A case study on conway’s game of life.arXiv preprint arXiv:2408.12778, 2024

  7. [7]

    GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach. Gpt-neox-20b: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745

  8. [8]

    Bulatov, Y

    A. Bulatov, Y . Kuratov, and M. S. Burtsev. Recurrent memory transformer. arXiv preprint arXiv:2207.06881, 2022

  9. [9]

    Bulatov, Y

    A. Bulatov, Y . Kuratov, and M. S. Burtsev. Recurrent memory transformer, 2022

  10. [10]

    Chevalier, A

    A. Chevalier, A. Wettig, A. Ajith, and D. Chen. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788, 2023

  11. [11]

    G. Cybenko. Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2:183–192, 1989

  12. [12]

    Z. Dai, Z. Yang, Y . Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019

  13. [13]

    Dehghani, S

    M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. Universal transformers. In International Conference on Learning Representations, 2019. URL https://openreview. net/forum?id=HyzdRiR9Y7

  14. [14]

    Deletang, A

    G. Deletang, A. Ruoss, J. Grau-Moya, T. Genewein, L. K. Wenliang, E. Catt, C. Cundy, M. Hutter, S. Legg, J. Veness, et al. Neural networks and the chomsky hierarchy. In The Eleventh International Conference on Learning Representations, 2023

  15. [15]

    Dziri, X

    N. Dziri, X. Lu, M. Sclar, X. L. Li, L. Jiang, B. Y . Lin, S. Welleck, P. West, C. Bhagavatula, R. Le Bras, et al. Faith and fate: Limits of transformers on compositionality. Advances in Neural Information Processing Systems, 36, 2024. 12

  16. [16]

    V . Elser. Reconstructing cellular automata rules from observations at nonconsecutive times. Physical Review E, 104(3):034301, 2021

  17. [17]

    G. Feng, B. Zhang, Y . Gu, H. Ye, D. He, and L. Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 2024

  18. [18]

    G. Gala, D. Grattarola, and E. Quaeghebeur. E (n)-equivariant graph neural cellular automata. arXiv preprint arXiv:2301.10497, 2023

  19. [19]

    J. P. Gandarela, D. S. Carvalho, and A. Freitas. Inductive learning of logical theories with llms: A complexity-graded analysis. arXiv preprint arXiv:2408.16779, 2024

  20. [20]

    W. Gilpin. Cellular automata as convolutional neural networks. Physical Review E, 100(3): 032402, 2019

  21. [21]

    Better & Faster Large Language Models via Multi-token Prediction

    F. Gloeckle, B. Y . Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024

  22. [22]

    Goyal, Z

    S. Goyal, Z. Ji, A. S. Rawat, A. K. Menon, S. Kumar, and V . Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, 2024

  23. [23]

    Grattarola, L

    D. Grattarola, L. Livi, and C. Alippi. Learning graph cellular automata. Advances in Neural Information Processing Systems, 34:20983–20994, 2021

  24. [24]

    A. Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016

  25. [25]

    Neural Turing Machines

    A. Graves, G. Wayne, and I. Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401, 2014

  26. [26]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  27. [27]

    A. Gu, K. Goel, and C. Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021

  28. [28]

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  29. [29]

    S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y . Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024

  30. [30]

    S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y . Tian. Training large language models to reason in a continuous latent space, 2024. URL https://arxiv.org/abs/2412. 06769

  31. [31]

    Teaching large language models to reason with reinforcement learning.arXiv preprint arXiv:2403.04642,

    A. Havrilla, Y . Du, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, M. Zhuravinskyi, E. Hambro, S. Sukhbaatar, and R. Raileanu. Teaching large language models to reason with reinforcement learning. arXiv preprint arXiv:2403.04642, 2024

  32. [32]

    Herel and T

    D. Herel and T. Mikolov. Thinking tokens for language modeling. arXiv preprint arXiv:2405.08644, 2024

  33. [33]

    Hochreiter and J

    S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997

  34. [34]

    W. H. Holliday and M. Mandelkern. Conditional and modal reasoning in large language models. arXiv preprint arXiv:2401.17169, 2024

  35. [35]

    Hornik, M

    K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989. 13

  36. [36]

    B. Kang, H. Kumar, M. Lee, B. Chakraborty, and S. Mukhopadhyay. Learning locally interacting discrete dynamical systems: Towards data-efficient and scalable prediction. In A. Abate, M. Cannon, K. Margellos, and A. Papachristodoulou, editors, Proceedings of the 6th Annual Learning for Dynamics and Control Conference , volume 242 of Proceedings of Machine L...

  37. [37]

    Karloff, S

    H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. InProceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms , pages 938–948. SIAM, 2010

  38. [38]

    Training Language Models to Self-Correct via Reinforcement Learning

    A. Kumar, V . Zhuang, R. Agarwal, Y . Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024

  39. [39]

    and Sabharwal, A

    W. Merrill and A. Sabharwal. The expresssive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023

  40. [40]

    Merrill and A

    W. Merrill and A. Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023

  41. [41]

    Merrill and A

    W. Merrill and A. Sabharwal. The expressive power of transformers with chain of thought. In The Twelfth International Conference on Learning Representations, 2024

  42. [42]

    Merrill, A

    W. Merrill, A. Sabharwal, and N. A. Smith. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10:843–856, 2022

  43. [43]

    Merrill, J

    W. Merrill, J. Petty, and A. Sabharwal. The illusion of state in state-space models. InProceedings of the 41st International Conference on Machine Learning, pages 35492–35506, 2024

  44. [44]

    Mondorf and B

    P. Mondorf and B. Plank. Liar, liar, logical mire: A benchmark for suppositional reasoning in large language models. arXiv preprint arXiv:2406.12546, 2024

  45. [45]

    Mordvintsev, E

    A. Mordvintsev, E. Randazzo, E. Niklasson, and M. Levin. Growing neural cellular automata. Distill, 5(2):e23, 2020

  46. [46]

    Nowak, A

    F. Nowak, A. Svete, A. Butoi, and R. Cotterell. On the representational capacity of neural language models with chain-of-thought reasoning. In L.-W. Ku, A. Martins, and V . Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 12510–12548, Bangkok, Thailand, Aug. 2024. A...

  47. [47]

    Learning to reason with llms

    OpenAI. Learning to reason with llms. https://openai.com/index/ learning-to-reason-with-llms/ , 2024. Accessed: 2024-09-23

  48. [48]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  49. [49]

    Pande and D

    R. Pande and D. Grattarola. Hierarchical neural cellular automata. In Artificial Life Conference Proceedings 35, volume 2023, page 20. MIT Press One Rogers Street, Cambridge, MA 02142- 1209, USA journals-info . . . , 2023

  50. [50]

    Pérez, P

    J. Pérez, P. Barceló, and J. Marinkovic. Attention is turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021

  51. [51]

    J. Pfau, W. Merrill, and S. R. Bowman. Let’s think dot by dot: Hidden computation in transformer language models. In First Conference on Language Modeling, 2024. URL https: //openreview.net/forum?id=NikbrdtYvG

  52. [52]

    J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint, 2019. URL https://arxiv. org/abs/1911.05507. 14

  53. [53]

    A. D. Richardson, T. Antal, R. A. Blythe, and L. J. Schumacher. Learning spatio-temporal patterns with neural cellular automata. PLOS Computational Biology, 20(4):e1011589, 2024

  54. [54]

    Rodkin, Y

    I. Rodkin, Y . Kuratov, A. Bulatov, and M. Burtsev. Associative recurrent memory transformer,

  55. [55]

    URL https://arxiv.org/abs/2407.04841

  56. [56]

    Sanford, D

    C. Sanford, D. Hsu, and M. Telgarsky. Transformers, parallel computation, and logarithmic depth. In Forty-first International Conference on Machine Learning, 2024

  57. [57]

    Sanford, D

    C. Sanford, D. J. Hsu, and M. Telgarsky. Representational strengths and limitations of trans- formers. Advances in Neural Information Processing Systems, 36, 2024

  58. [58]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  59. [59]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  60. [60]

    The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

    P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, and M. Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025. URL https://arxiv.org/abs/2506.06941

  61. [61]

    J. M. Springer and G. T. Kenyon. It’s hard for neural networks to learn the game of life. In2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021

  62. [62]

    Transformers as recognizers of formal languages: A survey on expressivity

    L. Strobl, W. Merrill, G. Weiss, D. Chiang, and D. Angluin. Transformers as recognizers of formal languages: A survey on expressivity. arXiv preprint arXiv:2311.00208, 2023

  63. [63]

    Strobl, W

    L. Strobl, W. Merrill, G. Weiss, D. Chiang, and D. Angluin. What formal languages can transformers express? a survey. Transactions of the Association for Computational Linguistics, 12:543–561, 2024

  64. [64]

    Tesfaldet, D

    M. Tesfaldet, D. Nowrouzezahrai, and C. Pal. Attention-based neural cellular automata. Ad- vances in Neural Information Processing Systems, 35:8174–8186, 2022

  65. [65]

    Solving math word problems with process- and outcome-based feedback

    J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022

  66. [66]

    Valmeekam, K

    K. Valmeekam, K. Stechly, and S. Kambhampati. Llms still can’t plan; can lrms? a preliminary evaluation of openai’s o1 on planbench.arXiv preprint arXiv:2409.13373, 2024

  67. [67]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is All you Need. In Advances in neural information processing systems , pages 5998–6008, 2017. URL http://papers.nips.cc/paper/ 7181-attention-is-all-you-need

  68. [68]

    Y . Wan, W. Wang, Y . Yang, Y . Yuan, J.-t. Huang, P. He, W. Jiao, and M. Lyu. LogicAsker: Evaluating and improving the logical reasoning ability of large language models. In Y . Al- Onaizan, M. Bansal, and Y .-N. Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2124–2155, Miami, Florida, USA, Nov...

  69. [69]

    P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y . Li, D. Chen, Y . Wu, and Z. Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In L.-W. Ku, A. Martins, and V . Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, Bangkok, Thailand...

  70. [70]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 15

  71. [71]

    Memory Networks

    J. Weston, S. Chopra, and A. Bordes. Memory networks. In Y . Bengio and Y . LeCun, editors,3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1410.3916

  72. [72]

    Wulff and J

    N. Wulff and J. A. Hertz. Learning cellular automaton dynamics with neural networks.Advances in Neural Information Processing Systems, 5, 1992

  73. [73]

    L. Yang, K. Lee, R. Nowak, and D. Papailiopoulos. Looped transformers are better at learning learning algorithms, 2024. URL https://arxiv.org/abs/2311.12424

  74. [74]

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024

  75. [75]

    Q. Yu, Z. He, S. Li, X. Zhou, J. Zhang, J. Xu, and D. He. Enhancing auto-regressive chain-of- thought through loop-aligned reasoning, 2025. URL https://arxiv.org/abs/2502.08482

  76. [76]

    C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019

  77. [77]

    Zhang, M

    X. Zhang, M. Abdul-Mageed, and L. V . Lakshmanan. Autoregressive+ chain of thought= recurrent: Recurrence’s role in language models’ computability and a revisit of recurrent transformer. arXiv preprint arXiv:2409.09239, 2024

  78. [78]

    iterations

    Y . Zhang and H. Bölcskei. Cellular automata, many-valued logic, and deep neural networks. arXiv preprint arXiv:2404.05259, 2024. 16 A Related Work Computational Expressivity. Sanford et al. [56] show that in setups where the input context length grows but the model depth remains constant, transformers achieve logarithmic complexity scaling in input size ...