pith. machine review for the scientific record. sign in

arxiv: 2605.12466 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI· cs.CL· cs.NE

Recognition: 2 theorem links

· Lean Theorem

Solve the Loop: Attractor Models for Language and Reasoning

Jacob Fein-Ashley, Paria Rashidinejad

Pith reviewed 2026-05-13 05:30 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.NE
keywords attractor modelsfixed-point solvingimplicit differentiationlooped transformerslanguage modelingreasoning tasksequilibrium internalization
0
0 comments X

The pith

Attractor Models solve for fixed points to enable stable iterative refinement in language and reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Attractor Models as a way to gain the benefits of looped computation without its usual training problems. A backbone network proposes an embedding and an attractor module then solves for the fixed point of a refinement function, with gradients flowing through implicit differentiation rather than unrolled steps. This keeps memory usage fixed even as effective depth grows and lets the number of iterations adapt to convergence. In practice the approach yields lower perplexity than Transformers of similar size, improves downstream accuracy, and lets very small models reach high performance on hard reasoning problems where larger systems struggle. The models also learn to place their initial proposal near the equilibrium, so the attractor can often be dropped at inference time with little loss.

Core claim

Attractor Models refine proposed output embeddings by solving for the fixed point of the attractor module, with gradients computed via implicit differentiation. This formulation keeps training memory constant in effective depth and selects iterations adaptively by convergence. The resulting models outperform standard Transformers on language-model pretraining while reducing cost, and small instances achieve high accuracy on difficult reasoning benchmarks. Fixed-point training further induces equilibrium internalization, allowing the solver to be removed at inference with minimal degradation.

What carries the argument

The attractor module that solves the fixed-point equation for refined embeddings using implicit differentiation to produce constant-memory gradients.

If this is right

  • A 770M Attractor Model achieves better language-modeling perplexity than a 1.3B Transformer trained on twice as many tokens.
  • 27M-parameter Attractor Models reach 91.4 percent accuracy on Sudoku-Extreme and 93.1 percent on Maze-Hard using roughly 1000 examples.
  • Training memory remains constant as effective recurrence depth increases.
  • Models exhibit equilibrium internalization, permitting the attractor solver to be removed at inference time with little performance drop.
  • The approach delivers Pareto improvements in perplexity, downstream accuracy, and training cost over both standard Transformers and prior looped models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Equilibrium internalization could let deployed systems run without any iterative solver, cutting inference latency while retaining the benefits of training-time refinement.
  • Constant-memory training opens the possibility of scaling effective depth far beyond what explicit unrolling currently allows.
  • The same fixed-point mechanism might transfer to other iterative tasks such as multi-step planning or symbolic manipulation where explicit loops have been hard to stabilize.

Load-bearing premise

Solving the fixed-point equation via the attractor module and implicit differentiation produces gradients and behavior equivalent to explicit looped iteration without introducing convergence artifacts or bias in the learned representations.

What would settle it

An experiment in which forcing the model to use explicit fixed-depth iteration at the same effective depth produces unstable training or measurably worse perplexity and reasoning accuracy than the implicit attractor version.

read the original abstract

Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3, fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Attractor Models, consisting of a backbone network that proposes initial output embeddings followed by an attractor module that refines them by solving a fixed-point equation, with gradients obtained via implicit differentiation. This design is claimed to enable constant-memory training, adaptive iteration depth based on convergence, and improved performance over standard Transformers and looped architectures. Key empirical results include a 770M-parameter Attractor Model outperforming a 1.3B Transformer trained on twice as many tokens in language modeling (with perplexity gains up to 46.6% and downstream accuracy up to 19.7%), and a 27M-parameter model with ~1000 examples achieving 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard. The work also reports a novel 'equilibrium internalization' effect allowing the solver to be dropped at inference with minimal degradation.

Significance. If the central claims hold, this approach could make iterative refinement practical at scale by converting recurrence into an internalized computation, offering Pareto improvements in efficiency and performance for both large-scale pretraining and data-efficient reasoning. The equilibrium internalization observation, if robust, would be a notable contribution to understanding how models can learn to approximate fixed-point iteration without explicit loops.

major comments (3)
  1. [§3.2] §3.2 (Attractor module and implicit differentiation): The central claim that implicit differentiation on the fixed-point equation produces gradients and behavior equivalent to explicit looped iteration (without convergence artifacts or representation bias) is load-bearing for all stability, memory, and performance claims, yet the manuscript provides no convergence guarantees, uniqueness proofs for the fixed point, or ablations comparing implicit vs. explicit gradients. This directly engages the risk that reported gains (e.g., the 770M vs. 1.3B comparison) could arise from training artifacts rather than true iterative refinement.
  2. [§5.1] §5.1 and Table 2 (language modeling results): The Pareto improvement claim for the 770M Attractor Model over the 1.3B Transformer on twice the tokens requires explicit confirmation that data mixtures, token counts, and optimization hyperparameters are matched; without this, the 46.6% perplexity gain cannot be attributed to the attractor mechanism rather than confounding factors.
  3. [§6.2] §6.2 (reasoning experiments): The 27M model achieving 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard with only ~1000 examples is presented as evidence of favorable scaling where larger models fail, but the manuscript lacks ablations isolating the contribution of the attractor fixed-point solver versus the backbone or data construction; this is load-bearing for the claim that the architecture enables data-efficient reasoning.
minor comments (2)
  1. [§4] The abstract and §4 mention 'equilibrium internalization' as a discovered side effect, but the definition and measurement (e.g., how 'near equilibrium' is quantified and the exact inference-time degradation) should be formalized earlier with a dedicated equation or metric.
  2. Figure captions and axis labels in the scaling plots (e.g., Figure 3) use inconsistent notation for model sizes and token counts; clarify whether parameter counts include the attractor module.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment point by point below, providing additional details and committing to revisions where they improve the work without misrepresenting our results.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Attractor module and implicit differentiation): The central claim that implicit differentiation on the fixed-point equation produces gradients and behavior equivalent to explicit looped iteration (without convergence artifacts or representation bias) is load-bearing for all stability, memory, and performance claims, yet the manuscript provides no convergence guarantees, uniqueness proofs for the fixed point, or ablations comparing implicit vs. explicit gradients. This directly engages the risk that reported gains (e.g., the 770M vs. 1.3B comparison) could arise from training artifacts rather than true iterative refinement.

    Authors: We agree that stronger theoretical and empirical grounding would benefit the paper. Implicit differentiation via the implicit function theorem yields gradients equivalent to unrolled iteration at the fixed point by construction, as established in the DEQ literature; we will expand the discussion in §3.2 to include conditions for local uniqueness (e.g., when the attractor Jacobian has spectral radius <1) and reference relevant contraction-mapping results. We will also add a small-scale ablation comparing implicit versus explicit (truncated) gradients to confirm equivalence and absence of artifacts. These additions will be included in the revision. revision: partial

  2. Referee: [§5.1] §5.1 and Table 2 (language modeling results): The Pareto improvement claim for the 770M Attractor Model over the 1.3B Transformer on twice the tokens requires explicit confirmation that data mixtures, token counts, and optimization hyperparameters are matched; without this, the 46.6% perplexity gain cannot be attributed to the attractor mechanism rather than confounding factors.

    Authors: All models were trained on identical data mixtures drawn from the same corpus, using the same optimizer settings, learning-rate schedule, and batch size. The 1.3B Transformer baseline was deliberately trained on twice the tokens to provide a compute-matched comparison; the 770M Attractor Model used half the tokens. We will add an explicit statement in §5.1 and a supplementary table listing the exact token counts, data composition, and hyperparameter values for each model to remove any ambiguity. revision: yes

  3. Referee: [§6.2] §6.2 (reasoning experiments): The 27M model achieving 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard with only ~1000 examples is presented as evidence of favorable scaling where larger models fail, but the manuscript lacks ablations isolating the contribution of the attractor fixed-point solver versus the backbone or data construction; this is load-bearing for the claim that the architecture enables data-efficient reasoning.

    Authors: We acknowledge that isolating the attractor module's contribution is important. We will add ablations in the revised §6.2 that (i) compare the full Attractor Model against the backbone alone (without the fixed-point solver) and (ii) vary the number of training examples while holding architecture fixed, to quantify the solver's role in data efficiency. These experiments reuse the same training setup and will be reported alongside the existing results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an architectural proposal (backbone + attractor fixed-point solver with implicit differentiation) followed by empirical results on language modeling and reasoning benchmarks. No load-bearing mathematical derivation reduces to its own inputs by construction: the fixed-point formulation and implicit gradients are standard techniques from equilibrium models, the performance claims are measured outcomes rather than fitted quantities renamed as predictions, and the equilibrium internalization effect is reported as an observed side-effect rather than a definitional tautology. No self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the provided text as central justifications. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on standard mathematical tools for implicit differentiation through fixed points and introduces the attractor module as a new architectural component; no free parameters or invented physical entities are described in the abstract.

axioms (1)
  • standard math The implicit function theorem permits differentiation through the solution of the fixed-point equation without unrolling iterations
    Invoked to obtain gradients for the attractor module while keeping memory constant
invented entities (1)
  • Attractor module no independent evidence
    purpose: Refines proposed embeddings by solving for the fixed point of iterative refinement
    New component introduced to replace explicit recurrence loops

pith-pipeline@v0.9.0 · 5620 in / 1243 out tokens · 111190 ms · 2026-05-13T05:30:11.493188+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 3 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems 32 (NeurIPS 2019) , year =

    Deep Equilibrium Models , author =. Advances in Neural Information Processing Systems 32 (NeurIPS 2019) , year =

  2. [2]

    Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages =

    Press, Ofir and Wolf, Lior , title =. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages =

  3. [3]

    2025 , eprint=

    Scaling Latent Reasoning via Looped Language Models , author=. 2025 , eprint=

  4. [4]

    2026 , eprint=

    Parcae: Scaling Laws For Stable Looped Language Models , author=. 2026 , eprint=

  5. [5]

    A little depth goes a long way:

    Merrill, Will and Sabharwal, Ashish , journal=. A little depth goes a long way:

  6. [6]

    2025 , eprint=

    Energy-Based Transformers are Scalable Learners and Thinkers , author=. 2025 , eprint=

  7. [7]

    Advances in Neural Information Processing Systems 30 , pages =

    Attention is All You Need , author =. Advances in Neural Information Processing Systems 30 , pages =

  8. [8]

    2019 , eprint=

    Universal Transformers , author=. 2019 , eprint=

  9. [9]

    2026 , eprint=

    Why Are Linear RNNs More Parallelizable? , author=. 2026 , eprint=

  10. [10]

    2025 , eprint=

    Training Large Language Models to Reason in a Continuous Latent Space , author=. 2025 , eprint=

  11. [11]

    2017 , eprint=

    Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling , author=. 2017 , eprint=

  12. [12]

    Journal of the ACM , volume =

    Iterative Procedures for Nonlinear Integral Equations , author =. Journal of the ACM , volume =. 1965 , doi =

  13. [13]

    2026 , eprint=

    -Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space , author=. 2026 , eprint=

  14. [14]

    2026 , eprint=

    A Mechanistic Analysis of Looped Reasoning Language Models , author=. 2026 , eprint=

  15. [15]

    2025 , eprint=

    Next-Embedding Prediction Makes Strong Vision Learners , author=. 2025 , eprint=

  16. [16]

    2021 , eprint=

    Stabilizing Equilibrium Models by Jacobian Regularization , author=. 2021 , eprint=

  17. [17]

    Zico , title =

    Bai, Shaojie and Koltun, Vladlen and Kolter, J. Zico , title =. Advances in Neural Information Processing Systems , volume =. 2020 , publisher =

  18. [18]

    and Parks, Harold R

    Krantz, Steven G. and Parks, Harold R. , title =. 2002 , doi =

  19. [19]

    2025 , eprint=

    Hierarchical Reasoning Model , author=. 2025 , eprint=

  20. [20]

    2025 , eprint=

    Less is More: Recursive Reasoning with Tiny Networks , author=. 2025 , eprint=

  21. [21]

    2017 , eprint=

    Adaptive Computation Time for Recurrent Neural Networks , author=. 2017 , eprint=

  22. [22]

    International Conference on Machine Learning , pages=

    Looped transformers as programmable computers , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  23. [23]

    arXiv preprint arXiv:2512.11816 , year=

    Reinforcement Learning for Latent-Space Thinking in. arXiv preprint arXiv:2512.11816 , year=

  24. [24]

    Parallel continuous chain-of-thought with

    Wu, Haoyi and Teng, Zhihao and Tu, Kewei , booktitle=. Parallel continuous chain-of-thought with

  25. [25]

    2025 , eprint=

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach , author=. 2025 , eprint=

  26. [26]

    The Illusion of Superposition?

    Rizvi-Martel, Michael and Rabusseau, Guillaume and Mosbach, Marius , journal=. The Illusion of Superposition?

  27. [27]

    Latent reasoning in

    Deng, Jingcheng and Pang, Liang and Wei, Zihao and Xu, Shichen and Duan, Zenghao and Xu, Kun and Song, Yang and Shen, Huawei and Cheng, Xueqi , journal=. Latent reasoning in

  28. [28]

    Deng, Jingcheng and Wei, Zihao and Pang, Liang and Wu, Junhong and Xu, Shicheng and Duan, Zenghao and Shen, Huawei , journal=

  29. [29]

    2022 , eprint=

    On Training Implicit Models , author=. 2022 , eprint=

  30. [30]

    Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

    Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner , author=. arXiv preprint arXiv:2510.03206 , year=

  31. [31]

    Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence , author=

  32. [32]

    2022 , eprint=

    Path Independent Equilibrium Models Can Better Exploit Test-Time Computation , author=. 2022 , eprint=

  33. [33]

    2021 , eprint=

    JFB: Jacobian-Free Backpropagation for Implicit Networks , author=. 2021 , eprint=

  34. [34]

    2024 , howpublished =

    The Claude 3 Model Family:. 2024 , howpublished =

  35. [35]

    Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , booktitle=. The

  36. [36]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini:. arXiv preprint arXiv:2312.11805 , year=

  37. [37]

    2025 , publisher =

    Andrej Karpathy , title =. 2025 , publisher =

  38. [38]

    2024 , eprint=

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , author=. 2024 , eprint=

  39. [39]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Paperno, Denis and Kruszewski, Germ \'a n and Lazaridou, Angeliki and Pham, Ngoc Quan and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern \'a ndez, Raquel. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (...

  40. [40]

    and Carmon, Yair and Dave, Achal and Schmidt, Ludwig and Shankar, Vaishaal , booktitle =

    Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir and Bansal, Hritik and Guha, Etash and Keh, Sedrick and Arora, Kushal and Garg, Saurabh and Xin, Rui and Muennighoff, Niklas and Heckel, Reinhard and Mercat, Jean and Chen, Mayee and Gururangan, Suchin and Wortsman, Mitchell and Albalak, Alon and Bitton, Yona...

  41. [41]

    End-to-end algorithm synthesis with recurrent networks:

    Bansal, Arpit and Schwarzschild, Avi and Borgnia, Eitan and Emam, Zeyad and Huang, Furong and Goldblum, Micah and Goldstein, Tom , journal=. End-to-end algorithm synthesis with recurrent networks:

  42. [42]

    Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

  43. [43]

    2026 , eprint=

    Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning , author=. 2026 , eprint=

  44. [44]

    2024 , eprint=

    Looped Transformers are Better at Learning Learning Algorithms , author=. 2024 , eprint=

  45. [45]

    2026 , eprint=

    Stability and Generalization in Looped Transformers , author=. 2026 , eprint=

  46. [46]

    2026 , eprint=

    Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers , author=. 2026 , eprint=

  47. [47]

    2025 , eprint=

    Reasoning with Latent Thoughts: On the Power of Looped Transformers , author=. 2025 , eprint=

  48. [48]

    2025 , eprint=

    On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding , author=. 2025 , eprint=

  49. [49]

    The Fourteenth International Conference on Learning Representations , year=

    Equilibrium Language Models , author=. The Fourteenth International Conference on Learning Representations , year=

  50. [50]

    2022 , eprint=

    Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

  51. [51]

    2025 , month = jul, day =

    SmolLM3: smol, multilingual, long-context reasoner , author =. 2025 , month = jul, day =

  52. [52]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  53. [53]

    2023 , eprint=

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=. 2023 , eprint=

  54. [54]

    2025 , eprint=

    Looped Transformers for Length Generalization , author=. 2025 , eprint=

  55. [55]

    2025 , eprint=

    SIM-CoT: Supervised Implicit Chain-of-Thought , author=. 2025 , eprint=

  56. [56]

    Mohtashami, Amirkeivan and Pagliardini, Matteo and Jaggi, Martin , booktitle=. Co

  57. [57]

    2026 , eprint=

    How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models , author=. 2026 , eprint=

  58. [58]

    2025 , eprint=

    Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation , author=. 2025 , eprint=

  59. [59]

    2025 , eprint=

    Next-Latent Prediction Transformers Learn Compact World Models , author=. 2025 , eprint=

  60. [60]

    2025 , eprint=

    Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking , author=. 2025 , eprint=

  61. [61]

    2026 , eprint=

    A Formal Comparison Between Chain of Thought and Latent Thought , author=. 2026 , eprint=

  62. [62]

    2026 , eprint=

    LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation , author=. 2026 , eprint=

  63. [63]

    2026 , eprint=

    Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models , author=. 2026 , eprint=

  64. [64]

    Emergence of superposition:

    Zhu, Hanlin and Hao, Shibo and Hu, Zhiting and Jiao, Jiantao and Russell, Stuart and Tian, Yuandong , journal=. Emergence of superposition:

  65. [65]

    Hyperloop Transformers

    Hyperloop Transformers , author=. arXiv preprint arXiv:2604.21254 , year=

  66. [66]

    Knupp, Jonas and Metzen, Jan Hendrik and Bohn, Jeremias and Groh, Georg and Kersting, Kristian , journal=

  67. [67]

    Song, Shixiang and Li, He and Wang, Zitong and Zeng, Boyi and Song, Feichen and Wang, Yixuan and Xu, Zhiqin John and He, Ziwei and Lin, Zhouhan , journal=

  68. [68]

    Zhang, Xiaojing and Wu, Haifeng and He, Gang and Shen, Jiyang and Lyu, Bochen and Zhu, Zhanxing , booktitle=

  69. [69]

    Zheng, Yunao and Wang, Xiaojie and Ren, Lei and Wei, Chen , booktitle=

  70. [70]

    2025 , eprint=

    Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models , author=. 2025 , eprint=

  71. [71]

    International Conference on Machine Learning , pages=

    Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? , author=. International Conference on Machine Learning , pages=. 2024 , organization=

  72. [72]

    Reasoning by superposition:

    Zhu, Hanlin and Hao, Shibo and Hu, Zhiting and Jiao, Jiantao and Russell, Stuart J and Tian, Yuandong , journal=. Reasoning by superposition:

  73. [73]

    2025 , eprint=

    The Belief State Transformer , author=. 2025 , eprint=

  74. [74]

    2025 , eprint=

    The Free Transformer , author=. 2025 , eprint=

  75. [75]

    2002 , publisher=

    The implicit function theorem: history, theory, and applications , author=. 2002 , publisher=

  76. [76]

    2016 , eprint=

    Neural GPUs Learn Algorithms , author=. 2016 , eprint=

  77. [77]

    2020 , eprint=

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , author=. 2020 , eprint=

  78. [78]

    2018 , eprint=

    Recurrent Stacking of Layers for Compact Neural Machine Translation Models , author=. 2018 , eprint=

  79. [79]

    2023 , eprint=

    Lessons on Parameter Sharing across Layers in Transformers , author=. 2023 , eprint=

  80. [80]

    2025 , eprint=

    Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA , author=. 2025 , eprint=

Showing first 80 references.