Solve the Loop: Attractor Models for Language and Reasoning
Pith reviewed 2026-05-13 05:30 UTC · model grok-4.3
The pith
Attractor Models solve for fixed points to enable stable iterative refinement in language and reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attractor Models refine proposed output embeddings by solving for the fixed point of the attractor module, with gradients computed via implicit differentiation. This formulation keeps training memory constant in effective depth and selects iterations adaptively by convergence. The resulting models outperform standard Transformers on language-model pretraining while reducing cost, and small instances achieve high accuracy on difficult reasoning benchmarks. Fixed-point training further induces equilibrium internalization, allowing the solver to be removed at inference with minimal degradation.
What carries the argument
The attractor module: it solves the fixed-point equation for refined embeddings, using implicit differentiation to produce constant-memory gradients.
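To make the mechanism concrete, here is a minimal sketch of the described forward/backward pattern, assuming a PyTorch-style setup and following the standard Deep Equilibrium recipe; the module shapes, solver loop, tolerances, and the truncated Neumann-series backward are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed PyTorch, DEQ-style): a backbone proposes z0, an
# attractor f refines it to the fixed point z* = f(z*, x), and gradients
# flow through z* via implicit differentiation.
import torch
import torch.nn as nn

class Attractor(nn.Module):
    """One refinement step f(z, x); the solver finds z* = f(z*, x)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, z, x):
        return self.norm(z + self.net(torch.cat([z, x], dim=-1)))

def attractor_solve(f, z0, x, max_iter=50, tol=1e-4, bwd_iter=20):
    # Forward: iterate to convergence with no graph, so memory stays
    # constant however many iterations the tolerance demands.
    z = z0.detach()
    with torch.no_grad():
        for _ in range(max_iter):
            z_new = f(z, x)
            if (z_new - z).norm() / (z.norm() + 1e-8) < tol:
                z = z_new
                break
            z = z_new
    # Re-engage autograd with one step at the equilibrium (for df/dtheta).
    z = f(z.detach(), x)
    # Separate copy of the step for cheap J^T-vector products in the hook.
    z_in = z.clone().detach().requires_grad_()
    f0 = f(z_in, x)

    def implicit_grad(grad):
        # Solve g = J^T g + grad by fixed-point iteration, a truncated
        # Neumann series for (I - J)^{-T} grad.
        g = grad
        for _ in range(bwd_iter):
            g = torch.autograd.grad(f0, z_in, g, retain_graph=True)[0] + grad
        return g

    z.register_hook(implicit_grad)
    return z
```

In this reading, the iteration count adapts per input (the `tol` check), while the backward pass touches only a single step's activations, which is what keeps training memory constant in effective depth.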
If this is right
- A 770M Attractor Model achieves better language-modeling perplexity than a 1.3B Transformer trained on twice as many tokens.
- 27M-parameter Attractor Models reach 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard using roughly 1000 examples.
- Training memory remains constant as effective recurrence depth increases.
- Models exhibit equilibrium internalization, permitting the attractor solver to be removed at inference time with little performance drop (measurable with the sketch after this list).
- The approach delivers Pareto improvements in perplexity, downstream accuracy, and training cost over both standard Transformers and prior looped models.
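The internalization claim is directly measurable. A hedged sketch, reusing the assumed attractor `f` and proposal `z0` from the sketch above: if training pulls `z0` close to the equilibrium `z*`, the gap is small and the solver becomes skippable at inference.

```python
import torch

@torch.no_grad()
def internalization_gap(f, z0, x, max_iter=50, tol=1e-4):
    """Relative distance between the backbone's proposal z0 and the
    equilibrium z* it converges to; small values indicate the solver
    can be dropped at inference with little degradation."""
    z = z0
    for _ in range(max_iter):
        z_new = f(z, x)
        if (z_new - z).norm() / (z.norm() + 1e-8) < tol:
            z = z_new
            break
        z = z_new
    return ((z0 - z).norm() / (z.norm() + 1e-8)).item()
```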
Where Pith is reading between the lines
- Equilibrium internalization could let deployed systems run without any iterative solver, cutting inference latency while retaining the benefits of training-time refinement.
- Constant-memory training opens the possibility of scaling effective depth far beyond what explicit unrolling currently allows.
- The same fixed-point mechanism might transfer to other iterative tasks such as multi-step planning or symbolic manipulation where explicit loops have been hard to stabilize.
Load-bearing premise
Solving the fixed-point equation via the attractor module and implicit differentiation produces gradients and behavior equivalent to explicit looped iteration without introducing convergence artifacts or bias in the learned representations.
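For reference, the standard equilibrium-model identity this premise rests on, in our notation rather than the paper's: if the attractor output is defined by the fixed point z* = f_θ(z*, x), implicit differentiation gives the loss gradient without storing any iterates.

```latex
% Standard implicit-function-theorem gradient for a fixed point
% z^* = f_\theta(z^*, x); notation ours, not the paper's.
\[
\frac{\partial \mathcal{L}}{\partial \theta}
  = \frac{\partial \mathcal{L}}{\partial z^{*}}
    \left( I - \left.\frac{\partial f_\theta}{\partial z}\right|_{z^{*}} \right)^{-1}
    \left.\frac{\partial f_\theta}{\partial \theta}\right|_{z^{*}}
\]
```

The identity is well defined when I - ∂f/∂z is invertible at z*, and it equals the limit of backpropagating through infinitely many converged unrolled steps; the premise is that finite solvers and truncated inverse approximations preserve this equivalence in practice.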
What would settle it
An experiment in which forcing the model to use explicit fixed-depth iteration at the same effective depth produces unstable training or measurably worse perplexity and reasoning accuracy than the implicit attractor version.
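A minimal sketch of the two training modes such an experiment would contrast, assuming the same PyTorch-style setup as the sketch above; the point is the shape of the autograd graph, not the particular solver.

```python
import torch

def unrolled_depth_k(f, z0, x, depth):
    # Explicit iteration: every iterate stays in the autograd graph,
    # so activation memory grows linearly with depth.
    z = z0
    for _ in range(depth):
        z = f(z, x)
    return z

def implicit_depth_k(f, z0, x, depth):
    # Implicit-style iteration: intermediate iterates are discarded and
    # only the final step is differentiated (constant memory); the full
    # implicit gradient additionally needs the (I - J)^{-T} correction.
    z = z0
    with torch.no_grad():
        for _ in range(depth - 1):
            z = f(z, x)
    return f(z.detach(), x)
```

Training both at matched effective depth and comparing loss curves, perplexity, and reasoning accuracy would directly test whether the implicit formulation does more than save memory.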
Original abstract
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3 fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Attractor Models, consisting of a backbone network that proposes initial output embeddings followed by an attractor module that refines them by solving a fixed-point equation, with gradients obtained via implicit differentiation. This design is claimed to enable constant-memory training, adaptive iteration depth based on convergence, and improved performance over standard Transformers and looped architectures. Key empirical results include a 770M-parameter Attractor Model outperforming a 1.3B Transformer trained on twice as many tokens in language modeling (with perplexity gains up to 46.6% and downstream accuracy up to 19.7%), and a 27M-parameter model with ~1000 examples achieving 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard. The work also reports a novel 'equilibrium internalization' effect allowing the solver to be dropped at inference with minimal degradation.
Significance. If the central claims hold, this approach could make iterative refinement practical at scale by converting recurrence into an internalized computation, offering Pareto improvements in efficiency and performance for both large-scale pretraining and data-efficient reasoning. The equilibrium internalization observation, if robust, would be a notable contribution to understanding how models can learn to approximate fixed-point iteration without explicit loops.
major comments (3)
- [§3.2] Attractor module and implicit differentiation: The central claim that implicit differentiation on the fixed-point equation produces gradients and behavior equivalent to explicit looped iteration (without convergence artifacts or representation bias) is load-bearing for all stability, memory, and performance claims, yet the manuscript provides no convergence guarantees, uniqueness proofs for the fixed point, or ablations comparing implicit vs. explicit gradients. This bears directly on the risk that reported gains (e.g., the 770M vs. 1.3B comparison) could arise from training artifacts rather than true iterative refinement.
- [§5.1, Table 2] Language modeling results: The Pareto improvement claim for the 770M Attractor Model over the 1.3B Transformer on twice the tokens requires explicit confirmation that data mixtures, token counts, and optimization hyperparameters are matched; without this, the 46.6% perplexity gain cannot be attributed to the attractor mechanism rather than confounding factors.
- [§6.2] Reasoning experiments: The 27M model achieving 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard with only ~1000 examples is presented as evidence of favorable scaling where larger models fail, but the manuscript lacks ablations isolating the contribution of the attractor fixed-point solver versus the backbone or data construction; this is load-bearing for the claim that the architecture enables data-efficient reasoning.
minor comments (2)
- [§4] The abstract and §4 mention 'equilibrium internalization' as a discovered side effect, but the definition and measurement (e.g., how 'near equilibrium' is quantified and the exact inference-time degradation) should be formalized earlier with a dedicated equation or metric; one candidate formalization is sketched after this list.
- Figure captions and axis labels in the scaling plots (e.g., Figure 3) use inconsistent notation for model sizes and token counts; clarify whether parameter counts include the attractor module.
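One candidate formalization, offered here as a suggestion rather than the paper's definition: a per-input internalization gap plus a solver-ablation delta.

```latex
% Candidate metrics (ours): z_0 is the backbone's proposal, z^* the solved
% fixed point; PPL can be replaced by any task metric.
\[
\delta(x) = \frac{\lVert z_0(x) - z^{*}(x) \rVert_2}{\lVert z^{*}(x) \rVert_2},
\qquad
\Delta_{\mathrm{drop}} = \mathrm{PPL}_{\mathrm{solver\ off}} - \mathrm{PPL}_{\mathrm{solver\ on}}
\]
```

Equilibrium internalization would then be reported as a small expected gap E_x[δ(x)] together with a near-zero Δ_drop on each benchmark.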
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment point by point below, providing additional details and committing to revisions where they improve the work without misrepresenting our results.
Point-by-point responses
- Referee: [§3.2] Attractor module and implicit differentiation: The central claim that implicit differentiation on the fixed-point equation produces gradients and behavior equivalent to explicit looped iteration (without convergence artifacts or representation bias) is load-bearing for all stability, memory, and performance claims, yet the manuscript provides no convergence guarantees, uniqueness proofs for the fixed point, or ablations comparing implicit vs. explicit gradients. This bears directly on the risk that reported gains (e.g., the 770M vs. 1.3B comparison) could arise from training artifacts rather than true iterative refinement.
Authors: We agree that stronger theoretical and empirical grounding would benefit the paper. Implicit differentiation via the implicit function theorem yields gradients equivalent to unrolled iteration at the fixed point by construction, as established in the DEQ literature; we will expand the discussion in §3.2 to include conditions for local uniqueness (e.g., when the attractor Jacobian has spectral radius < 1) and reference relevant contraction-mapping results. We will also add a small-scale ablation comparing implicit versus explicit (truncated) gradients to confirm equivalence and absence of artifacts. These additions will be included in the revision. (revision: partial)
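To make the committed check concrete, here is one hedged way the contraction condition could be estimated empirically, assuming a PyTorch-style attractor `f`: power iteration for the largest singular value σ_max of the Jacobian J = ∂f/∂z at the equilibrium. σ_max < 1 is a sufficient condition, since it upper-bounds the spectral radius; function and variable names are illustrative.

```python
import torch
from torch.autograd.functional import jvp, vjp

def lipschitz_estimate(f, z_star, x, iters=20):
    """Power iteration on J^T J for sigma_max of J = df/dz at z*.
    sigma_max < 1 certifies a local l2-contraction (hence a locally
    unique fixed point) and upper-bounds the spectral radius."""
    g = lambda z: f(z, x)
    v = torch.randn_like(z_star)
    v = v / v.norm()
    for _ in range(iters):
        _, u = jvp(g, z_star, v)   # u = J v
        _, w = vjp(g, z_star, u)   # w = J^T J v
        v = w / (w.norm() + 1e-12)
    _, u = jvp(g, z_star, v)
    return u.norm().item()         # ~ sigma_max(J)
```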
- Referee: [§5.1, Table 2] Language modeling results: The Pareto improvement claim for the 770M Attractor Model over the 1.3B Transformer on twice the tokens requires explicit confirmation that data mixtures, token counts, and optimization hyperparameters are matched; without this, the 46.6% perplexity gain cannot be attributed to the attractor mechanism rather than confounding factors.
Authors: All models were trained on identical data mixtures drawn from the same corpus, using the same optimizer settings, learning-rate schedule, and batch size. The 1.3B Transformer baseline was deliberately trained on twice the tokens to provide a compute-matched comparison; the 770M Attractor Model used half the tokens. We will add an explicit statement in §5.1 and a supplementary table listing the exact token counts, data composition, and hyperparameter values for each model to remove any ambiguity. (revision: yes)
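A quick sanity check on what "compute-matched" requires, under the standard 6ND training-FLOPs approximation; both the heuristic and the solver-overhead factor k (the attractor's average extra forward passes per token) are our assumptions, not figures from the paper.

```latex
% 6ND heuristic: C \approx 6 \cdot \text{params} \cdot \text{tokens};
% k = attractor's average per-token iteration overhead (hypothetical).
\[
\frac{C_{1.3\mathrm{B}}}{C_{770\mathrm{M}}}
  \approx \frac{6 \cdot (1.3\times 10^{9}) \cdot 2T}
               {6 \cdot (7.7\times 10^{8}) \cdot T \cdot k}
  \approx \frac{3.4}{k}
\]
```

On this accounting the two runs are compute-matched when k ≈ 3.4, and any smaller average solver overhead makes the Attractor Model strictly cheaper; the promised supplementary table should state the effective k explicitly.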
- Referee: [§6.2] Reasoning experiments: The 27M model achieving 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard with only ~1000 examples is presented as evidence of favorable scaling where larger models fail, but the manuscript lacks ablations isolating the contribution of the attractor fixed-point solver versus the backbone or data construction; this is load-bearing for the claim that the architecture enables data-efficient reasoning.
Authors: We acknowledge that isolating the attractor module's contribution is important. We will add ablations in the revised §6.2 that (i) compare the full Attractor Model against the backbone alone (without the fixed-point solver) and (ii) vary the number of training examples while holding architecture fixed, to quantify the solver's role in data efficiency. These experiments reuse the same training setup and will be reported alongside the existing results. (revision: yes)
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents an architectural proposal (backbone + attractor fixed-point solver with implicit differentiation) followed by empirical results on language modeling and reasoning benchmarks. No load-bearing mathematical derivation reduces to its own inputs by construction: the fixed-point formulation and implicit gradients are standard techniques from equilibrium models, the performance claims are measured outcomes rather than fitted quantities renamed as predictions, and the equilibrium internalization effect is reported as an observed side-effect rather than a definitional tautology. No self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the provided text as central justifications. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] The implicit function theorem permits differentiation through the solution of the fixed-point equation without unrolling iterations.
invented entities (1)
- Attractor module (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel) echoes: "attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation... equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat recovery from Law of Logic) echoes: "the attractor module appears to act as a moving teacher for the backbone... automatic curriculum... recurrence acts as a moving training target"