Solve the Loop: Attractor Models for Language and Reasoning
Pith reviewed 2026-05-13 05:30 UTC · model grok-4.3
The pith
Attractor Models solve for fixed points to enable stable iterative refinement in language and reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attractor Models refine proposed output embeddings by solving for the fixed point of the attractor module, with gradients computed via implicit differentiation. This formulation keeps training memory constant in effective depth and selects iterations adaptively by convergence. The resulting models outperform standard Transformers on language-model pretraining while reducing cost, and small instances achieve high accuracy on difficult reasoning benchmarks. Fixed-point training further induces equilibrium internalization, allowing the solver to be removed at inference with minimal degradation.
What carries the argument
The attractor module: it solves the fixed-point equation for refined embeddings, using implicit differentiation to produce constant-memory gradients.
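To make the mechanism concrete, here is a minimal sketch of the described forward/backward pattern, assuming a PyTorch-style setup and following the standard Deep Equilibrium recipe; the module shapes, solver loop, tolerances, and the truncated Neumann-series backward are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed PyTorch, DEQ-style): a backbone proposes z0, an
# attractor f refines it to the fixed point z* = f(z*, x), and gradients
# flow through z* via implicit differentiation.
import torch
import torch.nn as nn

class Attractor(nn.Module):
    """One refinement step f(z, x); the solver finds z* = f(z*, x)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, z, x):
        return self.norm(z + self.net(torch.cat([z, x], dim=-1)))

def attractor_solve(f, z0, x, max_iter=50, tol=1e-4, bwd_iter=20):
    # Forward: iterate to convergence with no graph, so memory stays
    # constant however many iterations the tolerance demands.
    z = z0.detach()
    with torch.no_grad():
        for _ in range(max_iter):
            z_new = f(z, x)
            if (z_new - z).norm() / (z.norm() + 1e-8) < tol:
                z = z_new
                break
            z = z_new
    # Re-engage autograd with one step at the equilibrium (for df/dtheta).
    z = f(z.detach(), x)
    # Separate copy of the step for cheap J^T-vector products in the hook.
    z_in = z.clone().detach().requires_grad_()
    f0 = f(z_in, x)

    def implicit_grad(grad):
        # Solve g = J^T g + grad by fixed-point iteration, a truncated
        # Neumann series for (I - J)^{-T} grad.
        g = grad
        for _ in range(bwd_iter):
            g = torch.autograd.grad(f0, z_in, g, retain_graph=True)[0] + grad
        return g

    z.register_hook(implicit_grad)
    return z
```

In this reading, the iteration count adapts per input (the `tol` check), while the backward pass touches only a single step's activations, which is what keeps training memory constant in effective depth.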
If this is right
- A 770M Attractor Model achieves better language-modeling perplexity than a 1.3B Transformer trained on twice as many tokens.
- 27M-parameter Attractor Models reach 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard using roughly 1000 examples.
- Training memory remains constant as effective recurrence depth increases.
- Models exhibit equilibrium internalization, permitting the attractor solver to be removed at inference time with little performance drop (measurable with the sketch after this list).
- The approach delivers Pareto improvements in perplexity, downstream accuracy, and training cost over both standard Transformers and prior looped models.
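The internalization claim is directly measurable. A hedged sketch, reusing the assumed attractor `f` and proposal `z0` from the sketch above: if training pulls `z0` close to the equilibrium `z*`, the gap is small and the solver becomes skippable at inference.

```python
import torch

@torch.no_grad()
def internalization_gap(f, z0, x, max_iter=50, tol=1e-4):
    """Relative distance between the backbone's proposal z0 and the
    equilibrium z* it converges to; small values indicate the solver
    can be dropped at inference with little degradation."""
    z = z0
    for _ in range(max_iter):
        z_new = f(z, x)
        if (z_new - z).norm() / (z.norm() + 1e-8) < tol:
            z = z_new
            break
        z = z_new
    return ((z0 - z).norm() / (z.norm() + 1e-8)).item()
```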
Where Pith is reading between the lines
- Equilibrium internalization could let deployed systems run without any iterative solver, cutting inference latency while retaining the benefits of training-time refinement.
- Constant-memory training opens the possibility of scaling effective depth far beyond what explicit unrolling currently allows.
- The same fixed-point mechanism might transfer to other iterative tasks such as multi-step planning or symbolic manipulation where explicit loops have been hard to stabilize.
Load-bearing premise
Solving the fixed-point equation via the attractor module and implicit differentiation produces gradients and behavior equivalent to explicit looped iteration without introducing convergence artifacts or bias in the learned representations.
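For reference, the standard equilibrium-model identity this premise rests on, in our notation rather than the paper's: if the attractor output is defined by the fixed point z* = f_θ(z*, x), implicit differentiation gives the loss gradient without storing any iterates.

```latex
% Standard implicit-function-theorem gradient for a fixed point
% z^* = f_\theta(z^*, x); notation ours, not the paper's.
\[
\frac{\partial \mathcal{L}}{\partial \theta}
  = \frac{\partial \mathcal{L}}{\partial z^{*}}
    \left( I - \left.\frac{\partial f_\theta}{\partial z}\right|_{z^{*}} \right)^{-1}
    \left.\frac{\partial f_\theta}{\partial \theta}\right|_{z^{*}}
\]
```

The identity is well defined when I - ∂f/∂z is invertible at z*, and it equals the limit of backpropagating through infinitely many converged unrolled steps; the premise is that finite solvers and truncated inverse approximations preserve this equivalence in practice.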
What would settle it
An experiment in which forcing the model to use explicit fixed-depth iteration at the same effective depth produces unstable training or measurably worse perplexity and reasoning accuracy than the implicit attractor version.
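A minimal sketch of the two training modes such an experiment would contrast, assuming the same PyTorch-style setup as the sketch above; the point is the shape of the autograd graph, not the particular solver.

```python
import torch

def unrolled_depth_k(f, z0, x, depth):
    # Explicit iteration: every iterate stays in the autograd graph,
    # so activation memory grows linearly with depth.
    z = z0
    for _ in range(depth):
        z = f(z, x)
    return z

def implicit_depth_k(f, z0, x, depth):
    # Implicit-style iteration: intermediate iterates are discarded and
    # only the final step is differentiated (constant memory); the full
    # implicit gradient additionally needs the (I - J)^{-T} correction.
    z = z0
    with torch.no_grad():
        for _ in range(depth - 1):
            z = f(z, x)
    return f(z.detach(), x)
```

Training both at matched effective depth and comparing loss curves, perplexity, and reasoning accuracy would directly test whether the implicit formulation does more than save memory.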
Original abstract
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3 fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Attractor Models, consisting of a backbone network that proposes initial output embeddings followed by an attractor module that refines them by solving a fixed-point equation, with gradients obtained via implicit differentiation. This design is claimed to enable constant-memory training, adaptive iteration depth based on convergence, and improved performance over standard Transformers and looped architectures. Key empirical results include a 770M-parameter Attractor Model outperforming a 1.3B Transformer trained on twice as many tokens in language modeling (with perplexity gains up to 46.6% and downstream accuracy up to 19.7%), and a 27M-parameter model with ~1000 examples achieving 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard. The work also reports a novel 'equilibrium internalization' effect allowing the solver to be dropped at inference with minimal degradation.
Significance. If the central claims hold, this approach could make iterative refinement practical at scale by converting recurrence into an internalized computation, offering Pareto improvements in efficiency and performance for both large-scale pretraining and data-efficient reasoning. The equilibrium internalization observation, if robust, would be a notable contribution to understanding how models can learn to approximate fixed-point iteration without explicit loops.
major comments (3)
- [§3.2] Attractor module and implicit differentiation: The central claim that implicit differentiation on the fixed-point equation produces gradients and behavior equivalent to explicit looped iteration (without convergence artifacts or representation bias) is load-bearing for all stability, memory, and performance claims, yet the manuscript provides no convergence guarantees, uniqueness proofs for the fixed point, or ablations comparing implicit vs. explicit gradients. This bears directly on the risk that reported gains (e.g., the 770M vs. 1.3B comparison) could arise from training artifacts rather than true iterative refinement.
- [§5.1, Table 2] Language modeling results: The Pareto improvement claim for the 770M Attractor Model over the 1.3B Transformer on twice the tokens requires explicit confirmation that data mixtures, token counts, and optimization hyperparameters are matched; without this, the 46.6% perplexity gain cannot be attributed to the attractor mechanism rather than confounding factors.
- [§6.2] Reasoning experiments: The 27M model achieving 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard with only ~1000 examples is presented as evidence of favorable scaling where larger models fail, but the manuscript lacks ablations isolating the contribution of the attractor fixed-point solver versus the backbone or data construction; this is load-bearing for the claim that the architecture enables data-efficient reasoning.
minor comments (2)
- [§4] The abstract and §4 mention 'equilibrium internalization' as a discovered side effect, but the definition and measurement (e.g., how 'near equilibrium' is quantified and the exact inference-time degradation) should be formalized earlier with a dedicated equation or metric; one candidate formalization is sketched after this list.
- Figure captions and axis labels in the scaling plots (e.g., Figure 3) use inconsistent notation for model sizes and token counts; clarify whether parameter counts include the attractor module.
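One candidate formalization, offered here as a suggestion rather than the paper's definition: a per-input internalization gap plus a solver-ablation delta.

```latex
% Candidate metrics (ours): z_0 is the backbone's proposal, z^* the solved
% fixed point; PPL can be replaced by any task metric.
\[
\delta(x) = \frac{\lVert z_0(x) - z^{*}(x) \rVert_2}{\lVert z^{*}(x) \rVert_2},
\qquad
\Delta_{\mathrm{drop}} = \mathrm{PPL}_{\mathrm{solver\ off}} - \mathrm{PPL}_{\mathrm{solver\ on}}
\]
```

Equilibrium internalization would then be reported as a small expected gap E_x[δ(x)] together with a near-zero Δ_drop on each benchmark.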
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment point by point below, providing additional details and committing to revisions where they improve the work without misrepresenting our results.
Point-by-point responses
- Referee: [§3.2] Attractor module and implicit differentiation: The central claim that implicit differentiation on the fixed-point equation produces gradients and behavior equivalent to explicit looped iteration (without convergence artifacts or representation bias) is load-bearing for all stability, memory, and performance claims, yet the manuscript provides no convergence guarantees, uniqueness proofs for the fixed point, or ablations comparing implicit vs. explicit gradients. This bears directly on the risk that reported gains (e.g., the 770M vs. 1.3B comparison) could arise from training artifacts rather than true iterative refinement.
Authors: We agree that stronger theoretical and empirical grounding would benefit the paper. Implicit differentiation via the implicit function theorem yields gradients equivalent to unrolled iteration at the fixed point by construction, as established in the DEQ literature; we will expand the discussion in §3.2 to include conditions for local uniqueness (e.g., when the attractor Jacobian has spectral radius < 1) and reference relevant contraction-mapping results. We will also add a small-scale ablation comparing implicit versus explicit (truncated) gradients to confirm equivalence and absence of artifacts. These additions will be included in the revision. (revision: partial)
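To make the committed check concrete, here is one hedged way the contraction condition could be estimated empirically, assuming a PyTorch-style attractor `f`: power iteration for the largest singular value σ_max of the Jacobian J = ∂f/∂z at the equilibrium. σ_max < 1 is a sufficient condition, since it upper-bounds the spectral radius; function and variable names are illustrative.

```python
import torch
from torch.autograd.functional import jvp, vjp

def lipschitz_estimate(f, z_star, x, iters=20):
    """Power iteration on J^T J for sigma_max of J = df/dz at z*.
    sigma_max < 1 certifies a local l2-contraction (hence a locally
    unique fixed point) and upper-bounds the spectral radius."""
    g = lambda z: f(z, x)
    v = torch.randn_like(z_star)
    v = v / v.norm()
    for _ in range(iters):
        _, u = jvp(g, z_star, v)   # u = J v
        _, w = vjp(g, z_star, u)   # w = J^T J v
        v = w / (w.norm() + 1e-12)
    _, u = jvp(g, z_star, v)
    return u.norm().item()         # ~ sigma_max(J)
```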
- Referee: [§5.1, Table 2] Language modeling results: The Pareto improvement claim for the 770M Attractor Model over the 1.3B Transformer on twice the tokens requires explicit confirmation that data mixtures, token counts, and optimization hyperparameters are matched; without this, the 46.6% perplexity gain cannot be attributed to the attractor mechanism rather than confounding factors.
Authors: All models were trained on identical data mixtures drawn from the same corpus, using the same optimizer settings, learning-rate schedule, and batch size. The 1.3B Transformer baseline was deliberately trained on twice the tokens to provide a compute-matched comparison; the 770M Attractor Model used half the tokens. We will add an explicit statement in §5.1 and a supplementary table listing the exact token counts, data composition, and hyperparameter values for each model to remove any ambiguity. (revision: yes)
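A quick sanity check on what "compute-matched" requires, under the standard 6ND training-FLOPs approximation; both the heuristic and the solver-overhead factor k (the attractor's average extra forward passes per token) are our assumptions, not figures from the paper.

```latex
% 6ND heuristic: C \approx 6 \cdot \text{params} \cdot \text{tokens};
% k = attractor's average per-token iteration overhead (hypothetical).
\[
\frac{C_{1.3\mathrm{B}}}{C_{770\mathrm{M}}}
  \approx \frac{6 \cdot (1.3\times 10^{9}) \cdot 2T}
               {6 \cdot (7.7\times 10^{8}) \cdot T \cdot k}
  \approx \frac{3.4}{k}
\]
```

On this accounting the two runs are compute-matched when k ≈ 3.4, and any smaller average solver overhead makes the Attractor Model strictly cheaper; the promised supplementary table should state the effective k explicitly.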
- Referee: [§6.2] Reasoning experiments: The 27M model achieving 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard with only ~1000 examples is presented as evidence of favorable scaling where larger models fail, but the manuscript lacks ablations isolating the contribution of the attractor fixed-point solver versus the backbone or data construction; this is load-bearing for the claim that the architecture enables data-efficient reasoning.
Authors: We acknowledge that isolating the attractor module's contribution is important. We will add ablations in the revised §6.2 that (i) compare the full Attractor Model against the backbone alone (without the fixed-point solver) and (ii) vary the number of training examples while holding architecture fixed, to quantify the solver's role in data efficiency. These experiments reuse the same training setup and will be reported alongside the existing results. (revision: yes)
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents an architectural proposal (backbone + attractor fixed-point solver with implicit differentiation) followed by empirical results on language modeling and reasoning benchmarks. No load-bearing mathematical derivation reduces to its own inputs by construction: the fixed-point formulation and implicit gradients are standard techniques from equilibrium models, the performance claims are measured outcomes rather than fitted quantities renamed as predictions, and the equilibrium internalization effect is reported as an observed side-effect rather than a definitional tautology. No self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the provided text as central justifications. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] The implicit function theorem permits differentiation through the solution of the fixed-point equation without unrolling iterations.
invented entities (1)
- Attractor module (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel) echoes: "attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation... equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat recovery from Law of Logic) echoes: "the attractor module appears to act as a moving teacher for the backbone... automatic curriculum... recurrence acts as a moving training target"