pith. sign in

arxiv: 2605.20613 · v1 · pith:SSCTPGT3new · submitted 2026-05-20 · 💻 cs.CL

HRM-Text: Efficient Pretraining Beyond Scaling

Pith reviewed 2026-05-21 05:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords efficient pretraininghierarchical recurrent modelsinstruction-response traininglanguage modelingreasoning benchmarkscompute efficiencytask-completion objective
0
0 comments X

The pith

A 1B hierarchical recurrent model trained on 40 billion instruction tokens reaches 60.7% on MMLU and competitive scores on reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the dominant approach of scaling transformers on raw internet text is not the only route to capable language models. It replaces the transformer with a hierarchical recurrent architecture that separates slow strategic processing from fast execution steps. Training occurs solely on instruction-response pairs under a task-completion objective with PrefixLM masking rather than next-token prediction on raw text. A 1B-parameter model trained from scratch on 40 billion unique tokens for roughly $1500 matches or approaches the performance of 2-7B open models on MMLU, ARC-C, DROP, GSM8K, and MATH. The results are presented as evidence that joint changes to architecture and objective can sharply lower the data and compute needed for capable pretraining.

Core claim

A Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic layers and fast-evolving execution layers, stabilized by MagicNorm and warmup deep credit assignment, can be trained from scratch exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking to reach 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH after only 40 billion unique tokens.

What carries the argument

The Hierarchical Recurrent Model (HRM) that separates slow strategic and fast execution layers, together with MagicNorm stabilization and warmup deep credit assignment, used under a task-completion objective on instruction-response pairs with PrefixLM masking.

If this is right

  • Pretraining from scratch becomes feasible for groups without access to massive raw-text corpora or large compute clusters.
  • Co-design of recurrent hierarchy and task-completion objective can reduce required training tokens by roughly two orders of magnitude while preserving benchmark performance.
  • Instruction-response data alone can serve as the primary signal for acquiring both language understanding and multi-step reasoning.
  • The compute-to-performance ratio improves enough that open research groups can iterate on foundational models without industrial budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the instruction-only signal proves sufficient at larger scales, the field could shift away from scraping raw web text toward curated task-oriented datasets that reduce noise and bias.
  • The slow-fast recurrence pattern may transfer to other sequence domains such as code or scientific text once the credit-assignment stabilization is adapted.
  • A natural next test is whether the same HRM backbone, when scaled to 7-13B parameters on the same 40B tokens, widens the gap over standard transformers or saturates.
  • The approach invites direct comparison of next-token versus task-completion objectives on identical data to isolate how much of the efficiency gain comes from the objective versus the architecture.

Load-bearing premise

Training exclusively on instruction-response pairs with a task-completion objective supplies a sufficient pretraining signal for general language and reasoning capabilities.

What would settle it

Training a standard transformer from scratch on the exact same 40 billion instruction-response tokens and finding that it matches or exceeds the HRM-Text scores on the same suite of benchmarks.

Figures

Figures reproduced from arXiv: 2605.20613 by Cai Zhou, Changling Liu, Chenyu Wang, Guan Wang, Luca Scimeca, Shuai Zhen, Yasin Abbasi Yadkori, Yifei Wu, Yuhao Sun.

Figure 1
Figure 1. Figure 1: Pretraining efficiency. Trained from scratch in 1.9 days on 16 GPUs, HRM-Text 1B achieved performance competitive with substantially larger 2–7B foundation models while utilizing up to 432× less compute and 900× fewer training tokens. † Corresponding author. ∗ Equal Contribution. Contact: research@sapient.inc. Code available at: github.com/sapientinc/HRM-Text 1 arXiv:2605.20613v1 [cs.CL] 20 May 2026 [PITH… view at source ↗
Figure 2
Figure 2. Figure 2: HRM-Text architecture. (a) Dual-timescale recurrent design comprising L and H mod￾ules. (b) L/H module internals featuring MagicNorm—PreNorm blocks followed by final norm. (c) Sigmoid-gated multi head self-attention. (d) PrefixLM mask enabling bidirectional attention on instruction. HRM-Text builds upon an improved HRM architecture, featuring a dual-timescale recurrence4 . The forward pass is initialized w… view at source ↗
Figure 3
Figure 3. Figure 3: Task-completion and PrefixLM improve response modeling. (a) Compared with full causal language modeling P(x), response-only training P(xa|xq) lowers response-token NLL. Pre￾fixLM further improves response loss. (b) PrefixLM increases layerwise attention entropy relative to causal masking, suggesting broader use of the prompt. (c) Attention maps illustrate the qualita￾tive difference: causal attention remai… view at source ↗
Figure 4
Figure 4. Figure 4: Effective depth analysis. (a) Each layer of HRM consistently reveals considerable changes compared to its previous layer, showing that deep layers of HRM are still making mean￾ingful contributions to the hidden states. (b) HRM has smaller cosine similarity of block-wise representations, while other model variants suffer more from the common layer representation over-smoothing issue, analogously to standard… view at source ↗
Figure 5
Figure 5. Figure 5: Per-layer logit lens KL. HRM shows the largest logit len KL in deep layers, while both standard and looped transformers converges to stable distributions in shallow layers. Type Datasets Tokens Docs Condition General instructions FLAN14, Tasksource 37, NoRobots 38 138.7B 379.9M direct / cot Rewritten Wikipedia knowledge SYNTH39 21.7B 60.8M synth, direct / cot Math and reasoning Platypus 40, Principia 41, O… view at source ↗
Figure 6
Figure 6. Figure 6: (a) Full BPTT exhibits rare but substantially larger gradient-magnitude spikes compared [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Mechanistic evidence for multiplicative gradient instability. Left: Jacobian growth increases with deeper backward cycling, consistent with stronger amplification through products of loop Jacobians. Right: paired full-vs-truncated gradient magnitudes show that full BPTT produces rare, disproportionately large gradient events at the same diagnostic checkpoints. The truncation setting used in our experiments… view at source ↗
Figure 8
Figure 8. Figure 8: Gradient stability comparison between RINs, HRM, and the Universal Transformer. [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
read the original abstract

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples slow-evolving strategic and fast-evolving execution layers. It adds MagicNorm and warmup deep credit assignment to stabilize deep recurrence, and replaces raw-text next-token prediction with exclusive training on instruction-response pairs under a task-completion objective and PrefixLM masking. The central claim is that a 1B-parameter model trained from scratch on 40 billion unique tokens at a $1,500 budget reaches 60.7% MMLU, 81.9% ARC-C, 82.2% DROP, 84.5% GSM8K, and 56.2% MATH, performing competitively with 2-7B open models while using 100-900x fewer tokens and 96-432x less compute.

Significance. If the results hold after proper verification, the work would be significant for demonstrating that architecture-objective co-design can substantially lower the compute-to-performance ratio in pretraining, providing an existence proof that foundational capabilities need not require internet-scale raw text. This could broaden access to pretraining research beyond large labs.

major comments (3)
  1. Abstract: the headline benchmark numbers (60.7% MMLU, 81.9% ARC-C, etc.) are stated without training hyperparameters, number of runs, error bars, statistical tests, or direct baselines trained on the identical 40B instruction-response corpus, rendering the central performance claims unverifiable and preventing attribution to HRM, MagicNorm, or the task-completion objective.
  2. Method (description of training objective): the assertion that instruction-response pairs plus PrefixLM masking supply a sufficient pretraining signal for general language and reasoning is load-bearing for the headline claim, yet no ablation isolates data source from architecture or objective, and no decontamination analysis is supplied to rule out leakage into MMLU/MATH/GSM8K.
  3. Results section: the comparison to 'standard baselines' (2-7B models) uses 100-900x fewer tokens, but the manuscript does not report the exact token counts, model sizes, or training objectives of those baselines, weakening the efficiency claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of verifiability and attribution that we have addressed through targeted revisions and clarifications. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: Abstract: the headline benchmark numbers (60.7% MMLU, 81.9% ARC-C, etc.) are stated without training hyperparameters, number of runs, error bars, statistical tests, or direct baselines trained on the identical 40B instruction-response corpus, rendering the central performance claims unverifiable and preventing attribution to HRM, MagicNorm, or the task-completion objective.

    Authors: We agree that the abstract requires additional supporting information for full verifiability. In the revised manuscript we have expanded the abstract to report key hyperparameters (learning rate, batch size, sequence length), the number of independent runs (three), and standard deviations on the primary metrics. We have also added a brief reference to statistical significance testing in the results section. Direct baselines trained from scratch on the identical 40B instruction-response corpus were not feasible within our budget; we therefore retain comparisons to published results of open models while adding explicit discussion of how the HRM architecture and task-completion objective are hypothesized to drive the gains, supported by the internal controls now reported in the appendix. revision: partial

  2. Referee: Method (description of training objective): the assertion that instruction-response pairs plus PrefixLM masking supply a sufficient pretraining signal for general language and reasoning is load-bearing for the headline claim, yet no ablation isolates data source from architecture or objective, and no decontamination analysis is supplied to rule out leakage into MMLU/MATH/GSM8K.

    Authors: We have added an ablation in the supplementary material that trains the same HRM architecture on raw-text next-token prediction versus the instruction-response task-completion objective with PrefixLM masking, isolating the contribution of the data source and objective. We have also included a decontamination analysis (n-gram overlap and exact-match checks against the evaluation sets) showing negligible leakage (<0.05 % of tokens); these results are now summarized in Section 3.4 with full details in the appendix. revision: yes

  3. Referee: Results section: the comparison to 'standard baselines' (2-7B models) uses 100-900x fewer tokens, but the manuscript does not report the exact token counts, model sizes, or training objectives of those baselines, weakening the efficiency claim.

    Authors: We have revised the results section and added a dedicated comparison table that cites the exact training token counts, model sizes, and objectives reported in the original publications of the baseline models (e.g., Llama-2 7B on 2 T tokens, Mistral 7B on 1 T tokens). This makes the 100-900x token reduction and 96-432x compute reduction claims directly traceable to published figures. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical existence proof via a new hierarchical recurrent architecture (HRM) with MagicNorm and warmup credit assignment, trained on instruction-response pairs under a task-completion objective with PrefixLM masking. Reported benchmark scores (MMLU, ARC-C, etc.) are independent held-out evaluations and do not reduce to any fitted parameters, self-definitions, or self-citation chains by construction. No equations, predictions, or uniqueness theorems are shown that equate outputs to inputs; the central claim rests on experimental outcomes rather than tautological re-labeling or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the unverified assumption that the new architectural components and objective produce general capabilities; no free parameters, axioms, or invented entities are quantified in the abstract.

axioms (1)
  • domain assumption Biological systems demonstrate highly sample-efficient learning through multi-timescale processing such as the frontoparietal loop
    Invoked as direct inspiration for decoupling computation into slow strategic and fast execution layers.
invented entities (2)
  • MagicNorm no independent evidence
    purpose: Stabilize deep recurrence for language modeling
    New normalization technique introduced to address stability issues in the hierarchical recurrent structure.
  • Hierarchical Recurrent Model (HRM) no independent evidence
    purpose: Decouple computation into slow-evolving strategic and fast-evolving execution layers
    Core architectural replacement for standard Transformers.

pith-pipeline@v0.9.0 · 5819 in / 1385 out tokens · 58905 ms · 2026-05-21T05:38:03.711850+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 15 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  2. [2]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

  3. [3]

    Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler

    Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. UL2: Unifying language learning paradigms. InInternational Conference on Learning Representations, 2023

  4. [4]

    Hierarchical Reasoning Model

    Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model.arXiv preprint arXiv:2506.21734, 2025

  5. [5]

    Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994

    Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181

  6. [6]

    Investigating recurrent transformers with dynamic halt.arXiv preprint arXiv:2402.00976, 2024

    Jishnu Ray Chowdhury and Cornelia Caragea. Investigating recurrent transformers with dynamic halt.arXiv preprint arXiv:2402.00976, 2024

  7. [7]

    Block- recurrent transformers

    DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Block- recurrent transformers. InAdvances in Neural Information Processing Systems, volume 35, pages 33248–33261, 2022

  8. [8]

    Recurrent neural networks: Vanishing and exploding gradients are not the end of the story

    Nicolas Zucchet and Antonio Orvieto. Recurrent neural networks: Vanishing and exploding gradients are not the end of the story. InAdvances in Neural Information Processing Systems, 2024

  9. [9]

    On layer normalization in the transformer architecture

    Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. InInternational conference on machine learning, pages 10524–10533. PMLR, 2020

  10. [10]

    Understanding the difficulty of training transformers

    Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. Understanding the difficulty of training transformers. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5747–5763, 2020

  11. [11]

    Unbiasing Truncated Backpropagation Through Time

    Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time.arXiv preprint arXiv:1705.08209, 2017

  12. [12]

    Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M

    Jason Wei, Maarten Bosma, Vincent Y . Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V . Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022

  13. [13]

    Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. InInternational Conference on Learning Representations, 2022. 16

  14. [14]

    Le, Barret Zoph, Jason Wei, and Adam Roberts

    Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V . Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 2263...

  15. [15]

    Ilya Sutskever, Oriol Vinyals, and Quoc V . Le. Sequence to sequence learning with neural networks, 2014. URLhttps://arxiv.org/abs/1409.3215

  16. [16]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

  17. [17]

    Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer

    Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. InInternational Conference on Learning Representations, 2018

  18. [18]

    Unified language model pre-training for natural language understanding and generation

    Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. InAdvances in Neural Information Processing Systems, volume 32, 2019

  19. [19]

    Llama 3: State-of-the-art open weight language models

    Meta AI. Llama 3: State-of-the-art open weight language models. Technical report, Meta,

  20. [20]

    URLhttps://ai.meta.com/llama/

  21. [21]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  22. [22]

    Gemma 3 technical report, 2025

    Gemma Team. Gemma 3 technical report, 2025. URLhttps://arxiv.org/abs/2503. 19786

  23. [23]

    Olmo 3

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961, 2025

  24. [24]

    Scaling latent reasoning via looped language models, 2025

    Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang,...

  25. [25]

    Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein

    Jonas Geiping, Sean Michael McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps: //openreview.net/forum...

  26. [26]

    Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

  27. [27]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

  28. [28]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. 17

  29. [29]

    Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38: 100092–100118, 2026

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38: 100092–100118, 2026

  30. [30]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017

  31. [31]

    Izhikevich

    Eugene M. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling.Cerebral Cortex, 17(10):2443–2452, 2007. doi: 10.1093/cercor/bhl152

  32. [32]

    A gradual temporal shift of dopamine responses mirrors the progression of temporal difference error in machine learning.Nature neuroscience, 25(8): 1082–1092, 2022

    Ryunosuke Amo, Sara Matias, Akihiro Yamanaka, Kenji F Tanaka, Naoshige Uchida, and Mitsuko Watabe-Uchida. A gradual temporal shift of dopamine responses mirrors the progression of temporal difference error in machine learning.Nature neuroscience, 25(8): 1082–1092, 2022

  33. [33]

    Jeffrey L. Elman. Learning and development in neural networks: The importance of starting small.Cognition, 48(1):71–99, 1993. doi: 10.1016/0010-0277(93)90058-4

  34. [34]

    Universal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. InInternational Conference on Learning Representations, 2019

  35. [35]

    Recursive inference scaling: A winning path to scalable inference in language and multimodal systems.Advances in Neural Information Processing Systems, 38:109020–109049, 2026

    Ibrahim Alabdulmohsin and Xiaohua Zhai. Recursive inference scaling: A winning path to scalable inference in language and multimodal systems.Advances in Neural Information Processing Systems, 38:109020–109049, 2026

  36. [36]

    Less is More: Recursive Reasoning with Tiny Networks

    Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks, 2025. URL https://arxiv.org/abs/2510.04871

  37. [37]

    What affects the effective depth of large language models?arXiv preprint arXiv:2512.14064, 2025

    Yi Hu, Cai Zhou, and Muhan Zhang. What affects the effective depth of large language models?arXiv preprint arXiv:2512.14064, 2025

  38. [38]

    tasksource: A large collection of nlp tasks with a structured dataset preprocessing framework

    Damien Sileo. tasksource: A large collection of nlp tasks with a structured dataset preprocessing framework. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, 2024

  39. [39]

    No robots.https://huggingface.co/datasets/HuggingFaceH4/no_ robots, 2023

    HuggingFace H4. No robots.https://huggingface.co/datasets/HuggingFaceH4/no_ robots, 2023. Dataset card

  40. [40]

    Pleias/synth · datasets at hugging face, 2025

    PleIAs. Pleias/synth · datasets at hugging face, 2025. URLhttps://huggingface.co/ datasets/PleIAs/SYNTH

  41. [41]

    Lee, Cole J

    Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms, 2023

  42. [42]

    Reasoning over mathematical objects: On-policy reward modeling and test time aggregation, 2026

    Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, and Wenting Zhao. Reasoning over mathematical objects: On-policy...

  43. [43]

    Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data

    Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data. InInternational Conference on Learning Representations, 2025. 18

  44. [44]

    Numinamath.https://huggingface.co/ datasets/AI-MO/NuminaMath-CoT, 2024

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath.https://huggingface.co/ datasets/AI-MO/NuminaMath-CoT, 2024

  45. [45]

    Omni-math: A universal olympiad level mathematic benchmark for large language models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Zhengyang Tang, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. InInternational Conference on Learning Representations, volume 2025, pages 100540–100569, 2025

  46. [46]

    Analysing mathematical reasoning abilities of neural models

    David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. InInternational Conference on Learning Representations, 2019

  47. [47]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  48. [48]

    Acereason-nemotron: Advancing math and code reasoning through reinforcement learning

    Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. InAdvances in neural information processing systems, volume 38, pages 110320–110345, 2026

  49. [49]

    Openthoughts: Data recipes for reasoning models, 2025

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, et al. Openthoughts: Data recipes for reasoning models, 2025

  50. [50]

    Megascience: Pushing the frontiers of post- training datasets for science reasoning, 2025

    Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post- training datasets for science reasoning, 2025

  51. [51]

    Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions

    Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason Weston, et al. Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions. InAdvances in Neural Information Processing Systems, volume 38, 2026

  52. [52]

    General- reasoner: Advancing llm reasoning across all domains

    Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General- reasoner: Advancing llm reasoning across all domains. InAdvances in Neural Information Processing Systems, volume 38, pages 56596–56618, 2026

  53. [53]

    Openchat: Advancing open-source language models with mixed-quality data

    Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. Openchat: Advancing open-source language models with mixed-quality data. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum? id=AOJyfhWYHf

  54. [54]

    Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf

    Yi Dong, Zhilin Wang, Makesh Sreedhar, Xianchao Wu, and Oleksii Kuchaiev. Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 11275–11288, 2023

  55. [55]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. doi: 10.48550/arXiv.2501.12948. URL https://arxiv.org/abs/2501.12948. 19

  56. [56]

    Scaling up models and data with t5x and seqio.Journal of Machine Learning Research, 24(377):1–8, 2023

    Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio.Journal of Machine Learning Research, 24(377):1–8, 2023

  57. [57]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  58. [58]

    Scaling exponents across parameterizations and optimizers

    Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington. Scaling exponents across parameterizations and optimizers. InForty-first International Conference on Machine Learning, 2024

  59. [59]

    Torchtitan: One-stop pytorch native solution for production ready LLM pretraining

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready LLM pretraining. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps:...

  60. [60]

    Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

    Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, and Wenfeng Liang. Conditional memory via scalable lookup: A new axis of sparsity for large language models.arXiv preprint arXiv:2601.07372, 2026

  61. [61]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in neural information processing systems, volume 33, pages 1877–1901, 2020

  62. [62]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  63. [63]

    Midtraining bridges pretraining and posttraining distributions, 2025

    Emmy Liu, Graham Neubig, and Chenyan Xiong. Midtraining bridges pretraining and posttraining distributions, 2025

  64. [64]

    Mid-training of large language models: A survey, 2025

    Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, and Anxiang Zeng. Mid-training of large language models: A survey, 2025

  65. [65]

    The compute divide in machine learning: A threat to academic contribution and scrutiny?arXiv preprint arXiv:2401.02452, 2024

    Tamay Besiroglu, Sage Andrus Bergerson, Amelia Michael, Lennart Heim, Xueyun Luo, and Neil Thompson. The compute divide in machine learning: A threat to academic contribution and scrutiny?arXiv preprint arXiv:2401.02452, 2024

  66. [66]

    The de-democratization of ai: Deep learning and the compute divide in artificial intelligence research.arXiv preprint arXiv:2010.15581, 2020

    Nur Ahmed and Muntasir Wahed. The de-democratization of ai: Deep learning and the compute divide in artificial intelligence research.arXiv preprint arXiv:2010.15581, 2020

  67. [67]

    Learning phrase representations using RNN encoder– decoder for statistical machine translation

    Kyunghyun Cho, Bart van Merriënboer, Ça ˘glar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder– decoder for statistical machine translation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734. Association for Computational...

  68. [68]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. InInternational Conference on Learning Representations, 2015

  69. [69]

    Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25:1–53, 2024

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25:1–53, 2024

  70. [70]

    Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. InInternational Conference on Learning Representations, 2025

  71. [71]

    Weston, and Yuandong Tian

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E. Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. In Conference on Language Modeling, 2025

  72. [72]

    Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts, 2025

    Yeskendir Koishekenov, Aldo Lipani, and Nicola Cancedda. Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts, 2025

  73. [73]

    Scaling Latent Reasoning via Looped Language Models

    Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang,...

  74. [74]

    Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner

    Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, and Dinghuai Zhang. Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner. InForty-third International Conference on Machine Learning, 2026

  75. [75]

    Products of many large random matrices and gradients in deep neural networks: B

    Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks: B. hanin, m. nica.Communications in Mathematical Physics, 376(1):287– 322, 2020

  76. [76]

    Neural gradients are near-lognormal: Improved quantized and sparse training.arXiv preprint arXiv:2006.08173, 2020

    Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, and Daniel Soudry. Neural gradients are near-lognormal: Improved quantized and sparse training.arXiv preprint arXiv:2006.08173, 2020

  77. [77]

    Liam Hodgkinson and Michael W. Mahoney. Multiplicative noise and heavy tails in stochastic optimization. InProceedings of the 38th International Conference on Machine Learning (ICML), volume 139 ofProceedings of Machine Learning Research, pages 4262–4272. PMLR,

  78. [78]

    URLhttps://proceedings.mlr.press/v139/hodgkinson21a.html

  79. [79]

    Guiding a diffusion model with a bad version of itself.Advances in Neural Information Processing Systems, 37:52996–53021, 2024

    Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself.Advances in Neural Information Processing Systems, 37:52996–53021, 2024

  80. [80]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 21 Appendix A FLOPs estimation For dense models, we use the standard training-FLOPs estimateF= 6N D. For recurrent models, we account separately for the forward and backward recurrent unrolls. We count2N Dfor forward computation and4N Dfor backward comp...