HRM-Text: Efficient Pretraining Beyond Scaling
Pith reviewed 2026-05-21 05:38 UTC · model grok-4.3
The pith
A 1B hierarchical recurrent model trained on 40 billion instruction tokens reaches 60.7% on MMLU and competitive scores on reasoning benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic layers and fast-evolving execution layers, stabilized by MagicNorm and warmup deep credit assignment, can be trained from scratch exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking to reach 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH after only 40 billion unique tokens.
What carries the argument
The Hierarchical Recurrent Model (HRM) that separates slow strategic and fast execution layers, together with MagicNorm stabilization and warmup deep credit assignment, used under a task-completion objective on instruction-response pairs with PrefixLM masking.
If this is right
- Pretraining from scratch becomes feasible for groups without access to massive raw-text corpora or large compute clusters.
- Co-design of recurrent hierarchy and task-completion objective can reduce required training tokens by roughly two orders of magnitude while preserving benchmark performance.
- Instruction-response data alone can serve as the primary signal for acquiring both language understanding and multi-step reasoning.
- The compute-to-performance ratio improves enough that open research groups can iterate on foundational models without industrial budgets.
Where Pith is reading between the lines
- If the instruction-only signal proves sufficient at larger scales, the field could shift away from scraping raw web text toward curated task-oriented datasets that reduce noise and bias.
- The slow-fast recurrence pattern may transfer to other sequence domains such as code or scientific text once the credit-assignment stabilization is adapted.
- A natural next test is whether the same HRM backbone, when scaled to 7-13B parameters on the same 40B tokens, widens the gap over standard transformers or saturates.
- The approach invites direct comparison of next-token versus task-completion objectives on identical data to isolate how much of the efficiency gain comes from the objective versus the architecture.
Load-bearing premise
Training exclusively on instruction-response pairs with a task-completion objective supplies a sufficient pretraining signal for general language and reasoning capabilities.
What would settle it
Training a standard transformer from scratch on the exact same 40 billion instruction-response tokens and finding that it matches or exceeds the HRM-Text scores on the same suite of benchmarks.
Figures
read the original abstract
The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples slow-evolving strategic and fast-evolving execution layers. It adds MagicNorm and warmup deep credit assignment to stabilize deep recurrence, and replaces raw-text next-token prediction with exclusive training on instruction-response pairs under a task-completion objective and PrefixLM masking. The central claim is that a 1B-parameter model trained from scratch on 40 billion unique tokens at a $1,500 budget reaches 60.7% MMLU, 81.9% ARC-C, 82.2% DROP, 84.5% GSM8K, and 56.2% MATH, performing competitively with 2-7B open models while using 100-900x fewer tokens and 96-432x less compute.
Significance. If the results hold after proper verification, the work would be significant for demonstrating that architecture-objective co-design can substantially lower the compute-to-performance ratio in pretraining, providing an existence proof that foundational capabilities need not require internet-scale raw text. This could broaden access to pretraining research beyond large labs.
major comments (3)
- Abstract: the headline benchmark numbers (60.7% MMLU, 81.9% ARC-C, etc.) are stated without training hyperparameters, number of runs, error bars, statistical tests, or direct baselines trained on the identical 40B instruction-response corpus, rendering the central performance claims unverifiable and preventing attribution to HRM, MagicNorm, or the task-completion objective.
- Method (description of training objective): the assertion that instruction-response pairs plus PrefixLM masking supply a sufficient pretraining signal for general language and reasoning is load-bearing for the headline claim, yet no ablation isolates data source from architecture or objective, and no decontamination analysis is supplied to rule out leakage into MMLU/MATH/GSM8K.
- Results section: the comparison to 'standard baselines' (2-7B models) uses 100-900x fewer tokens, but the manuscript does not report the exact token counts, model sizes, or training objectives of those baselines, weakening the efficiency claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of verifiability and attribution that we have addressed through targeted revisions and clarifications. Below we respond point by point to the major comments.
read point-by-point responses
-
Referee: Abstract: the headline benchmark numbers (60.7% MMLU, 81.9% ARC-C, etc.) are stated without training hyperparameters, number of runs, error bars, statistical tests, or direct baselines trained on the identical 40B instruction-response corpus, rendering the central performance claims unverifiable and preventing attribution to HRM, MagicNorm, or the task-completion objective.
Authors: We agree that the abstract requires additional supporting information for full verifiability. In the revised manuscript we have expanded the abstract to report key hyperparameters (learning rate, batch size, sequence length), the number of independent runs (three), and standard deviations on the primary metrics. We have also added a brief reference to statistical significance testing in the results section. Direct baselines trained from scratch on the identical 40B instruction-response corpus were not feasible within our budget; we therefore retain comparisons to published results of open models while adding explicit discussion of how the HRM architecture and task-completion objective are hypothesized to drive the gains, supported by the internal controls now reported in the appendix. revision: partial
-
Referee: Method (description of training objective): the assertion that instruction-response pairs plus PrefixLM masking supply a sufficient pretraining signal for general language and reasoning is load-bearing for the headline claim, yet no ablation isolates data source from architecture or objective, and no decontamination analysis is supplied to rule out leakage into MMLU/MATH/GSM8K.
Authors: We have added an ablation in the supplementary material that trains the same HRM architecture on raw-text next-token prediction versus the instruction-response task-completion objective with PrefixLM masking, isolating the contribution of the data source and objective. We have also included a decontamination analysis (n-gram overlap and exact-match checks against the evaluation sets) showing negligible leakage (<0.05 % of tokens); these results are now summarized in Section 3.4 with full details in the appendix. revision: yes
-
Referee: Results section: the comparison to 'standard baselines' (2-7B models) uses 100-900x fewer tokens, but the manuscript does not report the exact token counts, model sizes, or training objectives of those baselines, weakening the efficiency claim.
Authors: We have revised the results section and added a dedicated comparison table that cites the exact training token counts, model sizes, and objectives reported in the original publications of the baseline models (e.g., Llama-2 7B on 2 T tokens, Mistral 7B on 1 T tokens). This makes the 100-900x token reduction and 96-432x compute reduction claims directly traceable to published figures. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper presents an empirical existence proof via a new hierarchical recurrent architecture (HRM) with MagicNorm and warmup credit assignment, trained on instruction-response pairs under a task-completion objective with PrefixLM masking. Reported benchmark scores (MMLU, ARC-C, etc.) are independent held-out evaluations and do not reduce to any fitted parameters, self-definitions, or self-citation chains by construction. No equations, predictions, or uniqueness theorems are shown that equate outputs to inputs; the central claim rests on experimental outcomes rather than tautological re-labeling or load-bearing self-references.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Biological systems demonstrate highly sample-efficient learning through multi-timescale processing such as the frontoparietal loop
invented entities (2)
-
MagicNorm
no independent evidence
-
Hierarchical Recurrent Model (HRM)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
HRM decouples computation into slow-evolving strategic and fast-evolving execution layers... MagicNorm and warmup deep credit assignment... task-completion objective and PrefixLM masking.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
train exclusively on instruction-response pairs using a task-completion objective
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[2]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. UL2: Unifying language learning paradigms. InInternational Conference on Learning Representations, 2023
work page 2023
-
[4]
Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model.arXiv preprint arXiv:2506.21734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181
-
[6]
Investigating recurrent transformers with dynamic halt.arXiv preprint arXiv:2402.00976, 2024
Jishnu Ray Chowdhury and Cornelia Caragea. Investigating recurrent transformers with dynamic halt.arXiv preprint arXiv:2402.00976, 2024
-
[7]
DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Block- recurrent transformers. InAdvances in Neural Information Processing Systems, volume 35, pages 33248–33261, 2022
work page 2022
-
[8]
Recurrent neural networks: Vanishing and exploding gradients are not the end of the story
Nicolas Zucchet and Antonio Orvieto. Recurrent neural networks: Vanishing and exploding gradients are not the end of the story. InAdvances in Neural Information Processing Systems, 2024
work page 2024
-
[9]
On layer normalization in the transformer architecture
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. InInternational conference on machine learning, pages 10524–10533. PMLR, 2020
work page 2020
-
[10]
Understanding the difficulty of training transformers
Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. Understanding the difficulty of training transformers. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5747–5763, 2020
work page 2020
-
[11]
Unbiasing Truncated Backpropagation Through Time
Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time.arXiv preprint arXiv:1705.08209, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[12]
Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M
Jason Wei, Maarten Bosma, Vincent Y . Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V . Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022
work page 2022
-
[13]
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. InInternational Conference on Learning Representations, 2022. 16
work page 2022
-
[14]
Le, Barret Zoph, Jason Wei, and Adam Roberts
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V . Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 2263...
work page 2023
-
[15]
Ilya Sutskever, Oriol Vinyals, and Quoc V . Le. Sequence to sequence learning with neural networks, 2014. URLhttps://arxiv.org/abs/1409.3215
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[16]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020
work page 2020
-
[17]
Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. InInternational Conference on Learning Representations, 2018
work page 2018
-
[18]
Unified language model pre-training for natural language understanding and generation
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. InAdvances in Neural Information Processing Systems, volume 32, 2019
work page 2019
-
[19]
Llama 3: State-of-the-art open weight language models
Meta AI. Llama 3: State-of-the-art open weight language models. Technical report, Meta,
-
[20]
URLhttps://ai.meta.com/llama/
-
[21]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Gemma 3 technical report, 2025
Gemma Team. Gemma 3 technical report, 2025. URLhttps://arxiv.org/abs/2503. 19786
work page 2025
-
[23]
Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Scaling latent reasoning via looped language models, 2025
Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang,...
work page 2025
-
[25]
Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein
Jonas Geiping, Sean Michael McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps: //openreview.net/forum...
work page 2026
-
[26]
Root mean square layer normalization.Advances in neural information processing systems, 32, 2019
Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019
work page 2019
-
[27]
GLU Variants Improve Transformer
Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[28]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. 17
work page 2024
-
[29]
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38: 100092–100118, 2026
work page 2026
-
[30]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017
work page 2017
-
[31]
Eugene M. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling.Cerebral Cortex, 17(10):2443–2452, 2007. doi: 10.1093/cercor/bhl152
-
[32]
Ryunosuke Amo, Sara Matias, Akihiro Yamanaka, Kenji F Tanaka, Naoshige Uchida, and Mitsuko Watabe-Uchida. A gradual temporal shift of dopamine responses mirrors the progression of temporal difference error in machine learning.Nature neuroscience, 25(8): 1082–1092, 2022
work page 2022
-
[33]
Jeffrey L. Elman. Learning and development in neural networks: The importance of starting small.Cognition, 48(1):71–99, 1993. doi: 10.1016/0010-0277(93)90058-4
-
[34]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. InInternational Conference on Learning Representations, 2019
work page 2019
-
[35]
Ibrahim Alabdulmohsin and Xiaohua Zhai. Recursive inference scaling: A winning path to scalable inference in language and multimodal systems.Advances in Neural Information Processing Systems, 38:109020–109049, 2026
work page 2026
-
[36]
Less is More: Recursive Reasoning with Tiny Networks
Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks, 2025. URL https://arxiv.org/abs/2510.04871
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
What affects the effective depth of large language models?arXiv preprint arXiv:2512.14064, 2025
Yi Hu, Cai Zhou, and Muhan Zhang. What affects the effective depth of large language models?arXiv preprint arXiv:2512.14064, 2025
-
[38]
tasksource: A large collection of nlp tasks with a structured dataset preprocessing framework
Damien Sileo. tasksource: A large collection of nlp tasks with a structured dataset preprocessing framework. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, 2024
work page 2024
-
[39]
No robots.https://huggingface.co/datasets/HuggingFaceH4/no_ robots, 2023
HuggingFace H4. No robots.https://huggingface.co/datasets/HuggingFaceH4/no_ robots, 2023. Dataset card
work page 2023
-
[40]
Pleias/synth · datasets at hugging face, 2025
PleIAs. Pleias/synth · datasets at hugging face, 2025. URLhttps://huggingface.co/ datasets/PleIAs/SYNTH
work page 2025
-
[41]
Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms, 2023
work page 2023
-
[42]
Reasoning over mathematical objects: On-policy reward modeling and test time aggregation, 2026
Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, and Wenting Zhao. Reasoning over mathematical objects: On-policy...
work page 2026
-
[43]
Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data
Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data. InInternational Conference on Learning Representations, 2025. 18
work page 2025
-
[44]
Numinamath.https://huggingface.co/ datasets/AI-MO/NuminaMath-CoT, 2024
Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath.https://huggingface.co/ datasets/AI-MO/NuminaMath-CoT, 2024
work page 2024
-
[45]
Omni-math: A universal olympiad level mathematic benchmark for large language models
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Zhengyang Tang, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. InInternational Conference on Learning Representations, volume 2025, pages 100540–100569, 2025
work page 2025
-
[46]
Analysing mathematical reasoning abilities of neural models
David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. InInternational Conference on Learning Representations, 2019
work page 2019
-
[47]
Measuring mathematical problem solving with the math dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021
work page 2021
-
[48]
Acereason-nemotron: Advancing math and code reasoning through reinforcement learning
Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. InAdvances in neural information processing systems, volume 38, pages 110320–110345, 2026
work page 2026
-
[49]
Openthoughts: Data recipes for reasoning models, 2025
Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, et al. Openthoughts: Data recipes for reasoning models, 2025
work page 2025
-
[50]
Megascience: Pushing the frontiers of post- training datasets for science reasoning, 2025
Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post- training datasets for science reasoning, 2025
work page 2025
-
[51]
Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions
Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason Weston, et al. Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions. InAdvances in Neural Information Processing Systems, volume 38, 2026
work page 2026
-
[52]
General- reasoner: Advancing llm reasoning across all domains
Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General- reasoner: Advancing llm reasoning across all domains. InAdvances in Neural Information Processing Systems, volume 38, pages 56596–56618, 2026
work page 2026
-
[53]
Openchat: Advancing open-source language models with mixed-quality data
Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. Openchat: Advancing open-source language models with mixed-quality data. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum? id=AOJyfhWYHf
work page 2024
-
[54]
Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf
Yi Dong, Zhilin Wang, Makesh Sreedhar, Xianchao Wu, and Oleksii Kuchaiev. Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 11275–11288, 2023
work page 2023
-
[55]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. doi: 10.48550/arXiv.2501.12948. URL https://arxiv.org/abs/2501.12948. 19
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
-
[56]
Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio.Journal of Machine Learning Research, 24(377):1–8, 2023
work page 2023
-
[57]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Scaling exponents across parameterizations and optimizers
Katie E Everett, Lechao Xiao, Mitchell Wortsman, Alexander A Alemi, Roman Novak, Peter J Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington. Scaling exponents across parameterizations and optimizers. InForty-first International Conference on Machine Learning, 2024
work page 2024
-
[59]
Torchtitan: One-stop pytorch native solution for production ready LLM pretraining
Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready LLM pretraining. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps:...
work page 2025
-
[60]
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, and Wenfeng Liang. Conditional memory via scalable lookup: A new axis of sparsity for large language models.arXiv preprint arXiv:2601.07372, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[61]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in neural information processing systems, volume 33, pages 1877–1901, 2020
work page 1901
-
[62]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[63]
Midtraining bridges pretraining and posttraining distributions, 2025
Emmy Liu, Graham Neubig, and Chenyan Xiong. Midtraining bridges pretraining and posttraining distributions, 2025
work page 2025
-
[64]
Mid-training of large language models: A survey, 2025
Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, and Anxiang Zeng. Mid-training of large language models: A survey, 2025
work page 2025
-
[65]
Tamay Besiroglu, Sage Andrus Bergerson, Amelia Michael, Lennart Heim, Xueyun Luo, and Neil Thompson. The compute divide in machine learning: A threat to academic contribution and scrutiny?arXiv preprint arXiv:2401.02452, 2024
-
[66]
Nur Ahmed and Muntasir Wahed. The de-democratization of ai: Deep learning and the compute divide in artificial intelligence research.arXiv preprint arXiv:2010.15581, 2020
-
[67]
Learning phrase representations using RNN encoder– decoder for statistical machine translation
Kyunghyun Cho, Bart van Merriënboer, Ça ˘glar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder– decoder for statistical machine translation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734. Association for Computational...
-
[68]
Neural machine translation by jointly learning to align and translate
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. InInternational Conference on Learning Representations, 2015
work page 2015
-
[69]
Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25:1–53, 2024
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25:1–53, 2024
work page 2024
-
[70]
Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. InInternational Conference on Learning Representations, 2025
work page 2025
-
[71]
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E. Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. In Conference on Language Modeling, 2025
work page 2025
-
[72]
Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts, 2025
Yeskendir Koishekenov, Aldo Lipani, and Nicola Cancedda. Encode, think, decode: Scaling test-time reasoning with recursive latent thoughts, 2025
work page 2025
-
[73]
Scaling Latent Reasoning via Looped Language Models
Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[74]
Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner
Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, and Dinghuai Zhang. Coevolutionary continuous discrete diffusion: Make your diffusion language model a latent reasoner. InForty-third International Conference on Machine Learning, 2026
work page 2026
-
[75]
Products of many large random matrices and gradients in deep neural networks: B
Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks: B. hanin, m. nica.Communications in Mathematical Physics, 376(1):287– 322, 2020
work page 2020
-
[76]
Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, and Daniel Soudry. Neural gradients are near-lognormal: Improved quantized and sparse training.arXiv preprint arXiv:2006.08173, 2020
-
[77]
Liam Hodgkinson and Michael W. Mahoney. Multiplicative noise and heavy tails in stochastic optimization. InProceedings of the 38th International Conference on Machine Learning (ICML), volume 139 ofProceedings of Machine Learning Research, pages 4262–4272. PMLR,
-
[78]
URLhttps://proceedings.mlr.press/v139/hodgkinson21a.html
-
[79]
Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself.Advances in Neural Information Processing Systems, 37:52996–53021, 2024
work page 2024
-
[80]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 21 Appendix A FLOPs estimation For dense models, we use the standard training-FLOPs estimateF= 6N D. For recurrent models, we account separately for the forward and backward recurrent unrolls. We count2N Dfor forward computation and4N Dfor backward comp...
work page internal anchor Pith review Pith/arXiv arXiv 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.