Scaling Latent Reasoning via Looped Language Models
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-15 07:37 UTC · model grok-4.3
The pith
Looped language models with 1.4B and 2.6B parameters match the performance of models up to 12B by reasoning iteratively in latent space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Looped language models with iterative latent-space computation and an entropy-regularized objective build reasoning directly into pre-training. Scaled to 7.7T tokens, the 1.4B and 2.6B Ouro models match SOTA LLMs of up to 12B parameters across a wide range of benchmarks. This advantage comes from superior knowledge manipulation rather than increased capacity, and the models yield reasoning traces that are more aligned with their final outputs than explicit CoT.
What carries the argument
Looped Language Models performing iterative computation in latent space with entropy regularization for depth allocation.
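A minimal sketch of this mechanism, assuming a single parameter-shared transformer block applied repeatedly to the same hidden state and a small halting head that learns a distribution over loop depths. The class and member names (LoopedLM, shared_block, halt_head) and all hyperparameters are illustrative placeholders, not the Ouro architecture.

```python
import torch
import torch.nn as nn

class LoopedLM(nn.Module):
    """Illustrative looped decoder: one shared block reused for several latent iterations."""

    def __init__(self, vocab=32000, d_model=512, n_heads=8, max_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        # A single parameter-shared block stands in for the looped stack
        # (causal masking omitted for brevity in this sketch).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.halt_head = nn.Linear(d_model, 1)    # one halting logit per depth
        self.lm_head = nn.Linear(d_model, vocab)  # prediction head shared across depths
        self.max_loops = max_loops

    def forward(self, tokens):
        h = self.embed(tokens)                    # (batch, seq, d_model)
        halt_logits, per_depth_logits = [], []
        for _ in range(self.max_loops):
            h = self.shared_block(h)              # iterate in latent space with the same weights
            halt_logits.append(self.halt_head(h).mean(dim=(1, 2)))   # (batch,)
            per_depth_logits.append(self.lm_head(h))                 # prediction if exiting here
        # q: learned distribution over exit depths, one per sequence.
        q = torch.softmax(torch.stack(halt_logits, dim=-1), dim=-1)  # (batch, max_loops)
        return q, per_depth_logits
```

Training would then weight the per-depth prediction losses by the learned exit distribution and add an entropy term on that distribution, so easy inputs can exit after few loops while harder ones use more; the paper's exact objective may differ.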
If this is right
- Smaller models achieve high performance through better manipulation of knowledge instead of larger capacity.
- Reasoning is integrated into pre-training rather than added later via prompting or fine-tuning.
- Internal reasoning traces align better with final answers than those from explicit chain-of-thought.
- This offers a new scaling direction focused on latent iterative computation.
- The entropy objective allows models to allocate more depth to complex problems automatically.
Where Pith is reading between the lines
- If the latent reasoning scales well, it could lead to models that handle longer or more complex reasoning chains without explicit guidance.
- This method might reduce reliance on large post-training datasets for reasoning skills.
- Combining looped pre-training with other efficiency techniques could further improve model performance per parameter.
Load-bearing premise
The performance improvements are caused by the latent iterative computation and the entropy-regularized objective, not by differences in training data or other training details.
What would settle it
A direct comparison in which a standard non-looped model is trained on the exact same 7.7T tokens with matched optimization and initialization but without the looping mechanism, to see whether it reaches the Ouro benchmark scores.
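A sketch of that controlled comparison, written as a hypothetical harness in which the training and evaluation functions are passed in as placeholders; everything except the looping toggle is held fixed across the two runs.

```python
# Hypothetical controlled comparison: the only difference between the two runs
# is whether the looping mechanism is enabled. `train_model` and `run_benchmarks`
# are placeholders for an actual training and evaluation stack.
def run_controlled_comparison(train_model, run_benchmarks, corpus, seed=1234):
    shared = dict(data=corpus,       # same 7.7T-token corpus, same ordering
                  optimizer="AdamW",
                  lr=3e-4,
                  seed=seed)         # identical initialization and data shuffling
    baseline = train_model(arch="standard_transformer", params="1.4B", **shared)
    looped = train_model(arch="looped_transformer", params="1.4B", loops=4, **shared)
    # Same benchmark suite for both models; a genuine looping advantage would
    # show up as a gap here with everything else matched.
    return {"baseline": run_benchmarks(baseline), "looped": run_benchmarks(looped)}
```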
Original abstract
Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
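The abstract names an entropy-regularized objective for learned depth allocation but does not spell it out here. A plausible form, stated as an assumption rather than the paper's exact loss, weights the per-depth language-modeling losses by a learned exit distribution q over loop counts and rewards the entropy of that distribution:

```latex
% Assumed (illustrative) entropy-regularized depth-allocation objective.
% q_r(x) is the learned probability of exiting after r loops,
% L_LM^(r) is the language-modeling loss of the prediction made at depth r,
% and lambda trades off the entropy bonus.
\mathcal{L}(\theta)
  = \sum_{r=1}^{R} q_r(x;\theta)\, \mathcal{L}_{\mathrm{LM}}^{(r)}(x;\theta)
  \;-\; \lambda\, \mathcal{H}\!\left(q(x;\theta)\right),
\qquad
\mathcal{H}(q) = -\sum_{r=1}^{R} q_r \log q_r .
```

Under an objective of this shape, the entropy term keeps the exit distribution from collapsing early in training, and inputs that still benefit from extra iterations shift probability mass toward deeper exits.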
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Ouro, a family of pre-trained Looped Language Models (LoopLM) that incorporate iterative computation in latent space and an entropy-regularized objective for learned depth allocation during pre-training on 7.7T tokens. It claims that the 1.4B and 2.6B models achieve performance matching up to 12B SOTA LLMs on a wide range of benchmarks, with the advantage attributed to superior knowledge manipulation capabilities rather than increased knowledge capacity, as shown through controlled experiments. Additionally, LoopLM yields reasoning traces more aligned with final outputs than explicit CoT, and the models are open-sourced.
Significance. If the central claims hold under rigorous scrutiny, this represents a promising new direction for scaling reasoning in language models by embedding iterative latent computation into pre-training. The open release of models trained at this scale is a positive contribution that could facilitate further research into latent reasoning mechanisms.
major comments (1)
- [Abstract] The key claim that the performance advantage 'stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities' is load-bearing, but it rests on controlled experiments whose details are not provided in the manuscript. Specifically, it is unclear whether the baseline models were trained on the same 7.7T tokens with identical data, optimization, and initialization, which is necessary to isolate the effect of the LoopLM architecture.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate clarifications to strengthen the presentation of our controlled experiments.
Point-by-point responses
- Referee: The key claim that the performance advantage 'stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities' is load-bearing but rests on controlled experiments whose details are not provided in the manuscript. Specifically, it is unclear if the baseline models were trained on the same 7.7T tokens with identical data, optimization, and initialization, which is necessary to isolate the effect of the LoopLM architecture.
  Authors: We agree that explicit details on the controlled experiments are essential to substantiate the claim. The manuscript describes these experiments in Section 4.2, where the baseline transformer models were trained from scratch on the exact same 7.7T-token corpus, using identical data ordering, optimizer hyperparameters, and random initialization as the Ouro models. To address the referee's concern directly, we will revise the abstract to include a concise statement on the matched training setup and expand Section 4.2 with a dedicated paragraph enumerating the shared data, optimization, and initialization protocols. This will make the isolation of the LoopLM architecture's effect on knowledge manipulation fully transparent. Revision: yes.
Circularity Check
No significant circularity; claims rest on independent empirical benchmarks
Full rationale
The paper presents an empirical architecture (LoopLM with latent iteration and entropy-regularized depth allocation) trained on 7.7T tokens, then reports benchmark results against external SOTA models up to 12B parameters. No derivation chain, equations, or self-citations are shown that would reduce the claimed performance gains or 'superior manipulation' to fitted parameters or self-referential definitions by construction. Controlled experiments are invoked to isolate the mechanism from data/optimization confounds, but the provided text contains no mathematical reduction or load-bearing self-citation that collapses the central claim. Results remain falsifiable against external benchmarks and do not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard transformer language model assumptions hold for the looped variant
invented entities (1)
- Looped Language Model (LoopLM) with latent iteration (no independent evidence)
Lean theorems connected to this paper
- Foundation/EightTick: eight_tick_forces_D3 (echoes?)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs"
- Foundation/DiscretenessForcing: discreteness_forcing_principle (echoes?)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities"
- Foundation/LogicAsFunctionalEquation: RCL_is_unique_functional_form_of_logic (echoes?)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "LoopLM yields reasoning traces more aligned with final outputs than explicit CoT"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
  LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
- Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
  Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models
  Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.
- SMolLM: Small Language Models Learn Small Molecular Grammar
  A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
- LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
  LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
- A Mechanistic Analysis of Looped Reasoning Language Models
  Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.
- N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
  N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
- Sparse Layers are Critical to Scaling Looped Language Models
  Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
- Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
  MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
- The Power of Power Law: Asymmetry Enables Compositional Reasoning
  Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distr...
- Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
  Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
- LEPO: Latent Reasoning Policy Optimization for Large Language Models
  LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.
- LASER: Low-Rank Activation SVD for Efficient Recursion
  LASER tracks low-rank activation subspaces in recursive models via matrix-free SVD updates and fidelity resets to save 60% memory without accuracy loss.
- Parcae: Scaling Laws For Stable Looped Language Models
  Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
- Relational Preference Encoding in Looped Transformer Internal States
  Looped transformer hidden states encode preferences relationally via pairwise differences rather than independent pointwise classification, with the evaluator acting as an internal consistency probe on the model's own...
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
  Recurrent-depth transformers achieve systematic generalization and depth extrapolation on implicit reasoning tasks through iterative layer reuse, a three-stage grokking process, and inference-time scaling, while vanil...
- Hyperloop Transformers
  Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
- LEPO: Latent Reasoning Policy Optimization for Large Language Models
  LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.
- NeuroAI and Beyond: Bridging Between Advances in Neuroscience and Artificial Intelligence
  Workshop report identifies AI gaps in physical interaction, brittle learning, and energy inefficiency, then proposes neuroscience principles and a research roadmap for NeuroAI.
Reference graph
Works this paper leans on
- [1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [2] Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2:3, 2024.
- [3] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [4] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
- [5] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024.
- [6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [7] Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416, 2025.
- [8] Khashayar Gatmiry, Nikunj Saunshi, Sashank J Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? arXiv preprint arXiv:2410.08292, 2024.
- [9] Khashayar Gatmiry, Nikunj Saunshi, Sashank J Reddi, Stefanie Jegelka, and Sanjiv Kumar. On the role of depth and looping for in-context learning with task diversity. arXiv preprint arXiv:2410.21698, 2024.
- [10] Jianhao Huang, Zixuan Wang, and Jason D Lee. Transformers learn to implement multi-step gradient descent with chain of thought. arXiv preprint arXiv:2502.21212, 2025.
- [11] William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961, 2025.
- [12] William Merrill and Ashish Sabharwal. Exact expressive power of transformers with padding. arXiv preprint arXiv:2505.18948, 2025.
- [13] Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023.
- [14] Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023.
- [15] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
- [16] Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint arXiv:2410.20672, 2024.
- [17] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
- [18] Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, and Zhouhan Lin. Pretraining language models to ponder in continuous space. arXiv preprint arXiv:2505.20674, 2025.
- [19] Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, et al. A survey on latent reasoning. arXiv preprint arXiv:2507.06203, 2025.
- [20] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations.
- [21] Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking. arXiv preprint arXiv:2502.13842, 2025.
- [22] Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025.
- [23] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
- [24] Raj Dabre and Atsushi Fujita. Recurrent stacking of layers for compact neural machine translation models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6292–6299, 2019.
- [25] Sho Takase and Shun Kiyono. Lessons on parameter sharing across layers in transformers. arXiv preprint arXiv:2104.06022, 2021.
- [26] Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, et al. Megrez2 technical report. arXiv preprint arXiv:2507.17728, 2025.
- [27] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.
- [28] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: More tokens with attention make up for less depth. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023), 2023.
- [29] Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, and Xun Zhou. Efficient pretraining length scaling. arXiv preprint arXiv:2504.14992, 2025.
- [30] Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of language models: Part 2.1, grade-school math and the hidden reasoning process. arXiv preprint arXiv:2407.20311, 2024.
- [31] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025.
- [32] Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. arXiv preprint arXiv:2107.05407, 2021.
- [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [34] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023.
- [35] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- [36] Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. SmolLM2: When smol goes big -- data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025.
- [37] Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024.
- [38] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.
- [39] Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. Advances in Neural Information Processing Systems, 37:14200–14282, 2024.
- [40] Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. arXiv preprint arXiv:2412.02595, 2024.
- [41] Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, et al. Ultra-fineweb: Efficient data filtering and verification for high-quality llm training data. arXiv preprint arXiv:2505.05427, 2025.
- [42] Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, et al. Chinese tiny llm: Pretraining a chinese-centric large language model. arXiv preprint arXiv:2404.04167, 2024.
- [43] Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, and Wei Chu. Opencoder: The open cookbook for top-tier code large language models. 2024.
- [44]
- [45] Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset. 2025.
- [46] NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal,...
- [47] Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660, 2024.
- [48] Yu Zhang and Songlin Yang. Flame: Flash language modeling made easy, January 2025.
- [49] Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready LLM pretraining. In The Thirteenth International Conference on Learning Representations, 2025.
- [50] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025.
- [51] Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy. arXiv preprint arXiv:2506.13284, 2025.
- [52] Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025.
- [53] Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949, 2025.
- [54] Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Wei Ye, Tong Yang, Wenhao Huang, et al. Reverse-engineered reasoning for open-ended generation. arXiv preprint arXiv:2509.06160, 2025.
- [55] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372, 2024.
- [56] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [57] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [58] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, 07 2024.
- [59] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [60] HuggingFaceH4. Aime 2024. https://huggingface.co/datasets/HuggingFaceH4/aime_2024, 2024. 30 problems from AIME I & II 2024.
- [61] Chaoqun He, Renjie Luo, Yuzhuo Bai, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.
- [62] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
- [63] M-A-P Team, Xinrun Du, Yifan Yao, et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025.
- [64] ByteDance-Seed. Beyondaime. https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME, 2025. CC0-1.0 license.
- [65] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025.
- [66] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, M...
- [67] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. In Proceedings of the 13th International Conference on Learning Representations, ICLR ’25, April 2025. Full version available at https://ssrn.com/abstract=5250617.
- [68] Zeyuan Allen-Zhu. Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers. SSRN Electronic Journal, May 2025. https://ssrn.com/abstract=5240330.
- [69] Yuekun Yao, Yupei Du, Dawei Zhu, Michael Hahn, and Alexander Koller. Language models can learn implicit multi-hop reasoning, but only if they have lots of training data. arXiv preprint arXiv:2505.17923, 2025.
- [70] Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint arXiv:2505.12514, 2025.
- [71] Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory. arXiv preprint arXiv:2505.19488, 2025.
- [72] Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. In Proceedings of the 41st International Conference on Machine Learning, pages 61593–61613, 2024.
- [73] Kyle Cox. Post-hoc reasoning in chain of thought, December 2024. Blog post.
- [74] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.
- [75] Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nicolas Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, Adel Bibi, Robert Trager, Damiano Fornasiere, John Yan, Yanai Elazar, and Yoshua Bengio. Chain-of-thought is not explainability. 2025.
- [76] Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksand...
- [77] Quora. Quora question pairs. https://www.kaggle.com/competitions/quora-question-pairs/, 2017. Kaggle competition.
- [78] Clayton Sanford, Bahare Fatemi, Ethan Hall, Anton Tsitsulin, Mehran Kazemi, Jonathan Halcrow, Bryan Perozzi, and Vahab Mirrokni. Understanding transformer reasoning capabilities via graph algorithms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, p...
- [79] Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Transformers, parallel computation, and logarithmic depth. arXiv preprint arXiv:2402.09268, 2024.
- [80] Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022.