pith. machine review for the scientific record.

arxiv: 2510.25741 · v4 · submitted 2025-10-29 · 💻 cs.CL

Recognition: 3 theorem links


Scaling Latent Reasoning via Looped Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords looped language models · latent reasoning · iterative computation · entropy regularization · knowledge manipulation · pre-training · reasoning alignment

The pith

Looped language models at 1.4B and 2.6B parameters match the performance of models up to 12B by reasoning iteratively in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ouro, a family of looped language models that build iterative computation in latent space into the pre-training phase, using an entropy-regularized objective that lets the model learn how much computational depth to allocate. The resulting 1.4B and 2.6B models perform on par with much larger state-of-the-art models across benchmarks; the authors attribute the gains to enhanced knowledge manipulation rather than greater knowledge capacity. The looped approach also produces reasoning traces that align more closely with the final model outputs than traditional explicit chain-of-thought, pointing to a potential new direction for scaling reasoning in language models.

Core claim

By using looped language models with iterative latent space computation and an entropy-regularized objective, pre-training can directly build in reasoning capabilities. Scaled to 7.7T tokens, the 1.4B and 2.6B Ouro models match the performance of up to 12B SOTA LLMs on a wide range of benchmarks. This advantage comes from superior knowledge manipulation rather than increased capacity, and the models yield reasoning traces more aligned with outputs than explicit CoT.

What carries the argument

Looped Language Models performing iterative computation in latent space with entropy regularization for depth allocation.

If this is right

  • Smaller models achieve high performance through better manipulation of knowledge instead of larger capacity.
  • Reasoning is integrated into pre-training rather than added later via prompting or fine-tuning.
  • Internal reasoning traces align better with final answers than those from explicit chain-of-thought.
  • This offers a new scaling direction focused on latent iterative computation.
  • The entropy objective allows models to allocate more depth to complex problems automatically.
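The review does not reproduce Ouro's actual objective, so the following is only a minimal numerical sketch of the general pattern an entropy-regularized depth allocation can take: a shared block looped in latent space, a halting head producing a distribution over exit depths, and a loss mixing expected prediction cost with an entropy bonus. Every name here, and the coefficient `lam`, is an illustrative assumption, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy looped "model": one small shared block reapplied in latent space,
# so extra depth costs compute but zero extra parameters.
d, max_loops = 8, 4
W_block = rng.normal(scale=0.3, size=(d, d))  # shared (tied) loop weights
w_halt = rng.normal(size=d)                   # hypothetical halting head
W_out = rng.normal(size=(d, 3))               # readout to 3 classes

def forward(h):
    """Apply the shared block max_loops times, scoring each depth for exit."""
    states, halt_logits = [], []
    for _ in range(max_loops):
        h = np.tanh(W_block @ h)              # iterative latent computation
        states.append(h)
        halt_logits.append(w_halt @ h)
    p_exit = softmax(np.array(halt_logits))   # learned allocation over depths
    return states, p_exit

def loss(states, p_exit, target, lam=0.1):
    """Expected cross-entropy over exit depths, minus an entropy bonus
    (lam is an illustrative coefficient, not one taken from the paper)."""
    ce = np.array([-np.log(softmax(W_out.T @ h)[target] + 1e-9) for h in states])
    entropy = -(p_exit * np.log(p_exit + 1e-9)).sum()
    return p_exit @ ce - lam * entropy

h0 = rng.normal(size=d)
states, p_exit = forward(h0)
print(f"exit distribution over depths: {p_exit.round(3)}")
print(f"regularized loss: {loss(states, p_exit, target=1):.4f}")
```

The entropy term penalizes collapsing all probability mass onto one depth, which is one plausible reading of how "learned allocation of computational depth" could be kept adaptive during training.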

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the latent reasoning scales well, it could lead to models that handle longer or more complex reasoning chains without explicit guidance.
  • This method might reduce reliance on large post-training datasets for reasoning skills.
  • Combining looped pre-training with other efficiency techniques could further improve model performance per parameter.

Load-bearing premise

The performance improvements are caused by the latent iterative computation and the entropy-regularized objective, not by differences in training data or other training details.

What would settle it

A direct comparison where a standard non-looped model is trained on the exact same 7.7T tokens with matching optimization but without the looping mechanism, to see if it matches the Ouro benchmark scores.
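As a toy illustration of that matched-ablation design (everything here is hypothetical: the data, the model, and the scores bear no relation to Ouro's actual training), the setup can be pinned to a single seed so that looping is the only factor that varies between the two runs:

```python
import numpy as np

def train_and_eval(looped, seed=0, d=8, n=400, steps=800, lr=0.1):
    """Train a tiny toy classifier where `looped` is the ONLY varied factor.

    Data, ordering, initialization, and the update rule are all pinned to
    the same seed, mirroring the matched-ablation design: any score gap
    is then attributable to looping alone.
    """
    rng = np.random.default_rng(seed)         # identical init + data stream
    W = rng.normal(scale=0.5, size=(d, d))    # frozen latent block
    w = np.zeros(d)                           # trained readout weights
    X = rng.normal(size=(n, d))
    y = (np.linalg.norm(X, axis=1) > np.sqrt(d)).astype(float)

    def encode(x, loops):
        h = x
        for _ in range(loops):                # iterative latent computation
            h = np.tanh(W @ h) + x            # residual keeps the input alive
        return h

    loops = 4 if looped else 1
    for i in range(steps):
        x, t = X[i % n], y[i % n]
        h = encode(x, loops)
        p = 1.0 / (1.0 + np.exp(-(w @ h)))    # logistic readout
        w += lr * (t - p) * h                 # plain SGD on the readout
    return float(np.mean([(w @ encode(x, loops) > 0) == bool(t)
                          for x, t in zip(X, y)]))

acc_looped = train_and_eval(looped=True)
acc_plain = train_and_eval(looped=False)
print(f"looped acc: {acc_looped:.2f}   non-looped acc: {acc_plain:.2f}")
```

The sketch makes no claim about which variant wins on this toy task; the point is the experimental control, which is exactly what the referee asks the paper to document.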

read the original abstract

Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Ouro, a family of pre-trained Looped Language Models (LoopLM) that incorporate iterative computation in latent space and an entropy-regularized objective for learned depth allocation during pre-training on 7.7T tokens. It claims that the 1.4B and 2.6B models achieve performance matching up to 12B SOTA LLMs on a wide range of benchmarks, with the advantage attributed to superior knowledge manipulation capabilities rather than increased knowledge capacity, as shown through controlled experiments. Additionally, LoopLM yields reasoning traces more aligned with final outputs than explicit CoT, and the models are open-sourced.

Significance. If the central claims hold under rigorous scrutiny, this represents a promising new direction for scaling reasoning in language models by embedding iterative latent computation into pre-training. The open release of models trained at this scale is a positive contribution that could facilitate further research into latent reasoning mechanisms.

major comments (1)
  1. [Abstract] The key claim that the performance advantage 'stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities' is load-bearing but rests on controlled experiments whose details are not provided in the manuscript. Specifically, it is unclear whether the baseline models were trained on the same 7.7T tokens with identical data, optimization, and initialization, which is necessary to isolate the effect of the LoopLM architecture.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate clarifications to strengthen the presentation of our controlled experiments.

read point-by-point responses
  1. Referee: The key claim that the performance advantage 'stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities' is load-bearing but rests on controlled experiments whose details are not provided in the manuscript. Specifically, it is unclear if the baseline models were trained on the same 7.7T tokens with identical data, optimization, and initialization, which is necessary to isolate the effect of the LoopLM architecture.

    Authors: We agree that explicit details on the controlled experiments are essential to substantiate the claim. The manuscript describes these experiments in Section 4.2, where the baseline transformer models were trained from scratch on the exact same 7.7T-token corpus, using identical data ordering, optimizer hyperparameters, and random initialization as the Ouro models. To address the referee's concern directly, we will revise the abstract to include a concise statement on the matched training setup and expand Section 4.2 with a dedicated paragraph enumerating the shared data, optimization, and initialization protocols. This will make the isolation of the LoopLM architecture's effect on knowledge manipulation fully transparent.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical benchmarks

full rationale

The paper presents an empirical architecture (LoopLM with latent iteration and entropy-regularized depth allocation) trained on 7.7T tokens, then reports benchmark results against external SOTA models up to 12B parameters. No derivation chain, equations, or self-citations are shown that would reduce the claimed performance gains or 'superior manipulation' to fitted parameters or self-referential definitions by construction. Controlled experiments are invoked to isolate the mechanism from data/optimization confounds, but the provided text contains no mathematical reduction or load-bearing self-citation that collapses the central claim. Results remain falsifiable against external benchmarks and do not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

With only the abstract available, specific free parameters such as the exact entropy coefficient and detailed axioms are not extractable; the work rests on standard transformer assumptions and the effectiveness of the proposed objective.

axioms (1)
  • [standard math] Standard transformer language model assumptions hold for the looped variant
    The architecture extends existing LLM components without new formal proofs.
invented entities (1)
  • Looped Language Model (LoopLM) with latent iteration (no independent evidence)
    purpose: Enable iterative computation in latent space for built-in reasoning
    New model family introduced to embed reasoning into pre-training.

pith-pipeline@v0.9.0 · 5614 in / 1244 out tokens · 48056 ms · 2026-05-15T07:37:35.132217+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/EightTick eight_tick_forces_D3 echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs

  • Foundation/DiscretenessForcing discreteness_forcing_principle echoes

    Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities

  • Foundation/LogicAsFunctionalEquation RCL_is_unique_functional_form_of_logic echoes

    LoopLM yields reasoning traces more aligned with final outputs than explicit CoT

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  2. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  3. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  4. Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.

  5. SMolLM: Small Language Models Learn Small Molecular Grammar

    cs.LG 2026-05 unverdicted novelty 7.0

    A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

  6. LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

    cs.IR 2026-04 unverdicted novelty 7.0

    LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.

  7. A Mechanistic Analysis of Looped Reasoning Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

  8. N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

  9. Sparse Layers are Critical to Scaling Looped Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  10. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  11. The Power of Power Law: Asymmetry Enables Compositional Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    Power-law data sampling creates beneficial asymmetry in the loss landscape that lets models acquire high-frequency skill compositions first, enabling more efficient learning of rare long-tail skills than uniform distr...

  12. Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

  13. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.

  14. LASER: Low-Rank Activation SVD for Efficient Recursion

    cs.LG 2026-04 unverdicted novelty 6.0

    LASER tracks low-rank activation subspaces in recursive models via matrix-free SVD updates and fidelity resets to save 60% memory without accuracy loss.

  15. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  16. Relational Preference Encoding in Looped Transformer Internal States

    cs.LG 2026-04 conditional novelty 6.0

    Looped transformer hidden states encode preferences relationally via pairwise differences rather than independent pointwise classification, with the evaluator acting as an internal consistency probe on the model's own...

  17. Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

    cs.CL 2026-04 unverdicted novelty 6.0

    Recurrent-depth transformers achieve systematic generalization and depth extrapolation on implicit reasoning tasks through iterative layer reuse, a three-stage grokking process, and inference-time scaling, while vanil...

  18. Hyperloop Transformers

    cs.LG 2026-04 unverdicted novelty 5.0

    Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

  19. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 5.0

    LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.

  20. NeuroAI and Beyond: Bridging Between Advances in Neuroscience and Artificial Intelligence

    q-bio.NC 2026-04 unverdicted novelty 3.0

    Workshop report identifies AI gaps in physical interaction, brittle learning, and energy inefficiency, then proposes neuroscience principles and a research roadmap for NeuroAI.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 19 Pith papers · 26 internal anchors

  1. [1]

     Language models are few-shot learners

     Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

  2. [2]

     Qwen2 Technical Report

     Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2:3, 2024.

  3. [3]

     Qwen3 Technical Report

     An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  4. [4]

     Gemma 3 Technical Report

     Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  5. [5]

     The Llama 3 Herd of Models

     Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024.

  6. [6]

     Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

     Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

  7. [7]

     Reasoning with Latent Thoughts: On the Power of Looped Transformers

     Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416, 2025.

  8. [8]

     Can Looped Transformers Learn to Implement Multi-Step Gradient Descent for In-Context Learning?

     Khashayar Gatmiry, Nikunj Saunshi, Sashank J Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? arXiv preprint arXiv:2410.08292, 2024.

  9. [9]

     On the Role of Depth and Looping for In-Context Learning with Task Diversity

     Khashayar Gatmiry, Nikunj Saunshi, Sashank J Reddi, Stefanie Jegelka, and Sanjiv Kumar. On the role of depth and looping for in-context learning with task diversity. arXiv preprint arXiv:2410.21698, 2024.

  10. [10]

     Transformers Learn to Implement Multi-Step Gradient Descent with Chain of Thought

     Jianhao Huang, Zixuan Wang, and Jason D Lee. Transformers learn to implement multi-step gradient descent with chain of thought. arXiv preprint arXiv:2502.21212, 2025.

  11. [11]

    A little depth goes a long way: The expressive power of log-depth transformers

    William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961, 2025

  12. [12]

     Exact Expressive Power of Transformers with Padding

     William Merrill and Ashish Sabharwal. Exact expressive power of transformers with padding. arXiv preprint arXiv:2505.18948, 2025.

  13. [13]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023

  14. [14]

    Looped transformers are better at learning learning algorithms

    Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023

  15. [15]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018

  16. [16]

     Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

     Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint arXiv:2410.20672, 2024.

  17. [17]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.

  18. [18]

     Pretraining Language Models to Ponder in Continuous Space

     Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, and Zhouhan Lin. Pretraining language models to ponder in continuous space. arXiv preprint arXiv:2505.20674, 2025.

  19. [19]

     A Survey on Latent Reasoning

     Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, et al. A survey on latent reasoning. arXiv preprint arXiv:2507.06203, 2025.

  20. [20]

     Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

     Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations.

  21. [21]

     Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking

     Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking. arXiv preprint arXiv:2502.13842, 2025.

  22. [22]

    Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation

    Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025

  23. [23]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

  24. [24]

    Recurrent stacking of layers for compact neural machine translation models

    Raj Dabre and Atsushi Fujita. Recurrent stacking of layers for compact neural machine translation models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6292–6299, 2019.

  25. [25]

     Lessons on Parameter Sharing Across Layers in Transformers

     Sho Takase and Shun Kiyono. Lessons on parameter sharing across layers in transformers. arXiv preprint arXiv:2104.06022, 2021.

  26. [26]

     Megrez2 Technical Report

     Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, et al. Megrez2 technical report. arXiv preprint arXiv:2507.17728, 2025.

  27. [27]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.

  28. [28]

    Cotformer: More tokens with attention make up for less depth

    Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: More tokens with attention make up for less depth. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023), 2023.

  29. [29]

    Efficient pretraining length scaling

    Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, and Xun Zhou. Efficient pretraining length scaling. arXiv preprint arXiv:2504.14992, 2025

  30. [30]

     Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

     Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of language models: Part 2.1, grade-school math and the hidden reasoning process. arXiv preprint arXiv:2407.20311, 2024.

  31. [31]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025.

  32. [32]

    Pondernet: Learning to ponder

    Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. arXiv preprint arXiv:2107.05407, 2021

  33. [33]

     Attention Is All You Need

     Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

  34. [34]

    Roformer: Enhanced transformer with rotary position embedding, 2023

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023

  35. [35]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

  36. [36]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. Smollm2: When smol goes big -- data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025.

  37. [37]

     Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

     Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024.

  38. [38]

     The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

     Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.

  39. [39]

     Datacomp-lm: In search of the next generation of training sets for language models

     Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. Advances in Neural Information Processing Systems, 37:14200–14282, 2024.

  40. [40]

    Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset

    Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. arXiv preprint arXiv:2412.02595, 2024

  41. [41]

    Ultra-fineweb: Efficient data filtering and verification for high-quality llm training data

    Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, et al. Ultra-fineweb: Efficient data filtering and verification for high-quality llm training data. arXiv preprint arXiv:2505.05427, 2025

  42. [42]

     Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

     Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, et al. Chinese tiny llm: Pretraining a chinese-centric large language model. arXiv preprint arXiv:2404.04167, 2024.

  43. [43]

     OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

     Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, and Wei Chu. Opencoder: The open cookbook for top-tier code large language models. 2024.

  44. [44]

     MegaMath: Pushing the Limits of Open Math Corpora

     Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, and Eric P. Xing. Megamath: Pushing the limits of open math corpora. arXiv preprint arXiv:2504.02807, 2025. Preprint.

  45. [45]

    Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset

    Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc-math: A 133 billion-token-scale high quality math pretraining dataset. 2025

  46. [46]

    NVIDIA, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, et al.

  47. [47]

    How to train long-context language models (effectively)

    Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660, 2024

  48. [48]

    Flame: Flash language modeling made easy, January 2025

    Yu Zhang and Songlin Yang. Flame: Flash language modeling made easy, January 2025

  49. [49]

    Torchtitan: One-stop pytorch native solution for production ready LLM pretraining

    Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready LLM pretraining. In The Thirteenth International Conference on Learning Representations, 2025

  50. [50]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025

  51. [51]

    Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy

    Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy. arXiv preprint arXiv:2506.13284, 2025

  52. [52]

    Opencodereasoning: Advancing data distillation for competitive coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025

  53. [53]

    Llama-nemotron: Efficient reasoning models

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949, 2025

  54. [54]

    Reverse-engineered reasoning for open-ended generation

    Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Wei Ye, Tong Yang, Wenhao Huang, et al. Reverse-engineered reasoning for open-ended generation. arXiv preprint arXiv:2509.06160, 2025

  55. [55]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372, 2024

  56. [56]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  57. [57]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  58. [58]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, 07 2024

  59. [59]

    Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  60. [60]

    Aime 2024

    HuggingFaceH4. Aime 2024. https://huggingface.co/datasets/HuggingFaceH4/aime_2024, 2024. 30 problems from AIME I & II 2024

  61. [61]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024

  62. [62]

    GPQA: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023

  63. [63]

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines

    M-A-P Team, Xinrun Du, Yifan Yao, et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025

  64. [64]

    BeyondAIME

    ByteDance-Seed. BeyondAIME. https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME, 2025. CC0-1.0 license

  65. [65]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025

  66. [66]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota, June 2019

  67. [67]

    Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. In Proceedings of the 13th International Conference on Learning Representations, ICLR ’25, April 2025. Full version available at https://ssrn.com/abstract=5250617

  68. [68]

    Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

    Zeyuan Allen-Zhu. Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers. SSRN Electronic Journal, May 2025. https://ssrn.com/abstract=5240330

  69. [69]

    Language models can learn implicit multi-hop reasoning, but only if they have lots of training data

    Yuekun Yao, Yupei Du, Dawei Zhu, Michael Hahn, and Alexander Koller. Language models can learn implicit multi-hop reasoning, but only if they have lots of training data. arXiv preprint arXiv:2505.17923, 2025

  70. [70]

    Reasoning by superposition: A theoretical perspective on chain of continuous thought

    Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint arXiv:2505.12514, 2025

  71. [71]

    Understanding transformer from the perspective of associative memory

    Shu Zhong, Mingyu Xu, Tenglong Ao, and Guang Shi. Understanding transformer from the perspective of associative memory. arXiv preprint arXiv:2505.19488, 2025

  72. [72]

    On prompt-driven safeguarding for large language models

    Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. In Proceedings of the 41st International Conference on Machine Learning, pages 61593–61613, 2024

  73. [73]

    Post-hoc reasoning in chain of thought, December 2024

    Kyle Cox. Post-hoc reasoning in chain of thought, December 2024. Blog post

  74. [74]

    Chain-of-thought reasoning in the wild is not always faithful

    Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025

  75. [75]

    Chain-of-thought is not explainability

    Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nicolas Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, Adel Bibi, Robert Trager, Damiano Fornasiere, John Yan, Yanai Elazar, and Yoshua Bengio. Chain-of-thought is not explainability. 2025

  76. [76]

    Chain of thought monitorability: A new and fragile opportunity for ai safety

    Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, et al. Chain of thought monitorability: A new and fragile opportunity for ai safety. arXiv preprint arXiv:2507.11473, 2025

  77. [77]

    Quora question pairs

    Quora. Quora question pairs. https://www.kaggle.com/competitions/quora-question-pairs/, 2017. Kaggle competition

  78. [78]

    Understanding transformer reasoning capabilities via graph algorithms

    Clayton Sanford, Bahare Fatemi, Ethan Hall, Anton Tsitsulin, Mehran Kazemi, Jonathan Halcrow, Bryan Perozzi, and Vahab Mirrokni. Understanding transformer reasoning capabilities via graph algorithms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, p...

  79. [79]

    Transformers, parallel computation, and logarithmic depth

    Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Transformers, parallel computation, and logarithmic depth. arXiv preprint arXiv:2402.09268, 2024

  80. [80]

    Transformers learn shortcuts to automata

    Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

Showing first 80 references.