pith. machine review for the scientific record.

arxiv: 2502.05171 · v2 · submitted 2025-02-07 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Abhinav Bhatele, Bhavya Kailkhura, Brian R. Bartoldson, John Kirchenbauer, Jonas Geiping, Neel Jain, Sean McLeish, Siddharth Singh, Tom Goldstein

Pith reviewed 2026-05-12 15:35 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords recurrent language models · test-time compute · latent reasoning · reasoning benchmarks · model architecture · inference scaling

The pith

A language model scales test-time reasoning by repeatedly applying one recurrent block in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an architecture that iterates a single recurrent block during inference, unrolling it to any chosen depth to perform additional computation inside the model's hidden states. This differs from token-based scaling methods such as chain-of-thought, which expand output length and often require specialized training data. The authors train a proof-of-concept model with 3.5 billion parameters on 800 billion tokens and report that extra iterations raise scores on reasoning benchmarks, up to a computational load equivalent to a 50-billion-parameter model. A reader would care because the method promises to increase effective compute without larger models, longer contexts, or verbose outputs, and because it may support reasoning that is difficult to express in words.

Core claim

Iterating a recurrent block at test time allows the model to perform implicit reasoning steps in latent space, producing measurable gains on reasoning benchmarks that grow with the number of iterations, up to a computational load equivalent to a 50-billion-parameter model.

What carries the argument

A recurrent block that is applied repeatedly at inference time, thereby unrolling the network to variable depth while operating entirely in the model's internal latent representations.
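
The mechanism is compact enough to sketch. The following is a minimal illustration only, assuming a prelude embedding, a single weight-shared core block, and a zero-initialized latent state; the module choices and sizes are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only: module names, sizes, and the latent-state
# initialization are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)       # "prelude"
        self.core = nn.TransformerEncoderLayer(               # shared block,
            d_model, n_heads, batch_first=True)               # reused every iteration
        self.out = nn.Linear(d_model, vocab_size)             # "coda"

    def forward(self, tokens, num_iterations: int):
        x = self.embed(tokens)                 # fixed input injection
        state = torch.zeros_like(x)            # latent state (assumed init)
        for _ in range(num_iterations):        # unroll to any depth at test time
            state = self.core(state + x)       # same weights applied each step
        return self.out(state)

model = RecurrentDepthLM()
tokens = torch.randint(0, 32_000, (1, 16))
logits_fast = model(tokens, num_iterations=4)    # cheap pass
logits_deep = model(tokens, num_iterations=32)   # more test-time compute, same weights
```

Because the same weights are reused at every step, the iteration count is an inference-time knob rather than an architectural constant, which is what allows depth, and hence compute, to be chosen per query.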

If this is right

  • Reasoning performance can be scaled at test time without increasing the number of output tokens generated.
  • The approach requires no chain-of-thought style training data or expanded context windows.
  • Types of reasoning that resist verbal description can still be captured inside the latent iterations.
  • A fixed-size model can deliver compute levels equivalent to much larger models by choosing how many iterations to run.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models using this architecture could allocate compute dynamically, running more iterations only on difficult inputs (see the sketch after this list).
  • The same recurrent block could be inserted into existing transformer models to add a latent-reasoning mode without full retraining.
  • If the gains hold on broader task suites, training compute could be traded for inference compute in future model design.
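
The first extension above, dynamic compute allocation, could hypothetically be driven by how quickly the latent state settles. The stopping rule and tolerance below are illustrative assumptions, not something the paper proposes:

```python
import torch

def adaptive_unroll(core, x, max_iterations=64, tol=1e-3):
    """Iterate a shared core block until the latent state stops changing.

    The convergence-based stopping rule and the tolerance are assumptions
    made for illustration; the paper only claims that depth can be chosen
    freely at inference time.
    """
    state = torch.zeros_like(x)
    for step in range(1, max_iterations + 1):
        new_state = core(state + x)
        # Relative change of the latent state as a crude "still thinking" signal.
        delta = (new_state - state).norm() / (state.norm() + 1e-8)
        state = new_state
        if delta < tol:
            break          # easy input: stop early, spend less compute
    return state, step     # hard inputs run closer to max_iterations
```

Here `core` could be the shared block from the previous sketch, applied to the embedded input `x`; the returned step count is the per-input compute budget actually spent.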

Load-bearing premise

Repeated applications of the same block produce genuine additional reasoning steps rather than merely adding non-informative computation or fitting to benchmark patterns.

What would settle it

If further iterations after a modest number cease to improve accuracy on held-out reasoning tasks or begin to degrade it, the claim that the iterations perform useful latent reasoning would be falsified.
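
Operationally, that test is a curve of held-out accuracy against iteration count with everything else held fixed. A minimal harness might look like the following, where `load_benchmark` and `evaluate_accuracy` are hypothetical stand-ins for the actual benchmark and scoring code:

```python
# Hypothetical harness: `load_benchmark` and `evaluate_accuracy` are stand-ins
# for whatever held-out reasoning benchmark and scoring code are actually used.
def iteration_scaling_curve(model, load_benchmark, evaluate_accuracy,
                            iteration_counts=(1, 2, 4, 8, 16, 32, 64)):
    dataset = load_benchmark()
    curve = {}
    for r in iteration_counts:
        # Same weights, same inputs; only the test-time depth changes.
        curve[r] = evaluate_accuracy(model, dataset, num_iterations=r)
    return curve

# If accuracy plateaus or degrades after a modest r, the latent-reasoning
# interpretation of the extra iterations is undermined.
```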

read the original abstract

We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a recurrent-depth language model architecture that scales test-time computation by repeatedly applying a shared recurrent block, unrolling to arbitrary depth in latent space rather than generating additional tokens. The approach requires no specialized chain-of-thought training data and works with small context windows. The authors train a 3.5B-parameter model on 800B tokens and report that increased test-time iterations yield performance gains on reasoning benchmarks, sometimes reaching levels claimed to be equivalent to a 50B-parameter model.

Significance. If the empirical results hold under proper controls, the work offers a concrete alternative to token-based test-time scaling and could enable more efficient capture of non-verbalizable reasoning steps. The scaling of the proof-of-concept to 3.5B parameters and 800B tokens demonstrates practical feasibility and provides initial evidence that recurrent unrolling can improve benchmark scores. These strengths are tempered by the absence of detailed ablations and compute-equivalence measurements in the current manuscript.

major comments (2)
  1. Abstract and experimental results section: the central claim that recurrent iterations achieve performance 'equivalent to a 50 billion parameter model' is load-bearing for the paper's contribution, yet the manuscript provides no explicit definition or measurement protocol for this equivalence (e.g., total FLOPs, wall-clock time, or parameter-equivalent compute), no statistical significance tests, and no ablations against non-recurrent baselines that receive the same additional compute budget.
  2. Method and experimental sections: the claim that unrolling the recurrent block performs 'genuine additional reasoning in latent space' rather than redundant computation or benchmark overfitting requires supporting evidence such as scaling curves across iteration counts, comparisons to equivalent-FLOP feed-forward models, and controls that isolate the effect of recurrence from simple extra depth or training artifacts.
minor comments (1)
  1. The abstract would benefit from a brief statement of the recurrent block's parameter sharing and how depth is controlled at inference time to help readers immediately distinguish the method from standard transformer scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments below and will incorporate revisions to clarify the equivalence claim, add supporting analyses, and strengthen the evidence for latent-space reasoning.

read point-by-point responses
  1. Referee: Abstract and experimental results section: the central claim that recurrent iterations achieve performance 'equivalent to a 50 billion parameter model' is load-bearing for the paper's contribution, yet the manuscript provides no explicit definition or measurement protocol for this equivalence (e.g., total FLOPs, wall-clock time, or parameter-equivalent compute), no statistical significance tests, and no ablations against non-recurrent baselines that receive the same additional compute budget.

    Authors: We agree that the equivalence claim requires a precise definition and additional controls. In the revised manuscript we will explicitly define equivalence via total inference FLOPs (comparing recurrent unrolling compute to the forward pass of a 50B model), report statistical significance tests on the benchmark gains, and add ablations against non-recurrent baselines allocated identical extra compute. These changes will make the central claim transparent and reproducible. revision: yes

  2. Referee: Method and experimental sections: the claim that unrolling the recurrent block performs 'genuine additional reasoning in latent space' rather than redundant computation or benchmark overfitting requires supporting evidence such as scaling curves across iteration counts, comparisons to equivalent-FLOP feed-forward models, and controls that isolate the effect of recurrence from simple extra depth or training artifacts.

    Authors: We acknowledge the need for stronger evidence. The revised version will include performance scaling curves versus iteration count, direct comparisons to feed-forward models matched on total FLOPs, and controls that vary depth in non-recurrent architectures while holding training data and parameters fixed. These additions will help isolate the contribution of recurrence and reduce concerns about redundancy or overfitting. revision: yes
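
Both exchanges turn on a FLOP-matched comparison. A back-of-envelope version of that accounting is sketched below, using the rough convention of about 2 FLOPs per parameter per processed token for a dense forward pass; the split between recurrent-core and non-recurrent parameters is assumed for illustration and is not a figure taken from the paper.

```python
def forward_flops_dense(n_params, n_tokens):
    # Standard rough estimate: ~2 FLOPs per parameter per processed token.
    return 2 * n_params * n_tokens

def forward_flops_recurrent(core_params, other_params, n_tokens, num_iterations):
    # The shared core is applied `num_iterations` times; prelude/coda run once.
    return 2 * (other_params + num_iterations * core_params) * n_tokens

# Illustrative numbers only: the 1.5B core / 2.0B non-recurrent split of the
# 3.5B model is an assumption, not a figure reported in the abstract.
tokens = 1
baseline = forward_flops_dense(50e9, tokens)
for r in (1, 4, 16, 32, 64):
    ours = forward_flops_recurrent(1.5e9, 2.0e9, tokens, r)
    print(f"r={r:>2}: {ours / baseline:.2f}x the FLOPs of a 50B dense forward pass")
```

Under these assumed numbers, roughly 32 iterations match the forward-pass FLOPs of a 50B dense model; making this kind of calculation explicit is exactly what the revised equivalence claim would require.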

Circularity Check

0 steps flagged

No significant circularity; empirical scaling results with no derivation chain

full rationale

The paper introduces a recurrent-depth architecture that iterates a shared block at test time to scale compute in latent space, reporting empirical gains on reasoning benchmarks equivalent to much larger models. No equations, derivations, fitted parameters, or uniqueness theorems are presented in the provided text. The central claim rests entirely on experimental outcomes rather than any self-definitional reduction, fitted-input prediction, or load-bearing self-citation. This is the expected non-finding for an architecture paper whose value is demonstrated by benchmarks, not by mathematical construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that latent-space iteration adds useful reasoning capacity; no free parameters, invented entities, or additional axioms are visible from the abstract alone.

axioms (1)
  • domain assumption: Iterating the recurrent block performs additional useful computation equivalent to deeper reasoning.
    This assumption is required for the test-time scaling claim to hold but is not derived or justified in the provided abstract.

pith-pipeline@v0.9.0 · 5454 in / 1114 out tokens · 47545 ms · 2026-05-12T15:35:14.010717+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stability and Generalization in Looped Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...

  2. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  3. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  4. Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics

    cs.LG 2026-05 unverdicted novelty 7.0

    Bifurcation models represent set-valued solution maps via weight-tied equilibrium dynamics whose attractors encode multiple solutions, with a proof that broad locally Lipschitz set-valued maps admit regular dynamical ...

  5. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

    cs.LG 2026-05 conditional novelty 7.0

    Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...

  6. SMolLM: Small Language Models Learn Small Molecular Grammar

    cs.LG 2026-05 unverdicted novelty 7.0

    A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

  7. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  8. A Mechanistic Analysis of Looped Reasoning Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

  9. Training Large Language Models to Reason in a Continuous Latent Space

    cs.CL 2024-12 unverdicted novelty 7.0

    Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...

  10. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  11. Sparse Layers are Critical to Scaling Looped Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  12. Factorized Latent Reasoning for LLM-based Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    FLR factorizes latent reasoning into multiple preference factors using multi-factor attention and regularizations, outperforming baselines on recommendation benchmarks while adding robustness and interpretability.

  13. MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

    cs.CV 2026-04 unverdicted novelty 6.0

    MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.

  14. The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

    cs.CV 2026-04 unverdicted novelty 6.0

    A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.

  15. Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

  16. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  17. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  18. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.

  19. C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions

    cs.LG 2026-04 unverdicted novelty 6.0

    C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...

  20. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  21. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  22. Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

    cs.LG 2026-04 conditional novelty 6.0

    LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.

  23. Dream 7B: Diffusion Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...

  24. Reasoning Primitives in Hybrid and Non-Hybrid LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.

  25. Hyperloop Transformers

    cs.LG 2026-04 unverdicted novelty 5.0

    Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

  26. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 5.0

    LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.

  27. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  28. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  29. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Reference graph

Works this paper leans on

187 extracted references · 187 canonical work pages · cited by 28 Pith papers · 32 internal anchors
