pith. machine review for the scientific record.

arxiv: 2502.05171 · v2 · submitted 2025-02-07 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Abhinav Bhatele, Bhavya Kailkhura, Brian R. Bartoldson, John Kirchenbauer, Jonas Geiping, Neel Jain, Sean McLeish, Siddharth Singh, Tom Goldstein

Pith reviewed 2026-05-12 15:35 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords recurrent language models · test-time compute · latent reasoning · reasoning benchmarks · model architecture · inference scaling

The pith

A language model scales test-time reasoning by repeatedly applying one recurrent block in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an architecture that iterates a single recurrent block during inference, unrolling it to any chosen depth to perform additional computation inside the model's hidden states. This differs from token-based scaling methods such as chain-of-thought, which expand output length and often require specialized training data. The authors train a proof-of-concept model with 3.5 billion parameters on 800 billion tokens and report that extra iterations raise scores on reasoning benchmarks, up to a computational load equivalent to a 50-billion-parameter model. A reader would care because the method promises to increase effective compute without larger models, longer contexts, or verbose outputs, and because it may support reasoning that is difficult to express in words.

Core claim

Iterating a recurrent block at test time allows the model to perform implicit reasoning steps in latent space, producing measurable gains on reasoning benchmarks that grow with the number of iterations, up to a computational load equivalent to a 50-billion-parameter model.

What carries the argument

A recurrent block that is applied repeatedly at inference time, thereby unrolling the network to variable depth while operating entirely in the model's internal latent representations.
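
The mechanism is compact enough to sketch. The following is a minimal illustration only, assuming a prelude embedding, a single weight-shared core block, and a zero-initialized latent state; the module choices and sizes are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only: module names, sizes, and the latent-state
# initialization are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)       # "prelude"
        self.core = nn.TransformerEncoderLayer(               # shared block,
            d_model, n_heads, batch_first=True)               # reused every iteration
        self.out = nn.Linear(d_model, vocab_size)             # "coda"

    def forward(self, tokens, num_iterations: int):
        x = self.embed(tokens)                 # fixed input injection
        state = torch.zeros_like(x)            # latent state (assumed init)
        for _ in range(num_iterations):        # unroll to any depth at test time
            state = self.core(state + x)       # same weights applied each step
        return self.out(state)

model = RecurrentDepthLM()
tokens = torch.randint(0, 32_000, (1, 16))
logits_fast = model(tokens, num_iterations=4)    # cheap pass
logits_deep = model(tokens, num_iterations=32)   # more test-time compute, same weights
```

Because the same weights are reused at every step, the iteration count is an inference-time knob rather than an architectural constant, which is what allows depth, and hence compute, to be chosen per query.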

If this is right

  • Reasoning performance can be scaled at test time without increasing the number of output tokens generated.
  • The approach requires no chain-of-thought style training data or expanded context windows.
  • Types of reasoning that resist verbal description can still be captured inside the latent iterations.
  • A fixed-size model can deliver compute levels equivalent to much larger models by choosing how many iterations to run.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models using this architecture could allocate compute dynamically, running more iterations only on difficult inputs (see the sketch after this list).
  • The same recurrent block could be inserted into existing transformer models to add a latent-reasoning mode without full retraining.
  • If the gains hold on broader task suites, training compute could be traded for inference compute in future model design.
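
The first extension above, dynamic compute allocation, could hypothetically be driven by how quickly the latent state settles. The stopping rule and tolerance below are illustrative assumptions, not something the paper proposes:

```python
import torch

def adaptive_unroll(core, x, max_iterations=64, tol=1e-3):
    """Iterate a shared core block until the latent state stops changing.

    The convergence-based stopping rule and the tolerance are assumptions
    made for illustration; the paper only claims that depth can be chosen
    freely at inference time.
    """
    state = torch.zeros_like(x)
    for step in range(1, max_iterations + 1):
        new_state = core(state + x)
        # Relative change of the latent state as a crude "still thinking" signal.
        delta = (new_state - state).norm() / (state.norm() + 1e-8)
        state = new_state
        if delta < tol:
            break          # easy input: stop early, spend less compute
    return state, step     # hard inputs run closer to max_iterations
```

Here `core` could be the shared block from the previous sketch, applied to the embedded input `x`; the returned step count is the per-input compute budget actually spent.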

Load-bearing premise

Repeated applications of the same block produce genuine additional reasoning steps rather than merely adding non-informative computation or fitting to benchmark patterns.

What would settle it

If further iterations after a modest number cease to improve accuracy on held-out reasoning tasks or begin to degrade it, the claim that the iterations perform useful latent reasoning would be falsified.
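
Operationally, that test is a curve of held-out accuracy against iteration count with everything else held fixed. A minimal harness might look like the following, where `load_benchmark` and `evaluate_accuracy` are hypothetical stand-ins for the actual benchmark and scoring code:

```python
# Hypothetical harness: `load_benchmark` and `evaluate_accuracy` are stand-ins
# for whatever held-out reasoning benchmark and scoring code are actually used.
def iteration_scaling_curve(model, load_benchmark, evaluate_accuracy,
                            iteration_counts=(1, 2, 4, 8, 16, 32, 64)):
    dataset = load_benchmark()
    curve = {}
    for r in iteration_counts:
        # Same weights, same inputs; only the test-time depth changes.
        curve[r] = evaluate_accuracy(model, dataset, num_iterations=r)
    return curve

# If accuracy plateaus or degrades after a modest r, the latent-reasoning
# interpretation of the extra iterations is undermined.
```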

read the original abstract

We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a recurrent-depth language model architecture that scales test-time computation by repeatedly applying a shared recurrent block, unrolling to arbitrary depth in latent space rather than generating additional tokens. The approach requires no specialized chain-of-thought training data and works with small context windows. The authors train a 3.5B-parameter model on 800B tokens and report that increased test-time iterations yield performance gains on reasoning benchmarks, sometimes reaching levels claimed to be equivalent to a 50B-parameter model.

Significance. If the empirical results hold under proper controls, the work offers a concrete alternative to token-based test-time scaling and could enable more efficient capture of non-verbalizable reasoning steps. The scaling of the proof-of-concept to 3.5B parameters and 800B tokens demonstrates practical feasibility and provides initial evidence that recurrent unrolling can improve benchmark scores. These strengths are tempered by the absence of detailed ablations and compute-equivalence measurements in the current manuscript.

major comments (2)
  1. Abstract and experimental results section: the central claim that recurrent iterations achieve performance 'equivalent to a 50 billion parameter model' is load-bearing for the paper's contribution, yet the manuscript provides no explicit definition or measurement protocol for this equivalence (e.g., total FLOPs, wall-clock time, or parameter-equivalent compute), no statistical significance tests, and no ablations against non-recurrent baselines that receive the same additional compute budget.
  2. Method and experimental sections: the claim that unrolling the recurrent block performs 'genuine additional reasoning in latent space' rather than redundant computation or benchmark overfitting requires supporting evidence such as scaling curves across iteration counts, comparisons to equivalent-FLOP feed-forward models, and controls that isolate the effect of recurrence from simple extra depth or training artifacts.
minor comments (1)
  1. The abstract would benefit from a brief statement of the recurrent block's parameter sharing and how depth is controlled at inference time to help readers immediately distinguish the method from standard transformer scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments below and will incorporate revisions to clarify the equivalence claim, add supporting analyses, and strengthen the evidence for latent-space reasoning.

read point-by-point responses
  1. Referee: Abstract and experimental results section: the central claim that recurrent iterations achieve performance 'equivalent to a 50 billion parameter model' is load-bearing for the paper's contribution, yet the manuscript provides no explicit definition or measurement protocol for this equivalence (e.g., total FLOPs, wall-clock time, or parameter-equivalent compute), no statistical significance tests, and no ablations against non-recurrent baselines that receive the same additional compute budget.

    Authors: We agree that the equivalence claim requires a precise definition and additional controls. In the revised manuscript we will explicitly define equivalence via total inference FLOPs (comparing recurrent unrolling compute to the forward pass of a 50B model), report statistical significance tests on the benchmark gains, and add ablations against non-recurrent baselines allocated identical extra compute. These changes will make the central claim transparent and reproducible. revision: yes

  2. Referee: Method and experimental sections: the claim that unrolling the recurrent block performs 'genuine additional reasoning in latent space' rather than redundant computation or benchmark overfitting requires supporting evidence such as scaling curves across iteration counts, comparisons to equivalent-FLOP feed-forward models, and controls that isolate the effect of recurrence from simple extra depth or training artifacts.

    Authors: We acknowledge the need for stronger evidence. The revised version will include performance scaling curves versus iteration count, direct comparisons to feed-forward models matched on total FLOPs, and controls that vary depth in non-recurrent architectures while holding training data and parameters fixed. These additions will help isolate the contribution of recurrence and reduce concerns about redundancy or overfitting. revision: yes
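
Both exchanges turn on a FLOP-matched comparison. A back-of-envelope version of that accounting is sketched below, using the rough convention of about 2 FLOPs per parameter per processed token for a dense forward pass; the split between recurrent-core and non-recurrent parameters is assumed for illustration and is not a figure taken from the paper.

```python
def forward_flops_dense(n_params, n_tokens):
    # Standard rough estimate: ~2 FLOPs per parameter per processed token.
    return 2 * n_params * n_tokens

def forward_flops_recurrent(core_params, other_params, n_tokens, num_iterations):
    # The shared core is applied `num_iterations` times; prelude/coda run once.
    return 2 * (other_params + num_iterations * core_params) * n_tokens

# Illustrative numbers only: the 1.5B core / 2.0B non-recurrent split of the
# 3.5B model is an assumption, not a figure reported in the abstract.
tokens = 1
baseline = forward_flops_dense(50e9, tokens)
for r in (1, 4, 16, 32, 64):
    ours = forward_flops_recurrent(1.5e9, 2.0e9, tokens, r)
    print(f"r={r:>2}: {ours / baseline:.2f}x the FLOPs of a 50B dense forward pass")
```

Under these assumed numbers, roughly 32 iterations match the forward-pass FLOPs of a 50B dense model; making this kind of calculation explicit is exactly what the revised equivalence claim would require.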

Circularity Check

0 steps flagged

No significant circularity; empirical scaling results with no derivation chain

full rationale

The paper introduces a recurrent-depth architecture that iterates a shared block at test time to scale compute in latent space, reporting empirical gains on reasoning benchmarks equivalent to much larger models. No equations, derivations, fitted parameters, or uniqueness theorems are presented in the provided text. The central claim rests entirely on experimental outcomes rather than any self-definitional reduction, fitted-input prediction, or load-bearing self-citation. This is the expected non-finding for an architecture paper whose value is demonstrated by benchmarks, not by mathematical construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that latent-space iteration adds useful reasoning capacity; no free parameters, invented entities, or additional axioms are visible from the abstract alone.

axioms (1)
  • domain assumption: Iterating the recurrent block performs additional useful computation equivalent to deeper reasoning.
    This assumption is required for the test-time scaling claim to hold but is not derived or justified in the provided abstract.

pith-pipeline@v0.9.0 · 5454 in / 1114 out tokens · 47545 ms · 2026-05-12T15:35:14.010717+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stability and Generalization in Looped Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...

  2. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  3. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  4. Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics

    cs.LG 2026-05 unverdicted novelty 7.0

    Bifurcation models represent set-valued solution maps via weight-tied equilibrium dynamics whose attractors encode multiple solutions, with a proof that broad locally Lipschitz set-valued maps admit regular dynamical ...

  5. Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

    cs.LG 2026-05 conditional novelty 7.0

    Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...

  6. SMolLM: Small Language Models Learn Small Molecular Grammar

    cs.LG 2026-05 unverdicted novelty 7.0

    A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

  7. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  8. A Mechanistic Analysis of Looped Reasoning Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

  9. Training Large Language Models to Reason in a Continuous Latent Space

    cs.CL 2024-12 unverdicted novelty 7.0

    Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...

  10. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  11. Sparse Layers are Critical to Scaling Looped Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  12. Factorized Latent Reasoning for LLM-based Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    FLR factorizes latent reasoning into multiple preference factors using multi-factor attention and regularizations, outperforming baselines on recommendation benchmarks while adding robustness and interpretability.

  13. MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

    cs.CV 2026-04 unverdicted novelty 6.0

    MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.

  14. The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

    cs.CV 2026-04 unverdicted novelty 6.0

    A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.

  15. Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

  16. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  17. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  18. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.

  19. C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions

    cs.LG 2026-04 unverdicted novelty 6.0

    C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...

  20. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  21. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  22. Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

    cs.LG 2026-04 conditional novelty 6.0

    LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.

  23. Dream 7B: Diffusion Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...

  24. Reasoning Primitives in Hybrid and Non-Hybrid LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.

  25. Hyperloop Transformers

    cs.LG 2026-04 unverdicted novelty 5.0

    Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.

  26. LEPO: Latent Reasoning Policy Optimization for Large Language Models

    cs.LG 2026-04 unverdicted novelty 5.0

    LEPO applies RL to continuous latent representations in LLMs by injecting Gumbel-Softmax stochasticity for diverse trajectory sampling and unified gradient estimation, outperforming existing discrete and latent RL methods.

  27. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  28. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  29. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Reference graph

Works this paper leans on

187 extracted references · 187 canonical work pages · cited by 28 Pith papers · 32 internal anchors
