Recognition: 2 theorem links
· Lean Theorem · Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Pith reviewed 2026-05-12 15:35 UTC · model grok-4.3
The pith
A language model scales test-time reasoning by repeatedly applying one recurrent block in latent space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Iterating a recurrent block at test time allows the model to perform implicit reasoning steps in latent space, producing measurable gains on reasoning benchmarks that grow with the number of iterations, up to a compute load equivalent to a 50-billion-parameter model.
What carries the argument
A recurrent block that is applied repeatedly at inference time, thereby unrolling the network to variable depth while operating entirely in the model's internal latent representations.
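The mechanism can be made concrete with a toy sketch. The affine-plus-tanh map below is a hypothetical stand-in for the paper's transformer core block, at toy scale; what it preserves is the essential move of reusing one set of weights for a variable number of latent iterations at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # latent width (toy scale; the paper's model is 3.5B parameters)

# One shared "recurrent block": the same weights are reused at every
# iteration. This map is an illustrative stand-in, not the paper's block.
W = rng.normal(scale=0.05, size=(d, d))
b = rng.normal(scale=0.1, size=d)

def core_block(h, x):
    # Re-inject the input embedding each step and refine the latent state.
    return np.tanh(h @ W + x + b)

def unroll(x, num_iterations):
    """Apply the shared block num_iterations times: test-time depth."""
    h = np.zeros(d)
    for _ in range(num_iterations):
        h = core_block(h, x)
    return h

x = rng.normal(size=d)        # stand-in for an embedded token
h_shallow = unroll(x, 4)      # little test-time compute
h_deep = unroll(x, 64)        # much more compute, identical parameters
# With a contractive block, iterates settle toward an input-dependent
# fixed point rather than diverging.
print(np.linalg.norm(unroll(x, 64) - unroll(x, 63)))
```

Note that the parameter count is identical for both calls; only the depth of the unrolled computation changes, which is the contrast with token-based test-time scaling.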
If this is right
- Reasoning performance can be scaled at test time without increasing the number of output tokens generated.
- The approach requires no chain-of-thought style training data or expanded context windows.
- Types of reasoning that resist verbal description can still be captured inside the latent iterations.
- A fixed-size model can spend compute equivalent to much larger models by choosing how many iterations to run.
Where Pith is reading between the lines
- Models using this architecture could allocate compute dynamically, running more iterations only on difficult inputs.
- The same recurrent block could be inserted into existing transformer models to add a latent-reasoning mode without full retraining.
- If the gains hold on broader task suites, training compute could be traded for inference compute in future model design.
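The dynamic-allocation speculation above can be sketched under one assumption not claimed by the paper: that the latent state approaches a fixed point, so iteration can stop once the state stops moving. The stopping rule and block below are illustrative, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W = rng.normal(scale=0.05, size=(d, d))  # shared block weights (illustrative)

def step(h, x):
    return np.tanh(h @ W + x)

def adaptive_unroll(x, tol=1e-6, max_iters=256):
    """Iterate the shared block until the latent state stops moving."""
    h = np.zeros(d)
    for i in range(1, max_iters + 1):
        h_next = step(h, x)
        if np.linalg.norm(h_next - h) < tol:
            return h_next, i        # converged: exit early, save compute
        h = h_next
    return h, max_iters             # budget exhausted: hard input

x = rng.normal(size=d)
h_star, n_iters = adaptive_unroll(x)
print(n_iters)  # well below the 256-iteration budget for this input
```

Under this criterion, compute is spent per input rather than per model: easy inputs exit early, and only inputs whose latent dynamics converge slowly consume the full budget.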
Load-bearing premise
Repeated applications of the same block produce genuine additional reasoning steps rather than merely adding non-informative computation or fitting to benchmark patterns.
What would settle it
If further iterations after a modest number cease to improve accuracy on held-out reasoning tasks or begin to degrade it, the claim that the iterations perform useful latent reasoning would be falsified.
read the original abstract
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a recurrent-depth language model architecture that scales test-time computation by repeatedly applying a shared recurrent block, unrolling to arbitrary depth in latent space rather than generating additional tokens. The approach requires no specialized chain-of-thought training data and works with small context windows. The authors train a 3.5B-parameter model on 800B tokens and report that increased test-time iterations yield performance gains on reasoning benchmarks, sometimes reaching levels claimed to be equivalent to a 50B-parameter model.
Significance. If the empirical results hold under proper controls, the work offers a concrete alternative to token-based test-time scaling and could enable more efficient capture of non-verbalizable reasoning steps. The scaling of the proof-of-concept to 3.5B parameters and 800B tokens demonstrates practical feasibility and provides initial evidence that recurrent unrolling can improve benchmark scores. These strengths are tempered by the absence of detailed ablations and compute-equivalence measurements in the current manuscript.
major comments (2)
- Abstract and experimental results section: the central claim that recurrent iterations achieve performance 'equivalent to a 50 billion parameter model' is load-bearing for the paper's contribution, yet the manuscript provides no explicit definition or measurement protocol for this equivalence (e.g., total FLOPs, wall-clock time, or parameter-equivalent compute), no statistical significance tests, and no ablations against non-recurrent baselines that receive the same additional compute budget.
- Method and experimental sections: the claim that unrolling the recurrent block performs 'genuine additional reasoning in latent space' rather than redundant computation or benchmark overfitting requires supporting evidence such as scaling curves across iteration counts, comparisons to equivalent-FLOP feed-forward models, and controls that isolate the effect of recurrence from simple extra depth or training artifacts.
minor comments (1)
- The abstract would benefit from a brief statement of the recurrent block's parameter sharing and how depth is controlled at inference time to help readers immediately distinguish the method from standard transformer scaling.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comments below and will incorporate revisions to clarify the equivalence claim, add supporting analyses, and strengthen the evidence for latent-space reasoning.
read point-by-point responses
-
Referee: Abstract and experimental results section: the central claim that recurrent iterations achieve performance 'equivalent to a 50 billion parameter model' is load-bearing for the paper's contribution, yet the manuscript provides no explicit definition or measurement protocol for this equivalence (e.g., total FLOPs, wall-clock time, or parameter-equivalent compute), no statistical significance tests, and no ablations against non-recurrent baselines that receive the same additional compute budget.
Authors: We agree that the equivalence claim requires a precise definition and additional controls. In the revised manuscript we will explicitly define equivalence via total inference FLOPs (comparing recurrent unrolling compute to the forward pass of a 50B model), report statistical significance tests on the benchmark gains, and add ablations against non-recurrent baselines allocated identical extra compute. These changes will make the central claim transparent and reproducible. revision: yes
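The proposed FLOPs equivalence admits a back-of-the-envelope check. The 2·N FLOPs-per-token rule of thumb and the assumption that every parameter sits in the recurrent block are simplifications for illustration, not figures from the paper:

```python
# Rule of thumb: a dense transformer forward pass costs roughly
# 2 * N FLOPs per token for N (non-embedding) parameters. Both this
# rule and the assumption that all parameters participate in each
# recurrent iteration are illustrative simplifications.
def forward_flops_per_token(n_params):
    return 2 * n_params

base = 3.5e9     # recurrent-depth model size (from the abstract)
target = 50e9    # comparison compute load (from the abstract)

# Recurrent iterations of the 3.5B model whose total forward FLOPs
# match one forward pass of a 50B dense model.
iters = forward_flops_per_token(target) / forward_flops_per_token(base)
print(round(iters, 1))  # ≈ 14.3
```

Under these assumptions, the 50B-equivalence claim corresponds to roughly fourteen iterations of the shared block, which gives a concrete target for the requested ablations.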
-
Referee: Method and experimental sections: the claim that unrolling the recurrent block performs 'genuine additional reasoning in latent space' rather than redundant computation or benchmark overfitting requires supporting evidence such as scaling curves across iteration counts, comparisons to equivalent-FLOP feed-forward models, and controls that isolate the effect of recurrence from simple extra depth or training artifacts.
Authors: We acknowledge the need for stronger evidence. The revised version will include performance scaling curves versus iteration count, direct comparisons to feed-forward models matched on total FLOPs, and controls that vary depth in non-recurrent architectures while holding training data and parameters fixed. These additions will help isolate the contribution of recurrence and reduce concerns about redundancy or overfitting. revision: yes
Circularity Check
No significant circularity; empirical scaling results with no derivation chain
full rationale
The paper introduces a recurrent-depth architecture that iterates a shared block at test time to scale compute in latent space, reporting empirical gains on reasoning benchmarks equivalent to much larger models. No equations, derivations, fitted parameters, or uniqueness theorems are presented in the provided text. The central claim rests entirely on experimental outcomes rather than any self-definitional reduction, fitted-input prediction, or load-bearing self-citation. This is the expected non-finding for an architecture paper whose value is demonstrated by benchmarks, not by mathematical construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Iterating the recurrent block performs additional useful computation equivalent to deeper reasoning
Lean theorems connected to this paper
-
Foundation/EightTick.lean or DimensionForcing.lean (eight_tick_forces_D3 or alexander_duality_circle_linking) echoes: "Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens."
Forward citations
Cited by 29 Pith papers
-
Stability and Generalization in Looped Transformers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...
-
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
-
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
-
Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics
Bifurcation models represent set-valued solution maps via weight-tied equilibrium dynamics whose attractors encode multiple solutions, with a proof that broad locally Lipschitz set-valued maps admit regular dynamical ...
-
Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...
-
SMolLM: Small Language Models Learn Small Molecular Grammar
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
A Mechanistic Analysis of Looped Reasoning Language Models
Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.
-
Training Large Language Models to Reason in a Continuous Latent Space
Coconut lets LLMs perform reasoning directly in continuous latent space by recycling hidden states as inputs, outperforming standard chain-of-thought on search-intensive logical tasks with better accuracy-efficiency t...
-
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
-
Sparse Layers are Critical to Scaling Looped Language Models
Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
-
Factorized Latent Reasoning for LLM-based Recommendation
FLR factorizes latent reasoning into multiple preference factors using multi-factor attention and regularizations, outperforming baselines on recommendation benchmarks while adding robustness and interpretability.
-
MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.
-
The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
-
Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
LEPO: Latent Reasoning Policy Optimization for Large Language Models
LEPO applies RL to stochastic latent representations in LLMs via Gumbel-Softmax to support diverse reasoning paths and unified optimization.
-
C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outp...
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
SeLaR: Selective Latent Reasoning in Large Language Models
SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
-
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.
-
Dream 7B: Diffusion Large Language Models
Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...
-
Reasoning Primitives in Hybrid and Non-Hybrid LLMs
Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.
-
Hyperloop Transformers
Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
-
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Reference graph
Works this paper leans on
-
[1]
Samira Abnar, Omid Saremi, Laurent Dinh, Shantel Wilson, Miguel Angel Bautista, Chen Huang, Vimal Thilak, Etai Littwin, Jiatao Gu, Josh Susskind, and Samy Bengio. 2023. https://doi.org/10.48550/arXiv.2310.08866 Adaptivity and Modularity for Efficient Generalization Over Task Complexity . arxiv:2310.08866[cs]
-
[2]
AI2. 2024. https://blog.allenai.org/olmo-1-7-7b-a-24-point-improvement-on-mmlu-92b43f7d269d OLMo 1.7-7B: A 24 point improvement on MMLU
work page 2024
-
[3]
Zeyuan Allen-Zhu and Yuanzhi Li. 2024. Physics of language models: Part 3.1, knowledge storage and extraction. In Proceedings of the 41st International Conference on Machine Learning , volume 235 of ICML '24 , pages 1067--1077, Vienna, Austria. JMLR.org
work page 2024
-
[4]
S.-I. Amari. 1972. https://doi.org/10.1109/T-C.1972.223477 Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements . IEEE Transactions on Computers, C-21(11):1197--1206
-
[5]
AMD. 2021. https://www.amd.com/en/products/accelerators/instinct/mi200/mi250x.html AMD Instinct ™ MI250X Accelerators
work page 2021
- [6]
-
[7]
Brandon Amos and J. Zico Kolter. 2017. http://proceedings.mlr.press/v70/amos17a.html OptNet : Differentiable Optimization as a Layer in Neural Networks . In International Conference on Machine Learning , pages 136--145
work page 2017
-
[8]
Cem Anil, Ashwini Pokle, Kaiqu Liang, Johannes Treutlein, Yuhuai Wu, Shaojie Bai, J. Zico Kolter, and Roger Baker Grosse. 2022. https://openreview.net/forum?id=kgT6D7Z4Xv9 Path Independent Equilibrium Models Can Better Exploit Test-Time Computation . In Advances in Neural Information Processing Systems
work page 2022
-
[9]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and 1 others. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[10]
Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. 2023. https://openreview.net/forum?id=4WnqRR915j Llemma: An Open Language Model for Mathematics . In The Twelfth International Conference on Learning Representations
work page 2023
-
[11]
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. 2024. https://doi.org/10.48550/arXiv.2410.20672 Relaxed Recursive Transformers : Effective Parameter Sharing with Layer-wise LoRA
-
[12]
Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2019. https://arxiv.org/abs/1909.01377 Deep Equilibrium Models . In Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc
-
[13]
Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. 2022. https://openreview.net/forum?id=B0oHOwT5ENL Neural Deep Equilibrium Solvers . In International Conference on Learning Representations
work page 2022
-
[14]
Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. https://doi.org/10.48550/arXiv.2408.07055 LongWriter : Unleashing 10,000+ Word Generation from Long Context LLMs . arxiv:2408.07055[cs]
-
[15]
Andrea Banino, Jan Balaguer, and Charles Blundell. 2021. https://openreview.net/forum?id=1EuxRTe0WN PonderNet : Learning to Ponder . In 8th ICML Workshop on Automated Machine Learning ( AutoML )
work page 2021
-
[16]
Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, and Tom Goldstein. 2022. https://openreview.net/forum?id=PPjSKy40XUB End-to-end Algorithm Synthesis with Recurrent Networks : Extrapolation without Overthinking . In Advances in Neural Information Processing Systems
work page 2022
-
[17]
Heinz H. Bauschke, Sarah M. Moffat, and Xianfu Wang. 2011. https://arxiv.org/abs/1101.4688 Firmly nonexpansive mappings and maximally monotone operators: Correspondence and duality . arXiv:1101.4688 [math]
-
[18]
Jay Bear, Adam Prügel-Bennett, and Jonathon Hare. 2024. https://doi.org/10.48550/arXiv.2410.23451 Rethinking Deep Thinking: Stable Learning of Algorithms using Lipschitz Constraints. arxiv:2410.23451[cs]
-
[19]
Stas Bekman. 2023. https://github.com/stas00/ml-engineering Machine Learning Engineering Open Book . Stasosphere Online Inc
work page 2023
-
[20]
Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra . 2024. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus SmolLM-corpus
work page 2024
-
[21]
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal . 2023. https://doi.org/10.48550/arXiv.2304.01373 Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling ...
-
[22]
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, and 11 others. 2024. Lessons from the trenches on reproducible evaluation of language models. https://doi.org/10.48550/arXiv....
-
[23]
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence
work page 2020
-
[24]
Malik Boudiaf, Romain Mueller, Ismail Ben Ayed, and Luca Bertinetto. 2022. https://openaccess.thecvf.com/content/CVPR2022/html/Boudiaf_Parameter-Free_Online_Test-Time_Adaptation_CVPR_2022_paper.html Parameter-Free Online Test-Time Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8344--8353
work page 2022
-
[25]
Valentino Braitenberg. 1986. Vehicles: Experiments in Synthetic Psychology . MIT press
work page 1986
-
[26]
William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan Kelly. 2024. https://doi.org/10.48550/arXiv.2405.12981 Reducing Transformer Key-Value Cache Size with Cross-Layer Attention . arxiv:2405.12981[cs]
-
[27]
British Library Labs . 2021. https://doi.org/10.23636/r7w6-zy15 Digitised Books . c. 1510 - c. 1900. JSONL ( OCR Derived Text + Metadata) . British Library
-
[28]
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. https://openreview.net/forum?id=PEpbUobfJv Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads . In Forty-First International Conference on Machine Learning
work page 2024
-
[29]
character.ai . 2024. https://research.character.ai/optimizing-inference/ Optimizing AI Inference at Character . AI
work page 2024
-
[30]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, and 39 others. 2021. https://arxiv.org/abs/2107.03374 Evaluating large lang...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[31]
Jeffrey Cheng and Benjamin Van Durme. 2024. https://doi.org/10.48550/arXiv.2412.13171 Compressed Chain of Thought : Efficient Reasoning Through Dense Representations . arxiv:2412.13171[cs]
-
[32]
Euirim Choi. 2023. https://www.github.com/euirim/goodwiki GoodWiki dataset
work page 2023
-
[33]
François Chollet. 2019. https://doi.org/10.48550/arXiv.1911.01547 On the Measure of Intelligence. arxiv:1911.01547[cs]
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1911.01547 2019
-
[34]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, and 48 others. 2022. https://arxiv.org/abs/2204.02311 PaLM :...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[35]
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[36]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://doi.org/10.48550/arXiv.2110.14168 Training Verifiers to Solve Math Word Problems . arxiv:2110.14168[cs]
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2110.14168 2021
-
[37]
Owen Colegrove, Vik Paruchuri, and OpenPhi-Team . 2024. https://huggingface.co/datasets/open-phi/textbooks Open-phi/textbooks Datasets at Hugging Face
work page 2024
-
[38]
Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, and Christopher D. Manning. 2024. https://openreview.net/forum?id=ZxVrkm7Bjl&noteId=xzoi2mTLOI MoEUT: Mixture-of-Experts Universal Transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems
work page 2024
-
[39]
Gautier Dagan. 2024. https://github.com/gautierdag/bpeasy Bpeasy
work page 2024
- [40]
-
[41]
Tri Dao. 2023. https://doi.org/10.48550/arXiv.2307.08691 FlashAttention-2 : Faster Attention with Better Parallelism and Work Partitioning . arxiv:2307.08691[cs]
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.08691 2023
-
[42]
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. https://doi.org/10.48550/arXiv.2205.14135 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arxiv:2205.14135[cs]
work page internal anchor Pith review doi:10.48550/arxiv.2205.14135 2022
-
[43]
DeepSeek-AI , Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 181 others. 2025. https://doi.org/10.48550/arXiv.2501.12948 DeepSeek-R1 : Incentivizing Reasoning Capability in LLMs via Reinfo...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
-
[44]
DeepSeek-AI , Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, and 181 others. 2024. https://doi.org/10.48550/arXiv.2412.19437 DeepSeek-V3 Technical Report . arxiv:2412.19437[cs]
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437 2024
-
[45]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2019. https://doi.org/10.48550/arXiv.1807.03819 Universal Transformers. arxiv:1807.03819[cs, stat]
work page internal anchor Pith review doi:10.48550/arxiv.1807.03819 2019
-
[46]
Yuntian Deng, Yejin Choi, and Stuart Shieber. 2024. https://doi.org/10.48550/arXiv.2405.14838 From Explicit CoT to Implicit CoT : Learning to Internalize CoT Step by Step . arxiv:2405.14838[cs]
-
[47]
Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto. 2024. https://openreview.net/forum?id=kRxCDDFNpp Fewer Truncations Improve Language Modeling . In Forty-First International Conference on Machine Learning
work page 2024
-
[48]
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. 2021. https://proceedings.neurips.cc/paper/2021/hash/a4d92e2cd541fca87e4620aba658316d-Abstract.html CogView : Mastering Text-to-Image Generation via Transformers . In Advances in Neural Information Processing Systems , volume 34...
work page 2021
-
[49]
Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2019. https://openreview.net/forum?id=SJg7KhVKPH Depth- Adaptive Transformer . In International Conference on Learning Representations
work page 2019
-
[50]
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A. Aly, Beidi Chen, and Carole-Jean Wu. 2024. https://doi.org/10.48550/arXiv.2404.16710 LayerSkip : Enabling Early Exit Inference and Self-Speculative Decoding . arxiv:2404.16710[cs]
-
[51]
Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein , Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington. 2024. https://doi.org/10.48550/arXiv.2407.05872 Scaling Exponents Across Parameterizations and Optimizers . arxiv:2407.05872[cs]
-
[52]
Angela Fan, Edouard Grave, and Armand Joulin. 2019. https://doi.org/10.48550/arXiv.1909.11556 Reducing Transformer Depth on Demand with Structured Dropout . arxiv:1909.11556[cs, stat]
-
[53]
Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. 2021. https://doi.org/10.48550/arXiv.2002.09402 Addressing Some Limitations of Transformers with Feedback Memory . arxiv:2002.09402[cs, stat]
-
[54]
Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. 2025. https://openreview.net/forum?id=2edigk8yoU Looped Transformers for Length Generalization . In The Thirteenth International Conference on Learning Representations
work page 2025
-
[55]
William Fedus, Barret Zoph, and Noam Shazeer. 2022. https://doi.org/10.48550/arXiv.2101.03961 Switch Transformers : Scaling to Trillion Parameter Models with Simple and Efficient Sparsity . arxiv:2101.03961[cs]
work page internal anchor Pith review doi:10.48550/arxiv.2101.03961 2022
-
[56]
Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/16b14e3f288f076e0ca73bdad6405f77-Abstract-Datasets_and_Benchmarks.html ChessGPT: Bridging Policy Learning and Language Modeling. Advances in Neural Information Processing Syst...
work page 2023
-
[57]
Sebastian Gabarain. 2024. https://huggingface.co/datasets/Locutusque/hercules-v5.0 Locutusque/hercules-v5.0 Datasets at Hugging Face
work page 2024
-
[58]
Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, and Sanjiv Kumar. 2024. https://doi.org/10.48550/arXiv.2410.08292 Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning ?
-
[59]
Jonas Geiping and Tom Goldstein. 2023. https://proceedings.mlr.press/v202/geiping23a.html Cramming: Training a Language Model on a single GPU in one day. In Proceedings of the 40th International Conference on Machine Learning , pages 11117--11143. PMLR
- [60]
-
[61]
F. A. Gers and J. Schmidhuber. 2000. Recurrent Nets That Time and Count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000): Neural Computing: New Challenges and Perspectives for the New Millennium, volume 3, pages 189--194. https://doi.org/10.1109/IJCNN.2000.861302
-
[62]
Angeliki Giannou, Shashank Rajput, Jy-Yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. 2023. Looped Transformers as Programmable Computers. In Proceedings of the 40th International Conference on Machine Learning, pages 11398--11442. PMLR. https://proceedings.mlr.press/v202/giannou23a.html
-
[63]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2018. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677 [cs]. https://arxiv.org/abs/1706.02677
-
[64]
Alex Graves. 2017. Adaptive Computation Time for Recurrent Neural Networks. arXiv:1603.08983 [cs]. https://doi.org/10.48550/arXiv.1603.08983
-
[65]
Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. arXiv:1410.5401 [cs]. https://arxiv.org/abs/1410.5401
-
[66]
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024. OLMo: Accelerating the Science of Language Models. arXiv:2402.00838. https://doi.org/10.48550/arXiv.2402.00838
-
[67]
Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, and Martin Jaggi. 2024. Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. In Workshop on Efficient Systems for Foundation Models II @ ICML2024. https://openreview.net/forum?id=ompl7supoX
-
[68]
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024. Training Large Language Models to Reason in a Continuous Latent Space. arXiv:2412.06769 [cs]. https://doi.org/10.48550/arXiv.2412.06769
-
[69]
Tamir David Hay and Lior Wolf. 2023. Dynamic Layer Tying for Parameter-Efficient Transformers. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=d4uL2MSe0z
-
[70]
Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, Dmitry Krotov, and Rogerio Feris. 2024. CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory. arXiv:2402.13449 [cs]. https://doi.org/10.48550/arXiv.2402.13449
-
[71]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR)
-
[72]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations. https://openreview.net/forum?id=d7KBjmI3GmQ
-
[73]
J. J. Hopfield. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8):2554--2558. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC346238/
-
[74]
Jiewen Hu, Thomas Zhu, and Sean Welleck. 2024. miniCTX: Neural Theorem Proving with (Long-)Contexts. arXiv:2408.03350 [cs]. https://doi.org/10.48550/arXiv.2408.03350
-
[75]
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018). http://www.scopus.com/inward/record.url?scp=85059432227&partnerID=8YFLogxK
-
[76]
Albert Q. Jiang, Wenda Li, and Mateja Jamnik. 2023. Multilingual Mathematical Autoformalization. arXiv:2311.03755 [cs]. https://doi.org/10.48550/arXiv.2311.03755
-
[77]
Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions
-
[78]
Jean Kaddour. 2022. Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging. arXiv:2209.14981 [cs, stat]. https://doi.org/10.48550/arXiv.2209.14981
-
[79]
Guy Kaplan, Matanel Oren, Yuval Reif, and Roy Schwartz. 2024. From Tokens to Words: On the Inner Lexicon of LLMs. arXiv:2410.05864 [cs]. https://doi.org/10.48550/arXiv.2410.05864
-
[80]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs, stat]. https://doi.org/10.48550/arXiv.2001.08361