pith. machine review for the scientific record.

arxiv: 2510.04871 · v1 · submitted 2025-10-06 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Less is More: Recursive Reasoning with Tiny Networks


Pith reviewed 2026-05-15 04:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords tiny recursive model · recursive reasoning · ARC-AGI · parameter efficiency · small neural networks · puzzle solving · generalization · iterative refinement

The pith

A two-layer recursive network with 7 million parameters reaches 45 percent accuracy on ARC-AGI-1, surpassing most large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Tiny Recursive Model as a minimal recursive approach to reasoning. TRM applies one small two-layer network repeatedly to refine its answer on each input. Trained on roughly one thousand examples, it records 45 percent test accuracy on ARC-AGI-1 and 8 percent on ARC-AGI-2. These scores exceed those reported for several much larger models while using far fewer parameters. The result indicates that iterative refinement by a tiny network can substitute for scale on certain hard puzzle tasks.

Core claim

TRM consists of a single tiny neural network with only two layers that recurses on the current state of a puzzle to produce successive refinements until a solution emerges. When trained on approximately one thousand examples, this model reaches 45 percent accuracy on the ARC-AGI-1 test set and 8 percent on ARC-AGI-2, outperforming the earlier Hierarchical Reasoning Model and most cited large language models that contain thousands of times more parameters.

What carries the argument

The Tiny Recursive Model (TRM), a single two-layer network that iterates refinement steps on the input through repeated application.
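The recursion can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the hidden width, the tanh activations, and the exact layout of the answer and latent states are assumptions; only the step counts n = 6 and T = 3 are taken from the paper's reported setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and step counts; n = 6, T = 3 match the setup reported for TRM,
# while the width, activations, and state layout are assumptions for this sketch.
D = 32          # hidden width (hypothetical)
n, T = 6, 3     # latent refinements per cycle, and number of cycles

# One tiny two-layer network whose weights are shared across every step.
W1 = rng.normal(0.0, 0.1, (3 * D, D))
W2 = rng.normal(0.0, 0.1, (D, D))

def step(x, y, z):
    """Two-layer MLP applied to the concatenated (input, answer, latent) state."""
    h = np.concatenate([x, y, z])
    return np.tanh(np.tanh(h @ W1) @ W2)

def trm_forward(x):
    y = np.zeros(D)           # current answer embedding
    z = np.zeros(D)           # latent reasoning state
    for _ in range(T):        # T outer cycles
        for _ in range(n):    # n latent refinements with the same network
            z = step(x, y, z)
        y = step(x, y, z)     # then one answer refinement, again with the same network
    return y

y = trm_forward(rng.normal(0.0, 1.0, D))
print(y.shape)  # (32,)
```

The point the sketch makes is structural: depth comes from reapplying the same small function, so the parameter count stays fixed no matter how many refinement steps run.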

If this is right

  • Recursive iteration on a fixed small network produces measurable gains on visual reasoning benchmarks without added parameters.
  • Training sets of roughly one thousand examples suffice for nontrivial generalization on ARC-style tasks when recursion is used.
  • A single-network recursive design can exceed the performance of an earlier two-network hierarchical design on the same puzzles.
  • High accuracy on Sudoku, Maze, and ARC-AGI is achievable with total model size under 10 million parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recursive loop could be tested on other step-by-step tasks such as theorem proving or program synthesis.
  • If recursion depth can be learned or scheduled, further reductions in required parameters may be possible.
  • The approach invites direct comparisons of iteration count versus parameter count across a wider range of benchmarks.

Load-bearing premise

The accuracy numbers obtained by TRM can be compared directly to the accuracies reported for the much larger language models despite differences in training data and evaluation protocols.

What would settle it

Re-evaluate the trained TRM on a fresh ARC-AGI test split whose puzzle distributions differ markedly from the original training set and check whether accuracy on ARC-AGI-1 falls below 25 percent.
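The test described above reduces to exact-match scoring on a disjoint split. A minimal harness might look like the following, with toy grids standing in for real ARC-AGI puzzles and the 25 percent bar applied at the end:

```python
import numpy as np

def exact_match_accuracy(predictions, targets):
    """ARC-style scoring: a puzzle counts only if every cell of the output grid matches."""
    hits = [np.array_equal(p, t) for p, t in zip(predictions, targets)]
    return sum(hits) / len(hits)

# Toy stand-in grids; a real check would load the held-out ARC-AGI split in question.
targets = [np.array([[1, 2], [3, 4]]), np.array([[0, 0], [1, 1]])]
predictions = [np.array([[1, 2], [3, 4]]), np.array([[0, 1], [1, 1]])]

acc = exact_match_accuracy(predictions, targets)
print(acc)                                      # 0.5 on the toy grids
print("falsified" if acc < 0.25 else "holds")   # the 25 percent bar described above
```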

read the original abstract

Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Tiny Recursive Model (TRM), a simplified single-network recursive reasoning architecture with only 2 layers and 7M parameters. It claims TRM, trained on around 1000 examples, reaches 45% test accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, outperforming most LLMs (Deepseek R1, o3-mini, Gemini 2.5 Pro) while using <0.01% of their parameters, and improves upon the prior Hierarchical Reasoning Model (HRM) for Sudoku, Maze, and ARC-AGI tasks.

Significance. If the empirical claims are verified with rigorous controls, the result would be significant: it would show that biologically inspired recursive reasoning in tiny networks can deliver strong generalization on hard puzzle benchmarks using orders-of-magnitude less data and compute than LLMs, offering a concrete counter-example to pure scaling and opening avenues for efficient, interpretable reasoning systems.

major comments (2)
  1. Abstract: the central performance claims (45% ARC-AGI-1, 8% ARC-AGI-2) are stated without any description of the train/eval/test partitioning, whether the ~1000 training examples are strictly disjoint from the reported test sets, or confirmation of no leakage from ARC training tasks; this directly undermines the generalization interpretation relative to zero-shot LLM baselines.
  2. Abstract and §1: the comparison to LLMs (o3-mini, Gemini 2.5 Pro, etc.) is presented as direct superiority, yet TRM receives task-specific gradient updates on ~1000 examples while the cited LLMs are evaluated zero- or few-shot; no section clarifies that the evaluation regimes are equivalent, making the parameter-efficiency claim load-bearing but currently unsupported.
minor comments (1)
  1. Abstract: the statement 'higher than most LLMs' should be qualified with the exact subset of models and conditions under which the comparison holds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on evaluation clarity. We have revised the manuscript to explicitly describe the data partitioning, confirm disjoint splits, and distinguish the training regimes from LLM baselines.

read point-by-point responses
  1. Referee: Abstract: the central performance claims (45% ARC-AGI-1, 8% ARC-AGI-2) are stated without any description of the train/eval/test partitioning, whether the ~1000 training examples are strictly disjoint from the reported test sets, or confirmation of no leakage from ARC training tasks; this directly undermines the generalization interpretation relative to zero-shot LLM baselines.

    Authors: We agree that the original abstract omitted these details. In the revised version we have updated the abstract and added a dedicated paragraph in Section 3 to state: TRM is trained on approximately 1000 examples drawn from the public ARC-AGI training tasks and evaluated on the official test set, which consists of entirely disjoint tasks never seen during training. We explicitly confirm no leakage occurs because the test tasks are held out and the model never accesses ARC test data or private splits during any stage of training or validation. revision: yes

  2. Referee: Abstract and §1: the comparison to LLMs (o3-mini, Gemini 2.5 Pro, etc.) is presented as direct superiority, yet TRM receives task-specific gradient updates on ~1000 examples while the cited LLMs are evaluated zero- or few-shot; no section clarifies that the evaluation regimes are equivalent, making the parameter-efficiency claim load-bearing but currently unsupported.

    Authors: We thank the referee for noting the regime difference. Our comparison is deliberately between a task-specifically trained tiny model and zero/few-shot LLMs to illustrate parameter and data efficiency. We have revised the abstract and Section 1 to explicitly state that TRM receives gradient updates on ~1000 task-specific examples while the cited LLMs are evaluated without any ARC-AGI fine-tuning. This clarification makes the efficiency claim precise rather than claiming identical protocols; the result still shows that a 7 M-parameter model trained on limited data can exceed the performance of much larger models used in their standard inference setting. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical performance claims

full rationale

The paper reports test accuracies for the proposed TRM architecture on ARC-AGI benchmarks. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or self-referential equations are present. Central results are direct measurements of generalization on held-out tasks rather than reductions to inputs by construction. Self-citations (if any) to prior HRM work are not load-bearing for any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on the abstract; no mathematical derivations, free parameters, or axioms are specified.

pith-pipeline@v0.9.0 · 5464 in / 1058 out tokens · 57015 ms · 2026-05-15T04:49:25.470167+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stability and Generalization in Looped Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...

  2. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  3. A Mechanistic Analysis of Looped Reasoning Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

  4. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  5. The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

    cs.CV 2026-04 unverdicted novelty 6.0

    A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.

  6. Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

  7. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  8. LASER: Low-Rank Activation SVD for Efficient Recursion

    cs.LG 2026-04 unverdicted novelty 6.0

    LASER tracks low-rank activation subspaces in recursive models via matrix-free SVD updates and fidelity resets to save 60% memory without accuracy loss.

  9. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  10. Querying Structured Data Through Natural Language Using Language Models

    cs.CL 2026-04 conditional novelty 6.0

    Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.

  11. Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling

    cs.LG 2026-04 unverdicted novelty 6.0

    Fast-slow recurrence interleaves quick latent updates with slow observation processing to maintain coherent clustered representations over long horizons, improving out-of-distribution generalization versus LSTM, state...

  12. bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

    cs.CV 2026-05 unverdicted novelty 5.0

    A 12-step single-block recurrent ViT-B reaches accuracy comparable to a standard ViT-B on ImageNet-1K while using an order of magnitude fewer parameters.

  13. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  14. State Representation and Termination for Recursive Reasoning Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping ...

  15. Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency

    cs.LG 2026-04 unverdicted novelty 5.0

    KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.

  16. Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

    cs.LG 2026-05 unverdicted novelty 4.0

    OpMech defines the order-gap between consolidation and expansion operators as a real-time, trajectory-based signal for convergence and principled stopping in adaptive learning.

  17. Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

    cs.LG 2026-05 unverdicted novelty 4.0

    OpMech defines the order-gap as a computable non-commutativity measure between consolidation and expansion operators to provide real-time convergence signals and stopping rules in adaptive learning.

  18. Measuring AI Reasoning: A Guide for Researchers

    cs.AI 2026-05 unverdicted novelty 4.0

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

  19. Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

    cs.CL 2026-04 unverdicted novelty 4.0

    Dual-Track CoT lets small language models perform reliable multi-step reasoning with the same or fewer tokens via budget tracking and rejection of redundant steps.

  20. LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems

    cs.AI 2026-04 unverdicted novelty 4.0

    LIFE is a proposed agentic framework that combines four components to enable incremental, flexible, and energy-efficient continual learning for HPC operations such as latency spike mitigation.

  21. S-AI-Recursive: A Bio-Inspired and Temporal Sparse AI Architecture for Iterative, Introspective, and Energy-Frugal Reasoning

    cs.NE 2026-05 unverdicted novelty 3.0

    S-AI-Recursive operationalizes reasoning as a closed-loop hormonal iteration with Clarifine and Confusionin to reach stable equilibrium, achieving competitive benchmark performance with under 10 million parameters via...

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 20 Pith papers · 11 internal anchors

  1. [1]

    The Hidden Drivers of HRM’s Performance on ARC-AGI

    ARC Prize Foundation. The Hidden Drivers of HRM's Performance on ARC-AGI. https://arcprize.org/blog/hrm-analysis, 2025a. [Online; accessed 2025-09-15]. ARC Prize Foundation. ARC-AGI Leaderboard. https://arcprize.org/leaderboard, 2025b. [Online; accessed 2025-09-24]. Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. Advances in Neural Information Processing Systems, 2019.

  2. [2]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.

  3. [3]

    On the Measure of Intelligence

    Chollet, F. On the measure of intelligence. arXiv preprint arXiv:1911.01547.

  4. [4]

    ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

    Chollet, F., Knoop, M., Kamradt, G., Landers, B., and Pinkard, H. ARC-AGI-2: A new challenge for frontier AI reasoning systems. arXiv preprint arXiv:2505.11831.

  5. [5]

    TorchDEQ: A Library for Deep Equilibrium Models

    Geng, Z. and Kolter, J. Z. TorchDEQ: A library for deep equilibrium models. arXiv preprint arXiv:2310.18605.

  6. [6]

    Gaussian Error Linear Units (GELUs)

    Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

  7. [7]

    Hierarchical graph generation with k2-trees

    Jang, Y., Kim, D., and Ahn, S. Hierarchical graph generation with K2-trees. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling.

  8. [8]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  9. [9]

    Adam: A Method for Stochastic Optimization

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  10. [10]

    Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

    Lehnert, L., Sukhbaatar, S., Su, D., Zheng, Q., Mcvay, P., Rabbat, M., and Tian, Y. Beyond A*: Better planning with transformers via search dynamics bootstrapping. arXiv preprint arXiv:2402.14083.

  11. [11]

    Decoupled Weight Decay Regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

  12. [12]

    The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain

    Moskvichev, A., Odouard, V. V., and Mitchell, M. The ConceptARC benchmark: Evaluating understanding and generalization in the ARC domain. arXiv preprint arXiv:2305.07141.

  13. [13]

    Grokking at the Edge of Numerical Stability

    Prieto, L., Barsbey, M., Mediano, P. A., and Birdal, T. Grokking at the edge of numerical stability. arXiv preprint arXiv:2501.04697.

  14. [14]

    GLU Variants Improve Transformer

    Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.

  15. [15]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

  16. [16]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

  17. [17]

    Hierarchical Reasoning Model

    Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., and Yadkori, Y. A. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734.

  18. [18]

    Hyper-parameters and setup. All models are trained with the AdamW optimizer (Loshchilov & Hutter, 2017; Kingma & Ba,

  19. [19]

    TRM uses an Exponential Moving Average (EMA) of 0.999

    for improved stability. TRM uses an Exponential Moving Average (EMA) of 0.999. HRM uses n = 2, T = 2 with two 4-layer networks, while we use n = 6, T = 3 with one 2-layer network. For Sudoku-Extreme and Maze-Hard, we train for 60k epochs with learning rate 1e-4 and weight decay 1.0. For ARC-AGI, we train for 100K epochs with learning rate 1e-4 (with 1e-2 learnin...

  20. [20]

    This would provide a better justification for the 1-step gradient approximation

    to replace the recursion steps by fixed-point iteration as done by Deep Equilibrium Models (Bai et al., 2019). This would provide a better justification for the 1-step gradient approximation. However, this slowed down training due to the fixed-point iteration and led to worse generalization. This highlights the fact that converging to a fixed-point is not...
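The fixed-point alternative this excerpt describes can be illustrated generically: iterate an update map until the state stops changing, as Deep Equilibrium Models do, rather than unrolling a fixed number of recursion steps. The update function below is a stand-in contraction, not the paper's network.

```python
import numpy as np

def refine(z, x):
    """A contractive stand-in update; in the paper the update is the tiny network."""
    return 0.5 * np.tanh(z) + x

def fixed_point(x, tol=1e-8, max_iter=200):
    """Iterate to (approximate) convergence, as Deep Equilibrium Models do."""
    z = np.zeros_like(x)
    for i in range(max_iter):
        z_next = refine(z, x)
        if np.max(np.abs(z_next - z)) < tol:
            return z_next, i + 1
        z = z_next
    return z, max_iter

x = np.array([0.3, -0.1])
z_star, steps = fixed_point(x)
# At convergence the state is (numerically) unchanged by one more refinement.
print(np.allclose(refine(z_star, x), z_star, atol=1e-6))  # True
```

The contrast with TRM is that the loop count here is data-dependent, which is exactly what the excerpt reports as slower to train and worse for generalization than a fixed, small number of recursion steps.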