Adaptive Computation Time for Recurrent Neural Networks
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neural networks to learn how many computational steps to take between receiving an input and emitting an output. ACT requires minimal changes to the network architecture, is deterministic and differentiable, and does not add any noise to the parameter gradients. Experimental results are provided for four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers. Overall, performance is dramatically improved by the use of ACT, which successfully adapts the number of computational steps to the requirements of the problem. We also present character-level language modelling results on the Hutter prize Wikipedia dataset. In this case ACT does not yield large gains in performance; however it does provide intriguing insight into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences. This suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.
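The halting rule behind the "how many computational steps" idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the network has already produced a halting activation in (0, 1) for each intermediate step, and the names act_halting, weighted_mean, eps and max_steps are hypothetical. Pondering stops at the first step where the cumulative halting activation reaches 1 - eps; the final step receives the remainder so the weights sum to 1, and the ponder cost is the step count plus that remainder.

```python
def act_halting(halting_probs, eps=0.01, max_steps=100):
    """Return (weights, ponder_cost) for one input step.

    halting_probs: per-step halting activations in (0, 1), assumed to be
    computed elsewhere (e.g. by a sigmoid unit on the hidden state).
    """
    if not halting_probs:
        raise ValueError("halting_probs must be non-empty")
    limit = min(len(halting_probs), max_steps)
    weights = []
    cumulative = 0.0
    for n in range(1, limit + 1):
        h = halting_probs[n - 1]
        # Halt when the running sum reaches 1 - eps, or at the step limit.
        if cumulative + h >= 1.0 - eps or n == limit:
            remainder = 1.0 - cumulative
            weights.append(remainder)
            return weights, n + remainder
        weights.append(h)
        cumulative += h


def weighted_mean(states, weights):
    """Combine intermediate states (lists of floats) using the halting weights."""
    dim = len(states[0])
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]


if __name__ == "__main__":
    weights, cost = act_halting([0.3, 0.5, 0.4, 0.9])
    print(weights, cost)
    print(weighted_mean([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], weights))
```

With activations 0.3, 0.5, 0.4, 0.9 the sketch halts at the third step, with weights of roughly 0.3, 0.5 and 0.2 and a ponder cost of about 3.2. Because the weights are built from the halting activations themselves rather than from sampled decisions, the combined state stays deterministic and differentiable with respect to them, which matches the property the abstract emphasizes.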
citing papers explorer
-
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
-
Stability and Generalization in Looped Transformers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.
-
CanViT: Toward Active-Vision Foundation Models
CanViT is the first task- and policy-agnostic active-vision foundation model (AVFM), pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
-
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
-
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
-
Neural Ordinary Differential Equations
Neural networks are redefined as continuous dynamical systems by learning the derivative of the hidden state with a neural network and integrating it with an ODE solver.
-
An Attribute-Based Measure of Video Complexity
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
-
Training-Free Looped Transformers
Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
-
Interaction Locality in Hierarchical Recursive Reasoning
Interaction locality is introduced as a task-geometry-aware measurement framework showing that high-level states in recursive models write locally while recursive updates build broader structures on maze, Sudoku, ARC-AGI, and 3D grounding tasks.
-
LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapting to non-stationary dynamics.
-
Muninn: Your Trajectory Diffusion Model But Faster
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
-
LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
-
Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras
A single attention-based model trained on synthetic wide-baseline event data achieves zero-shot feature matching across unseen datasets with a reported 37.7% improvement over prior event matching methods.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
A Mechanistic Analysis of Looped Reasoning Language Models
Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Massive Activations in Large Language Models
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
-
Looped World Models
Introduces looped transformer architectures for world models that iteratively refine latent states to achieve up to 100x parameter efficiency via adaptive computation depth.
-
TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics
TIDE is a neuro-inspired architecture using stabilized asymmetric E-I networks with lateral inhibition and 80:20 balance that trains in under half the time of CTM while gaining +1.65% top-1 accuracy on perturbed ImageNet.
-
Generative Recursive Reasoning
GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
-
One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer
In a minimal two-state recurrent Transformer, asymmetric input injection induces stable specialization where one state becomes a committed proposal and the other retains shifting uncertainty.
-
N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
-
Gated Subspace Inference for Transformer Acceleration
Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
-
LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.
-
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.
-
Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
RIC replaces single-pass label imitation with RL-driven iterative belief refinement, recovering cross-entropy optima while enabling adaptive halting via a value function.
-
Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
-
Dispatch-Aware Ragged Attention for Pruned Vision Transformers
A new Triton kernel for dispatch-aware ragged attention delivers 1.88-2.51× end-to-end throughput gains over standard padded attention and 9-12% over FlashAttention-2 varlen in pruned ViTs by lowering dispatch floor to ~24μs.
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
Relational Preference Encoding in Looped Transformer Internal States
Looped transformer hidden states encode preferences relationally via pairwise differences rather than independent pointwise classification, with the evaluator acting as an internal consistency probe on the model's own value system.
-
LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.
-
When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models
AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.
-
Entropy After </Think> for reasoning model early exiting
Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.
-
Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.
-
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Mixture-of-Depths enables transformers to dynamically allocate compute by routing only the top-k tokens through each layer's full computations, matching baseline performance with a fraction of the FLOPs per forward pass and up to 50% faster sampling.
-
Solving math word problems with process- and outcome-based feedback
On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities that are present in large language models but absent in smaller ones, and that cannot be predicted by extrapolating the performance of smaller models.
-
PRISM: Progressive Reasoning through Iterative Slot Memory for Vision
PRISM is a pyramid vision architecture using iterative slot memory for progressive reasoning that reports competitive performance on classification, detection, and segmentation with improved robustness to occlusions.
-
Probabilistic Tiny Recursive Model
PTRM adds stochastic Gaussian noise to Tiny Recursive Model recursion for parallel trajectory exploration and Q-head selection, raising Sudoku-Extreme accuracy from 87.4% to 98.75% and Pencil Puzzle Bench from 62.6% to 91.2% without retraining.
-
Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking
A three-stage ViT with sparsity-aware MoE and adaptive inference depth delivers improved accuracy-efficiency trade-off for event-stream visual tracking on FE240hz, COESOT, and EventVOT benchmarks.
-
Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers
A monotone head-gating mechanism conditions transformer attention on a budget, enabling one checkpoint to trade attention cost for accuracy and produce measured CPU speedups.
-
Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth routing gates, and eliminates the advantage of a more complex JEPA-guided gate over a simple MLP gate.
-
Adaptive Computation Depth via Learned Token Routing in Transformers
TSA adds end-to-end differentiable per-token halting gates to transformers, enabling learned adaptive depth that saves 14-23% token-layer operations with under 0.5% quality loss on language modeling.
-
Hierarchical Reasoning Model
HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence: Extending the Recurrent-Depth Transformer Architecture to Dense Prediction
RD-ViT matches or exceeds standard ViT segmentation accuracy on cardiac MRI using a shared recurrent block, fewer parameters, and less training data.
-
ITS-Mina: A Harris Hawks Optimization-Based All-MLP Framework with Iterative Refinement and External Attention for Multivariate Time Series Forecasting
ITS-Mina introduces an all-MLP model with iterative refinement, external attention via learnable memory units, and HHO-tuned dropout that reports state-of-the-art or competitive results on six multivariate time series benchmarks versus eleven baselines.