Adaptive Computation Time for Recurrent Neural Networks

Alex Graves · 2016 · cs.NE · arXiv 1603.08983

31 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neural networks to learn how many computational steps to take between receiving an input and emitting an output. ACT requires minimal changes to the network architecture, is deterministic and differentiable, and does not add any noise to the parameter gradients. Experimental results are provided for four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers. Overall, performance is dramatically improved by the use of ACT, which successfully adapts the number of computational steps to the requirements of the problem. We also present character-level language modelling results on the Hutter prize Wikipedia dataset. In this case ACT does not yield large gains in performance; however it does provide intriguing insight into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences. This suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.
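The abstract describes ACT as deterministic and differentiable but does not spell out the halting rule. The following is a minimal sketch of that rule, assuming the formulation in the paper body: each ponder step emits a sigmoid halting activation, computation stops once the running sum reaches 1 - eps, and the final step receives the leftover probability mass (the remainder). The function names, the `eps` default, and treating the end of the input list as a hard step cap are illustrative assumptions, not the author's code.

```python
def act_halting_weights(halting_probs, eps=0.01):
    """Turn per-step halting activations into ACT-style mixture weights.

    halting_probs: sigmoid outputs in (0, 1), one per candidate ponder step.
                   The list length acts as a hypothetical cap on step count.
    eps:           small slack so a single step can be enough to halt.
    Returns (weights, remainder); len(weights) is the number of steps taken.
    """
    if not halting_probs:
        raise ValueError("need at least one halting activation")
    weights, cumulative = [], 0.0
    last = len(halting_probs) - 1
    for n, h in enumerate(halting_probs):
        # Halt when the running sum would reach 1 - eps, or at the step cap.
        if cumulative + h >= 1.0 - eps or n == last:
            remainder = 1.0 - cumulative  # leftover mass goes to the final step
            weights.append(remainder)
            return weights, remainder
        weights.append(h)
        cumulative += h


def weighted_mean(values, weights):
    """Combine per-step outputs (or states) using the halting weights."""
    return sum(w * v for w, v in zip(weights, values))


# Illustrative activations: the third step pushes the running sum past 1 - eps.
weights, remainder = act_halting_weights([0.1, 0.3, 0.7, 0.9])
# Ponder cost, assumed here to be steps taken plus remainder.
ponder_cost = len(weights) + remainder
```

Because the weights always sum to one, the combined output is a convex mixture of the intermediate outputs. The remainder term is what lets a training loss penalize extra computation smoothly, which is consistent with the abstract's claim that no noise is added to the gradients.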

citation-role summary

background: 3 · method: 1

citation-polarity summary

background: 3 · use method: 1

representative citing papers

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

cs.LG · 2022-01-06 · unverdicted · novelty 8.0

Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapting to non-stationary dynamics.

Muninn: Your Trajectory Diffusion Model But Faster

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

cs.CL · 2026-05-10 · conditional · novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

cs.IR · 2026-04-21 · unverdicted · novelty 7.0

LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.

Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

A single attention-based model trained on synthetic wide-baseline event data achieves zero-shot feature matching across unseen datasets with a reported 37.7% improvement over prior event matching methods.

Depth Adaptive Efficient Visual Autoregressive Modeling

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

A Mechanistic Analysis of Looped Reasoning Language Models

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

Gated Subspace Inference for Transformer Acceleration

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.

LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.

State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.

Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

RIC replaces single-pass label imitation with RL-driven iterative belief refinement, recovering cross-entropy optima while enabling adaptive halting via a value function.

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

cs.LG · 2026-04-23 · conditional · novelty 6.0

Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

Dispatch-Aware Ragged Attention for Pruned Vision Transformers

cs.LG · 2026-04-16 · accept · novelty 6.0 · 2 refs

A new Triton kernel for dispatch-aware ragged attention delivers 1.88-2.51× end-to-end throughput gains over standard padded attention and 9-12% over FlashAttention-2 varlen in pruned ViTs by lowering dispatch floor to ~24μs.

Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.

Relational Preference Encoding in Looped Transformer Internal States

cs.LG · 2026-04-10 · conditional · novelty 6.0

Looped transformer hidden states encode preferences relationally via pairwise differences rather than independent pointwise classification, with the evaluator acting as an internal consistency probe on the model's own value system.

Emergent Abilities of Large Language Models

cs.CL · 2022-06-15 · unverdicted · novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

citing papers explorer

Showing 31 of 31 citing papers.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
cs.LG · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

Stability and Generalization in Looped Transformers
cs.LG · 2026-04-16 · unverdicted · none · ref 7 · internal anchor
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
cs.LG · 2022-01-06 · unverdicted · none · ref 4 · internal anchor
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

Show Your Work: Scratchpads for Intermediate Computation with Language Models
cs.LG · 2021-11-30 · unverdicted · none · ref 8 · internal anchor
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
cs.LG · 2026-05-11 · unverdicted · none · ref 92 · internal anchor
LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapting to non-stationary dynamics.

Muninn: Your Trajectory Diffusion Model But Faster
cs.RO · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
cs.AI · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
cs.CL · 2026-05-10 · conditional · none · ref 36 · internal anchor
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
cs.IR · 2026-04-21 · unverdicted · none · ref 6 · internal anchor
LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.

Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras
cs.CV · 2026-04-20 · unverdicted · none · ref 25 · internal anchor
A single attention-based model trained on synthetic wide-baseline event data achieves zero-shot feature matching across unseen datasets with a reported 37.7% improvement over prior event matching methods.

Depth Adaptive Efficient Visual Autoregressive Modeling
cs.CV · 2026-04-19 · unverdicted · none · ref 21 · internal anchor
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

A Mechanistic Analysis of Looped Reasoning Language Models
cs.LG · 2026-04-13 · unverdicted · none · ref 12 · internal anchor
Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG · 2025-02-07 · unverdicted · none · ref 64 · internal anchor
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
cs.LG · 2026-05-13 · unverdicted · none · ref 21 · internal anchor
N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

Elastic Attention Cores for Scalable Vision Transformers
cs.CV · 2026-05-12 · unverdicted · none · ref 40 · internal anchor
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

Gated Subspace Inference for Transformer Acceleration
cs.LG · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.

LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
cs.LG · 2026-05-01 · unverdicted · none · ref 19 · internal anchor
LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.

State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
cs.LG · 2026-04-30 · unverdicted · none · ref 47 · internal anchor
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.

Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
cs.LG · 2026-04-23 · unverdicted · none · ref 5 · internal anchor
RIC replaces single-pass label imitation with RL-driven iterative belief refinement, recovering cross-entropy optima while enabling adaptive halting via a value function.

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
cs.LG · 2026-04-23 · conditional · none · ref 7 · internal anchor
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

Dispatch-Aware Ragged Attention for Pruned Vision Transformers
cs.LG · 2026-04-16 · accept · none · ref 13 · 2 links · internal anchor
A new Triton kernel for dispatch-aware ragged attention delivers 1.88-2.51× end-to-end throughput gains over standard padded attention and 9-12% over FlashAttention-2 varlen in pruned ViTs by lowering dispatch floor to ~24μs.

Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
cs.LG · 2026-04-16 · unverdicted · none · ref 21 · internal anchor
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.

Relational Preference Encoding in Looped Transformer Internal States
cs.LG · 2026-04-10 · conditional · none · ref 6 · internal anchor
Looped transformer hidden states encode preferences relationally via pairwise differences rather than independent pointwise classification, with the evaluator acting as an internal consistency probe on the model's own value system.

Emergent Abilities of Large Language Models
cs.CL · 2022-06-15 · unverdicted · none · ref 30 · internal anchor
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking
cs.CV · 2026-05-07 · unverdicted · none · ref 39 · internal anchor
A three-stage ViT with sparsity-aware MoE and adaptive inference depth delivers improved accuracy-efficiency trade-off for event-stream visual tracking on FE240hz, COESOT, and EventVOT benchmarks.

Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers
cs.LG · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
A monotone head-gating mechanism conditions transformer attention on a budget, enabling one checkpoint to trade attention cost for accuracy and produce measured CPU speedups.

Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
cs.LG · 2026-04-19 · conditional · none · ref 3 · internal anchor
Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth routing gates, and eliminates the advantage of a more complex JEPA-guided gate over a simple MLP gate.

Adaptive Computation Depth via Learned Token Routing in Transformers
cs.LG · 2026-04-18 · unverdicted · none · ref 1 · internal anchor
TSA adds end-to-end differentiable per-token halting gates to transformers, enabling learned adaptive depth that saves 14-23% token-layer operations with under 0.5% quality loss on language modeling.

Galactica: A Large Language Model for Science
cs.CL · 2022-11-16 · unverdicted · none · ref 75 · 2 links · internal anchor
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence Extending the Recurrent-Depth Transformer Architecture to Dense Prediction
cs.CV · 2026-05-05 · unverdicted · none · ref 18 · internal anchor
RD-ViT matches or exceeds standard ViT segmentation accuracy on cardiac MRI using a shared recurrent block, fewer parameters, and less training data.

ITS-Mina: A Harris Hawks Optimization-Based All-MLP Framework with Iterative Refinement and External Attention for Multivariate Time Series Forecasting
cs.LG · 2026-04-30 · unverdicted · none · ref 14 · internal anchor
ITS-Mina introduces an all-MLP model with iterative refinement, external attention via learnable memory units, and HHO-tuned dropout that reports state-of-the-art or competitive results on six multivariate time series benchmarks versus eleven baselines.
