Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

2026 · cs.LG · arXiv 2604.21999

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non-trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and a dilution boundary at T=64. Under halt-side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero-bias and Graves' recommended positive bias settle into a shallow halt equilibrium the model cannot escape. Inverting the bias to -3 ("deep start") eliminates the failure mode, and ablation shows the trap is inherent to ACT initialization rather than an artifact of our architecture. With reliable training, ACT yields an order of magnitude lower seed variance than fixed-depth processing (+/-0.7 vs +/-9.3 pp); lambda warmup recovers 34% of compute at matched accuracy; and attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code: https://github.com/che-shr-cat/utm-jax.
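The router initialization trap described in the abstract can be illustrated with a small numeric sketch. It assumes a Graves-style ACT rule in which the model halts once the cumulative halting probability reaches 1 - eps, and it assumes that at initialization the router weights contribute roughly nothing, so every step emits sigmoid(bias). The function name `initial_halt_step`, the values `max_steps=16` and `eps=0.01`, and the choice of +1.0 as the "positive bias" are illustrative assumptions, not settings taken from the paper or its code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used by the (assumed) halting router."""
    return 1.0 / (1.0 + math.exp(-x))


def initial_halt_step(bias: float, max_steps: int = 16, eps: float = 0.01) -> int:
    """Ponder step at which a freshly initialized router would halt.

    Assumption: the router outputs sigmoid(bias) at every step, and halting
    happens once the running sum of those outputs reaches 1 - eps.
    If that never happens within max_steps, the model runs to max_steps.
    """
    p_halt = sigmoid(bias)
    cumulative = 0.0
    for step in range(1, max_steps + 1):
        cumulative += p_halt
        if cumulative >= 1.0 - eps:
            return step
    return max_steps


if __name__ == "__main__":
    # Hypothetical bias values: zero (default), +1 (a positive bias), -3 ("deep start").
    for bias in (0.0, 1.0, -3.0):
        print(f"bias={bias:+.1f}  p_halt={sigmoid(bias):.3f}  "
              f"initial halt step={initial_halt_step(bias)}")
```

Under these assumptions, a zero or positive bias halts after about two steps, so training starts in the shallow regime; a bias of -3 gives a per-step halting probability near 0.05, so the model initially runs to the step cap and has to learn to halt earlier rather than learn to go deeper.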

representative citing papers

Repeated Shared Access Enables Grokking, but Edit Propagation Depends on an Addressable Memory

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

A 2x2 ablation shows repeated shared access enables grokking while addressable memory (not recurrence) enables edit propagation in transformer variants on synthetic KG QA.

Chain-of-Thought and Compressed Looped Transformers: A Memory-Budget Separation

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

Compressed looped Transformers with fixed small recurrent state cannot decide P-complete problems under logspace reductions, while polynomial-length chain-of-thought can.
