A 2x2 ablation shows repeated shared access enables grokking while addressable memory (not recurrence) enables edit propagation in transformer variants on synthetic KG QA.
Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
2 Pith papers cite this work. Polarity classification is still indexing.
abstract
We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non-trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and a dilution boundary at T=64. Under halt-side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero-bias and Graves' recommended positive bias settle into a shallow halt equilibrium the model cannot escape. Inverting the bias to -3 ("deep start") eliminates the failure mode, and ablation shows the trap is inherent to ACT initialization rather than an artifact of our architecture. With reliable training, ACT yields an order of magnitude lower seed variance than fixed-depth processing (+/-0.7 vs +/-9.3 pp); lambda warmup recovers 34% of compute at matched accuracy; and attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code: https://github.com/che-shr-cat/utm-jax.
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Compressed looped Transformers with fixed small recurrent state cannot decide P-complete problems under logspace reductions, while polynomial-length chain-of-thought can.
citing papers explorer
-
Repeated Shared Access Enables Grokking, but Edit Propagation Depends on an Addressable Memory
A 2x2 ablation shows repeated shared access enables grokking while addressable memory (not recurrence) enables edit propagation in transformer variants on synthetic KG QA.