hub

Neural Turing Machines

Alex Graves, Greg Wayne, Ivo Danihelka · 2014 · cs.NE · arXiv 1410.5401

28 Pith papers cite this work. Polarity classification is still indexing.

28 Pith papers citing it

open full Pith review browse 28 citing papers arXiv PDF

abstract

We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

cs.LG · 2022-01-06 · unverdicted · novelty 8.0

Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

Categorical Reparameterization with Gumbel-Softmax

stat.ML · 2016-11-03 · unverdicted · novelty 8.0

Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.

Adaptive Computation Time for Recurrent Neural Networks

cs.NE · 2016-03-29 · accept · novelty 8.0

ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

cs.CV · 2026-05-13 · accept · novelty 7.0

Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.

Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Vicarious conditioning is proposed as a new intrinsic reward in RL that implements attention, retention, reproduction, and reinforcement via memory methods to enable low-shot learning from others without their policies or rewards, yielding longer episodes in tested environments.

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs fail by construction.

Neural Information Causality

quant-ph · 2026-05-10 · unverdicted · novelty 7.0

Neural-IC separates embedding inequalities from capacity bounds in query-separated computations, with one-bit RAC benchmarks and CHSH-layer stability selecting the Tsirelson threshold for quantum enhancements.

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

Screening Is Enough

cs.LG · 2026-04-01 · unverdicted · novelty 7.0

Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Concrete Problems in AI Safety

cs.AI · 2016-06-21 · accept · novelty 7.0

The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.

The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.

Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Frozen text-pretrained transformer weights transfer across modalities through a thin interface, achieving SOTA on a robotic task and parity on decision-making with far fewer trainable parameters.

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

cs.LG · 2026-04-23 · conditional · novelty 6.0

Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

cs.LG · 2021-04-27 · accept · novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

Universal Transformers

cs.CL · 2018-07-10 · unverdicted · novelty 6.0

Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

TIDE: Every Layer Knows the Token Beneath the Context

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

cs.LG · 2026-05-06 · unverdicted · novelty 5.0 · 2 refs

FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.

citing papers explorer

Showing 28 of 28 citing papers.

Gradient-Based Program Synthesis with Neurally Interpreted Languages cs.LG · 2026-04-20 · unverdicted · none · ref 88 · internal anchor
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication cs.LG · 2026-03-30 · unverdicted · none · ref 35 · internal anchor
Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.
RULER: What's the Real Context Size of Your Long-Context Language Models? cs.CL · 2024-04-09 · accept · none · ref 15 · internal anchor
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets cs.LG · 2022-01-06 · unverdicted · none · ref 5 · internal anchor
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
Show Your Work: Scratchpads for Intermediate Computation with Language Models cs.LG · 2021-11-30 · unverdicted · none · ref 9 · internal anchor
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
Categorical Reparameterization with Gumbel-Softmax stat.ML · 2016-11-03 · unverdicted · none · ref 4 · internal anchor
Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.
Adaptive Computation Time for Recurrent Neural Networks cs.NE · 2016-03-29 · accept · none · ref 10 · internal anchor
ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.
Does Engram Do Memory Retrieval in Autoregressive Image Generation? cs.CV · 2026-05-13 · accept · none · ref 5 · internal anchor
Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.
Intrinsic Vicarious Conditioning for Deep Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 27 · internal anchor
Vicarious conditioning is proposed as a new intrinsic reward in RL that implements attention, retention, reproduction, and reinforcement via memory methods to enable low-shot learning from others without their policies or rewards, yielding longer episodes in tested environments.
On the Importance of Multistability for Horizon Generalization in Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 33 · internal anchor
Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs fail by construction.
Neural Information Causality quant-ph · 2026-05-10 · unverdicted · none · ref 19 · internal anchor
Neural-IC separates embedding inequalities from capacity bounds in query-separated computations, with one-bit RAC benchmarks and CHSH-layer stability selecting the Tsirelson threshold for quantum enhancements.
Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval stat.ML · 2026-05-06 · unverdicted · none · ref 32 · internal anchor
Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.
Screening Is Enough cs.LG · 2026-04-01 · unverdicted · none · ref 25 · internal anchor
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 65 · internal anchor
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Concrete Problems in AI Safety cs.AI · 2016-06-21 · accept · none · ref 65 · internal anchor
The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory cs.LG · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
The Position Curse: LLMs Struggle to Locate the Last Few Items in a List cs.LG · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.
Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities cs.LG · 2026-05-01 · unverdicted · none · ref 8 · internal anchor
Frozen text-pretrained transformer weights transfer across modalities through a thin interface, achieving SOTA on a robotic task and parity on decision-making with far fewer trainable parameters.
Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning cs.LG · 2026-04-23 · conditional · none · ref 19 · internal anchor
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents cs.CL · 2026-04-22 · unverdicted · none · ref 18 · internal anchor
ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges cs.LG · 2021-04-27 · accept · none · ref 34 · internal anchor
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
Universal Transformers cs.CL · 2018-07-10 · unverdicted · none · ref 13 · internal anchor
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.
TIDE: Every Layer Knows the Token Beneath the Context cs.CL · 2026-05-07 · unverdicted · none · ref 81 · internal anchor
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation cs.LG · 2026-05-06 · unverdicted · none · ref 6 · 2 links · internal anchor
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.
Graph Memory Transformer (GMT) cs.LG · 2026-04-26 · unverdicted · none · ref 15 · internal anchor
Graph Memory Transformer (GMT) swaps dense FFN sublayers for a graph of 128 centroids and a learned 128x128 transition matrix per block, yielding a 82M-parameter decoder-only LM that trains stably but trails a 103M dense baseline in perplexity.
Neural Computers cs.LG · 2026-04-07 · unverdicted · none · ref 11 · internal anchor
Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives from traces.
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making cs.LG · 2026-04-08 · unverdicted · none · ref 8 · internal anchor
An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.
A PyTorch Library of Turing-Complete Neural Networks cs.LG · 2026-05-03 · unverdicted · none · ref 2 · internal anchor
A PyTorch package constructs neural networks that exactly simulate given Turing machines using transformer and recurrent architectures derived from prior theoretical results.

Neural Turing Machines

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer