Neural Turing Machines

Alex Graves, Greg Wayne, Ivo Danihelka · 2014 · cs.NE · arXiv 1410.5401

Canonical reference. 82% of citing Pith papers cite this work as background.

64 Pith papers citing it

Background 82% of classified citations

abstract

We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
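The abstract describes memory that the network reads through attentional processes while staying differentiable end-to-end. A minimal sketch of one such soft read is shown here, assuming content-based addressing with cosine similarity and a softmax sharpened by a key strength `beta`. The abstract does not spell out these details, and the function names, the toy memory, and the parameter values are illustrative assumptions rather than the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    # Small epsilon guards against division by zero for all-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def content_weights(memory, key, beta):
    # Score every memory row against the key, then normalise with a softmax.
    # Larger beta (assumed key strength) concentrates weight on the best match.
    scores = [beta * cosine_similarity(row, key) for row in memory]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, weights):
    # Soft read: a weighted sum of rows, so every row contributes a little
    # and the result varies smoothly with the weights.
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(width)]

# Toy memory with three slots of width three (hypothetical values).
memory = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
key = [0.0, 1.0, 0.0]

weights = content_weights(memory, key, beta=10.0)
read_vector = read(memory, weights)
```

Because the read is a smooth weighted sum rather than a hard lookup, gradients can flow through both the weights and the memory contents, which is what lets such a system be trained with gradient descent.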

citation-role summary

background 11

citation-polarity summary

background 9 · unclear 2

representative citing papers

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · novelty 9.0

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

cs.LG · 2022-01-06 · unverdicted · novelty 8.0

Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

REALM: Retrieval-Augmented Language Model Pre-Training

cs.CL · 2020-02-10 · accept · novelty 8.0

REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.

Categorical Reparameterization with Gumbel-Softmax

stat.ML · 2016-11-03 · unverdicted · novelty 8.0

Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.

Adaptive Computation Time for Recurrent Neural Networks

cs.NE · 2016-03-29 · accept · novelty 8.0

ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.

State commitment learning: training language models to distinguish computation from memory

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces state commitment learning and Counterfactual Erasure RL (CERL) to train models to commit only persistent state, reducing answer dependence on hidden thoughts across math, logic, QA, and tool-use tasks without accuracy loss.

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

cs.CV · 2026-05-13 · accept · novelty 7.0

Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.

Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Vicarious conditioning is proposed as a new intrinsic reward in RL that implements attention, retention, reproduction, and reinforcement via memory methods to enable low-shot learning from others without their policies or rewards, yielding longer episodes in tested environments.

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Multistability is necessary for temporal horizon generalization in POMDPs; it is sufficient on its own in simple tasks and sufficient together with transient dynamics in complex ones, while monostable parallelizable RNNs such as SSMs and gated linear RNNs fail by construction.

Neural Information Causality

quant-ph · 2026-05-10 · unverdicted · novelty 7.0

Neural-IC separates embedding inequalities from capacity bounds in query-separated computations, with one-bit RAC benchmarks and CHSH-layer stability selecting the Tsirelson threshold for quantum enhancements.

Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

Graph Memory Transformer (GMT)

cs.LG · 2026-04-26 · unverdicted · novelty 7.0

Graph Memory Transformer replaces FFN sublayers with a graph memory cell using 128 centroids and transition matrices per block, yielding stable training at 82.2M parameters but higher validation loss than a 103M dense baseline.

Screening Is Enough

cs.LG · 2026-04-01 · unverdicted · novelty 7.0

Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

cs.CL · 2024-04-10 · conditional · novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Augmenting Self-attention with Persistent Memory

cs.LG · 2019-07-02 · unverdicted · novelty 7.0

Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.

Concrete Problems in AI Safety

cs.AI · 2016-06-21 · accept · novelty 7.0

The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

BiDeMem: Bidirectional Degradation Memory for Explainable Image Restoration

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

BiDeMem retrieves compact memory slots via a query from restoration features to jointly improve restoration quality and provide a falsifiable degradation explanation path in a controlled NAFNet multi-degradation setting.

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

AGCLR extends CoCoNuT with a gated concept stream for persistent memory to fix fact loss in latent reasoning, yielding improvements on reasoning benchmarks as depth increases.

citing papers explorer

Showing 7 of 7 citing papers after filters.

Risks from Learned Optimization in Advanced Machine Learning Systems

cs.AI · 2019-06-05 · accept · none · ref 14 · internal anchor

Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

Concrete Problems in AI Safety

cs.AI · 2016-06-21 · accept · none · ref 65 · internal anchor

The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

cs.AI · 2026-06-05 · unverdicted · none · ref 26 · internal anchor

AGCLR extends CoCoNuT with a gated concept stream for persistent memory to fix fact loss in latent reasoning, yielding improvements on reasoning benchmarks as depth increases.

eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

cs.AI · 2026-06-01 · unverdicted · none · ref 29 · internal anchor

eMoT treats reasoning trajectories as dynamic memories with corrosion, symbolic Python anchoring, and consistency refinement, raising accuracy on Game of 24 to 100% and improving math benchmarks over CoT baselines with a lightweight model.

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

cs.AI · 2026-05-17 · unverdicted · none · ref 12 · internal anchor

A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer tokens than full-context baselines.

MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning

cs.AI · 2025-05-31 · unverdicted · none · ref 25 · internal anchor

MIRROR applies cognitive principles of parallel processing, reconstructive synthesis, and complementary learning to AI, yielding 21% relative gains on multi-turn constraint-maintenance tasks across seven models with supporting ablations.

A Neural Turing Machine for Conditional Transition Graph Modeling

cs.AI · 2019-07-15 · unverdicted · none · ref 3 · internal anchor

The CNTM extends NTM to model conditional transition graphs and reproduces paths with accuracies from 82.12% on 10-node graphs to 65.25% on 100-node graphs.
