Neural Turing Machines

Canonical reference. 82% of citing Pith papers cite this work as background.

54 Pith papers citing it
Background 82% of classified citations
abstract

We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
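
The attentional memory access described in the abstract can be illustrated with a minimal sketch of a content-based soft read. It assumes cosine-similarity addressing sharpened by a positive scalar and normalised with a softmax. The function name `content_read`, the `(N, M)` memory layout, and the parameter name `beta` are illustrative choices, not taken from this page. It covers only a single read and omits the controller, writing, and location-based addressing.

```python
import numpy as np

def content_read(memory, key, beta=1.0):
    """Soft, differentiable read from an external memory (illustrative sketch).

    memory: (N, M) array, N slots of width M (assumed layout)
    key:    (M,) query vector, e.g. emitted by a controller network
    beta:   positive sharpness; larger values focus the read on fewer slots
    """
    eps = 1e-8  # guards against division by zero for all-zero rows or keys
    # Cosine similarity between the key and every memory row.
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    # Softmax over slots turns similarities into attention weights.
    logits = beta * sim
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    # The read vector is a convex combination of the memory rows.
    return weights @ memory, weights

# Usage: three one-hot slots; the key matches the second slot.
mem = np.eye(3)
read, w = content_read(mem, np.array([0.0, 1.0, 0.0]), beta=50.0)
```

Because every step is a smooth function of `memory`, `key`, and `beta`, gradients can flow through the read, which is what makes this kind of access trainable with gradient descent.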

citation-role summary

background 11

citation-polarity summary

background 9 · unclear 2

representative citing papers

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

REALM: Retrieval-Augmented Language Model Pre-Training

cs.CL · 2020-02-10 · accept · novelty 8.0

REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.

Adaptive Computation Time for Recurrent Neural Networks

cs.NE · 2016-03-29 · accept · novelty 8.0

ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.

Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Vicarious conditioning is proposed as a new intrinsic reward in RL that implements attention, retention, reproduction, and reinforcement via memory methods to enable low-shot learning from others without their policies or rewards, yielding longer episodes in tested environments.

Neural Information Causality

quant-ph · 2026-05-10 · unverdicted · novelty 7.0

Neural-IC separates embedding inequalities from capacity bounds in query-separated computations, with one-bit RAC benchmarks and CHSH-layer stability selecting the Tsirelson threshold for quantum enhancements.

Screening Is Enough

cs.LG · 2026-04-01 · unverdicted · novelty 7.0

Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Augmenting Self-attention with Persistent Memory

cs.LG · 2019-07-02 · unverdicted · novelty 7.0

Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.

Concrete Problems in AI Safety

cs.AI · 2016-06-21 · accept · novelty 7.0

The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

Task-Focused Memorization for Multimodal Agents

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.

citing papers explorer

Showing 50 of 54 citing papers.

  • Risks from Learned Optimization in Advanced Machine Learning Systems cs.AI · 2019-06-05 · accept · none · ref 14 · internal anchor

    Mesa-optimization arises when learned models act as optimizers with objectives that can differ from their training loss, creating alignment risks in advanced machine learning.

  • Gradient-Based Program Synthesis with Neurally Interpreted Languages cs.LG · 2026-04-20 · unverdicted · none · ref 88 · internal anchor

    NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

  • On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication cs.LG · 2026-03-30 · unverdicted · none · ref 35 · internal anchor

    Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

  • RULER: What's the Real Context Size of Your Long-Context Language Models? cs.CL · 2024-04-09 · accept · none · ref 15 · internal anchor

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  • Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets cs.LG · 2022-01-06 · unverdicted · none · ref 5 · internal anchor

    Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

  • Show Your Work: Scratchpads for Intermediate Computation with Language Models cs.LG · 2021-11-30 · unverdicted · none · ref 9 · internal anchor

    Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

  • REALM: Retrieval-Augmented Language Model Pre-Training cs.CL · 2020-02-10 · accept · none · ref 5 · internal anchor

    REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.

  • Categorical Reparameterization with Gumbel-Softmax stat.ML · 2016-11-03 · unverdicted · none · ref 4 · internal anchor

    Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.

  • Adaptive Computation Time for Recurrent Neural Networks cs.NE · 2016-03-29 · accept · none · ref 10 · internal anchor

    ACT lets RNNs dynamically adapt computation depth per input via a differentiable halting unit, yielding large gains on synthetic tasks and structural insights on language data.

  • Does Engram Do Memory Retrieval in Autoregressive Image Generation? cs.CV · 2026-05-13 · accept · none · ref 5 · internal anchor

    Engram in AR image generation saves backbone FLOPs but trails pure AR baselines in FID and behaves as a gated side-pathway rather than a content-addressed retriever.

  • Intrinsic Vicarious Conditioning for Deep Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 27 · internal anchor

    Vicarious conditioning is proposed as a new intrinsic reward in RL that implements attention, retention, reproduction, and reinforcement via memory methods to enable low-shot learning from others without their policies or rewards, yielding longer episodes in tested environments.

  • On the Importance of Multistability for Horizon Generalization in Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 33 · internal anchor

    Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs fail by construction.

  • Neural Information Causality quant-ph · 2026-05-10 · unverdicted · none · ref 19 · internal anchor

    Neural-IC separates embedding inequalities from capacity bounds in query-separated computations, with one-bit RAC benchmarks and CHSH-layer stability selecting the Tsirelson threshold for quantum enhancements.

  • Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval stat.ML · 2026-05-06 · unverdicted · none · ref 32 · internal anchor

    Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

  • Screening Is Enough cs.LG · 2026-04-01 · unverdicted · none · ref 25 · internal anchor

    Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 65 · internal anchor

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  • Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention cs.CL · 2024-04-10 · conditional · none · ref 13 · internal anchor

    Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

  • Massive Activations in Large Language Models cs.CL · 2024-02-27 · unverdicted · none · ref 34 · internal anchor

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  • Augmenting Self-attention with Persistent Memory cs.LG · 2019-07-02 · unverdicted · none · ref 15 · internal anchor

    Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.

  • Concrete Problems in AI Safety cs.AI · 2016-06-21 · accept · none · ref 65 · internal anchor

    The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

  • Task-Focused Memorization for Multimodal Agents cs.CV · 2026-05-29 · unverdicted · none · ref 21 · internal anchor

    TaskMem uses RL in two phases to learn a task-focused memorization policy for multimodal agents, yielding 5.3-7.0% VQA accuracy gains on reformulated streaming benchmarks from VideoMME, EgoLife, and EgoTempo.

  • Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory cs.LG · 2026-05-13 · unverdicted · none · ref 6 · internal anchor

    PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.

  • The Position Curse: LLMs Struggle to Locate the Last Few Items in a List cs.LG · 2026-05-08 · unverdicted · none · ref 7 · internal anchor

    LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.

  • Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning cs.LG · 2026-04-23 · conditional · none · ref 19 · internal anchor

    Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

  • BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning cs.RO · 2026-03-12 · unverdicted · none · ref 13 · internal anchor

    BrainMem equips LLM-based embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.

  • On the Spatiotemporal Dynamics of Generalization in Neural Networks cs.LG · 2026-02-02 · unverdicted · none · ref 36 · internal anchor

    Deriving a neural cellular automaton from locality, symmetry, and stability postulates produces 100% accurate addition generalization from 16-digit to 1-million-digit inputs.

  • Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling cs.LG · 2025-08-22 · unverdicted · none · ref 25 · internal anchor

    In a cellular automata rule-inference task designed to block memorization, neural models achieve high next-step accuracy but accuracy falls sharply with longer reasoning chains; depth, recurrence, memory, and test-time compute extend the reachable depth but do not remove the bound.

  • MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent cs.CL · 2025-07-03 · unverdicted · none · ref 30 · internal anchor

    MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.

  • Titans: Learning to Memorize at Test Time cs.LG · 2024-12-31 · unverdicted · none · ref 38 · internal anchor

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  • Solving math word problems with process- and outcome-based feedback cs.LG · 2022-11-25 · unverdicted · none · ref 17 · internal anchor

    On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.

  • Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges cs.LG · 2021-04-27 · accept · none · ref 34 · internal anchor

    Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

  • Compressive Transformers for Long-Range Sequence Modelling cs.LG · 2019-11-13 · unverdicted · none · ref 51 · internal anchor

    Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

  • Bayesian Synthesis of Probabilistic Programs for Automatic Data Modeling cs.PL · 2019-07-14 · unverdicted · none · ref 7 · internal anchor

    Bayesian synthesis formulates automatic construction of probabilistic programs in PCFG-specified DSLs with soundness conditions, enabling structure analysis and prediction that outperforms baselines on real datasets.

  • Universal Transformers cs.CL · 2018-07-10 · unverdicted · none · ref 13 · internal anchor

    Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

  • Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents cs.AI · 2026-05-17 · unverdicted · none · ref 12 · internal anchor

    A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer tokens than full-context baselines.

  • TIDE: Every Layer Knows the Token Beneath the Context cs.CL · 2026-05-07 · unverdicted · none · ref 81 · internal anchor

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

  • FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation cs.LG · 2026-05-06 · unverdicted · none · ref 6 · 2 links · internal anchor

    FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.

  • Borrowed Geometry: Cross-Distribution Head-Importance Fingerprints of Frozen Pretrained Gemma 4 31B cs.LG · 2026-05-01 · unverdicted · none · ref 4 · 2 links · internal anchor

    Four heads (L26.28, L27.28, L27.2, L27.3) in frozen Gemma 4 31B exhibit joint high importance on text and non-text tasks with hypergeometric significance (P=0.0013) and causal validation on a cube task.

  • Neural Computers cs.LG · 2026-04-07 · unverdicted · none · ref 11 · internal anchor

    Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives from traces.

  • MIRROR: Converging Cognitive Principles as Computational Mechanisms for AI Reasoning cs.AI · 2025-05-31 · unverdicted · none · ref 25 · internal anchor

    MIRROR applies cognitive principles of parallel processing, reconstructive synthesis, and complementary learning to AI, yielding 21% relative gains on multi-turn constraint-maintenance tasks across seven models with supporting ablations.

  • MemoryBank: Enhancing Large Language Models with Long-Term Memory cs.CL · 2023-05-17 · unverdicted · none · ref 4 · internal anchor

    MemoryBank equips LLMs with long-term memory using Ebbinghaus-inspired updates, allowing recall and personality adaptation in chatbots like SiliconFriend.

  • Sketch of a novel approach to a neural model q-bio.NC · 2022-09-14 · unverdicted · none · ref 35 · internal anchor

    The paper sketches a neuron-centric model of neuroplasticity that separates neural transmission from internal signal selection and storage within each neuron rather than relying solely on synaptic weights.

  • A Neural Turing Machine for Conditional Transition Graph Modeling cs.AI · 2019-07-15 · unverdicted · none · ref 3 · internal anchor

    The CNTM extends NTM to model conditional transition graphs and reproduces paths with accuracies from 82.12% on 10-node graphs to 65.25% on 100-node graphs.

  • Click-Through Rate Prediction with the User Memory Network cs.IR · 2019-07-09 · unverdicted · none · ref 7 · internal anchor

    MA-DNN augments DNNs with per-user memory vectors capturing likes and dislikes to exploit historical behavior for CTR prediction while remaining simpler than RNNs.

  • Understanding Memory Modules on Learning Simple Algorithms cs.LG · 2019-07-01 · unverdicted · none · ref 2 · internal anchor

    NTM and stack-augmented networks both generalize on sequence reversal but only the stack model succeeds on arithmetic expressions by monitoring different input categories and applying distinct memory-update policies.

  • ARMIN: Towards a More Efficient and Light-weight Recurrent Memory Network cs.LG · 2019-06-28 · unverdicted · none · ref 9 · internal anchor

    ARMIN introduces auto-addressing via hidden states and a novel RNN cell to produce a lighter recurrent memory network with lower overhead than existing MANNs or vanilla LSTMs.

  • Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making cs.LG · 2026-04-08 · unverdicted · none · ref 8 · internal anchor

    An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.

  • Incremental Concept Learning via Online Generative Memory Recall cs.LG · 2019-07-05 · unverdicted · none · ref 10 · internal anchor

    A pseudo-rehearsal method combines cGAN-generated old-concept samples, balanced online recall, and a concept contrastive loss for class-incremental learning on MNIST, Fashion-MNIST, and SVHN.

  • S-AI-Recursive: A Bio-Inspired and Temporal Sparse AI Architecture for Iterative, Introspective, and Energy-Frugal Reasoning cs.NE · 2026-05-05 · unverdicted · none · ref 21 · internal anchor

    S-AI-Recursive operationalizes reasoning as a closed-loop hormonal iteration with Clarifine and Confusionin to reach stable equilibrium, achieving competitive benchmark performance with under 10 million parameters via temporal depth instead of width.

  • A PyTorch Library of Turing-Complete Neural Networks cs.LG · 2026-05-03 · unverdicted · none · ref 2 · internal anchor

    A PyTorch package constructs neural networks that exactly simulate given Turing machines using transformer and recurrent architectures derived from prior theoretical results.