Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Antonio Torralba; Daniel Karl I. Weidele; Kabir Swain; Mauro Martino; Sijie Han

arxiv: 2605.27686 · v1 · pith:5JQ7EVSVnew · submitted 2026-05-26 · 💻 cs.CV · cs.AI

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

Pith reviewed 2026-06-29 18:06 UTC · model grok-4.3

keywords: recurrent memory, transformer augmentation, video understanding, spatial state, fixed-size memory, long-horizon reasoning, voxel grid, soft write

The pith

A fixed-size 3D memory tensor lets Transformers keep constant state capacity across arbitrarily long image or video sequences while retaining spatial structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tensor Memory as a lightweight add-on to Transformer blocks. Tokens write into a voxel grid through a soft Gaussian deposit around a predicted location, the grid updates via local interactions and gated recurrence, and tokens read back through continuous sampling with residual fusion. Because the tensor size never grows with sequence length, the approach separates memory capacity from input duration. A sympathetic reader would care because standard attention and KV caches scale memory with length and lose explicit spatial persistence, which limits long video reasoning and occlusion handling. The module integrates into existing pipelines without other changes and is tested on language, image, video, and diagnostic tasks.
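
The soft write described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the additive deposit, the isotropic Gaussian, the normalisation of the weights, and the names `voxel_centers` and `soft_write` are all assumptions made for the sketch, and the random locations stand in for the predicted ones.

```python
import numpy as np

def voxel_centers(D, H, W):
    """Return voxel-centre coordinates in [-1, 1]^3 with shape (D, H, W, 3)."""
    axes = [np.linspace(-1.0, 1.0, n) for n in (D, H, W)]
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)

def soft_write(memory, mu, sigma, content):
    """Add `content` (C,) to `memory` (C, D, H, W) as a Gaussian blob at `mu` (3,).

    Assumptions of this sketch: additive deposit, isotropic Gaussian,
    weights normalised to sum to 1 over the grid.
    """
    _, D, H, W = memory.shape
    sq_dist = np.sum((voxel_centers(D, H, W) - mu) ** 2, axis=-1)  # (D, H, W)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    weights = weights / weights.sum()
    return memory + content[:, None, None, None] * weights[None, ...]

rng = np.random.default_rng(0)
memory = np.zeros((4, 8, 8, 8))            # C=4 channels on an 8x8x8 grid
for _ in range(100):                        # 100 writes, state size never changes
    mu = rng.uniform(-1.0, 1.0, size=3)     # stand-in for a predicted location
    memory = soft_write(memory, mu, sigma=0.3, content=rng.normal(size=4))
print(memory.shape)                         # (4, 8, 8, 8)
```

Because every voxel weight is a smooth function of `mu` and `sigma`, gradients can flow back to whatever predicts them, which is the sense in which the write is differentiable.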

Core claim

Tensor Memory augments Transformer blocks with a constant-size recurrent 3D memory tensor whose content is written by differentiable soft deposits of Gaussian volumes at predicted continuous 3D locations, updated by local interaction operators and gated recurrent dynamics, and read by continuous sampling with gated residual fusion; the fixed size therefore decouples state capacity from sequence length while preserving spatial inductive bias.

What carries the argument

The fixed-size recurrent 3D memory tensor that receives soft Gaussian-weighted writes at predicted continuous locations, applies local interaction and gated recurrence, and supplies context via continuous reads with residual fusion.
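
One way to picture the "local interaction and gated recurrence" step is an LSTM-style update applied voxel by voxel, in the spirit of a convolutional LSTM. The sketch below is hypothetical: a six-neighbour mean stands in for the paper's local interaction operator, scalar biases stand in for learned gate parameters, and the names `local_mean` and `gated_update` are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_mean(x):
    """Average each voxel with its six face neighbours (zero padding at borders)."""
    p = np.pad(x, ((0, 0), (1, 1), (1, 1), (1, 1)))
    total = (p[:, 1:-1, 1:-1, 1:-1]
             + p[:, :-2, 1:-1, 1:-1] + p[:, 2:, 1:-1, 1:-1]
             + p[:, 1:-1, :-2, 1:-1] + p[:, 1:-1, 2:, 1:-1]
             + p[:, 1:-1, 1:-1, :-2] + p[:, 1:-1, 1:-1, 2:])
    return total / 7.0

def gated_update(h, c, deposit, gate_bias=(0.0, 2.0, 0.0)):
    """One LSTM-style step on (C, D, H, W) volumes.

    `deposit` is the volume produced by this step's write. The gates here
    share one pre-activation and differ only by a scalar bias, which is a
    simplification of learned gate weights.
    """
    b_i, b_f, b_o = gate_bias
    ctx = local_mean(h) + deposit            # local interaction plus the new write
    i = sigmoid(ctx + b_i)                   # input gate
    f = sigmoid(ctx + b_f)                   # forget gate
    o = sigmoid(ctx + b_o)                   # output gate
    c_new = f * c + i * np.tanh(ctx)         # cell state keeps a gated share of the past
    h_new = o * np.tanh(c_new)               # hidden state exposed to reads
    return h_new, c_new

h = np.zeros((4, 8, 8, 8))
c = np.zeros_like(h)
deposit = np.zeros_like(h)
deposit[:, 4, 4, 4] = 1.0                    # a single write at the grid centre
for _ in range(3):
    h, c = gated_update(h, c, deposit)
print(h.shape, c.shape)                      # both stay (4, 8, 8, 8)
```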

If this is right

Memory footprint stays constant even when input sequences become arbitrarily long.
The module can be inserted into or removed from standard Transformer blocks without altering other architecture or training code.
Spatial inductive bias is retained because writes and reads operate on a 3D grid rather than a flat token list.
The same module works for language, static images, and video without task-specific redesign.
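
The constant-footprint point can be made concrete with back-of-envelope arithmetic. The sketch below compares a standard KV cache, whose size is proportional to sequence length, with a pair of fixed volumes. Every number in it (layer count, head count, grid size, 2-byte elements) is a hypothetical chosen for illustration, not a configuration from the paper, and the tensor-memory figure is per attached module.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for cached keys and values: grows linearly with seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def tensor_memory_bytes(C, D, H, W, bytes_per_elem=2, n_volumes=2):
    """Bytes for the hidden and cell volumes: there is no seq_len argument."""
    return n_volumes * C * D * H * W * bytes_per_elem

for seq_len in (1_000, 10_000, 100_000):
    kv = kv_cache_bytes(seq_len, n_layers=12, n_kv_heads=12, head_dim=64)
    tm = tensor_memory_bytes(C=64, D=8, H=16, W=16)
    print(f"{seq_len:>7} tokens  KV cache {kv / 2**20:8.1f} MiB  "
          f"tensor memory {tm / 2**20:.2f} MiB")
```

The comparison says nothing about how much useful information each structure retains; it only shows which quantity depends on sequence length.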

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The continuous 3D location prediction may allow the memory to represent objects that move smoothly between discrete voxel centers, something a fixed grid without soft writes would lose.
Because reads use gated residual fusion, the memory can act as an optional spatial prior rather than a mandatory replacement for attention.
The design could be tested on tasks that require explicit 3D world modeling, such as multi-view reconstruction from video, even though the paper evaluates only 2D benchmarks.
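
The first two extensions above can be illustrated with a small sketch of a continuous read followed by gated residual fusion. Trilinear interpolation is used here as one common form of continuous sampling; whether the paper uses exactly this scheme is an assumption, as are the corner-aligned coordinate mapping and the names `trilinear_read` and `gated_fuse`. The sketch also assumes the memory channel count matches the token width so no projection is needed.

```python
import numpy as np

def trilinear_read(memory, mu):
    """Sample `memory` (C, D, H, W) at continuous `mu` in [-1, 1]^3, ordered (d, h, w).

    Assumes a corner-aligned mapping (-1 -> index 0, +1 -> last index)
    and at least two voxels per axis.
    """
    sizes = np.array(memory.shape[1:])
    pos = (np.clip(mu, -1.0, 1.0) + 1.0) * 0.5 * (sizes - 1)
    lo = np.minimum(np.floor(pos).astype(int), sizes - 2)
    frac = pos - lo
    out = np.zeros(memory.shape[0])
    for dd in (0, 1):
        for dh in (0, 1):
            for dw in (0, 1):
                corner = np.array([dd, dh, dw])
                weight = np.prod(np.where(corner == 1, frac, 1.0 - frac))
                out += weight * memory[:, lo[0] + dd, lo[1] + dh, lo[2] + dw]
    return out

def gated_fuse(token, read, gate_logit):
    """Residual fusion: token + sigmoid(gate_logit) * read."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return token + gate * read

ramp = np.zeros((1, 5, 5, 5))
ramp[0] = np.arange(5.0)                       # value equals the index along W
value = trilinear_read(ramp, np.array([0.0, 0.0, 0.25]))
print(value)                                   # [2.5], halfway between voxels 2 and 3

token = np.array([1.0])
print(gated_fuse(token, value, gate_logit=-20.0))  # gate nearly closed, token almost unchanged
```

A strongly negative gate logit leaves the token essentially untouched, which is the sense in which the memory behaves as an optional prior rather than a replacement for attention.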

Load-bearing premise

That the soft Gaussian write combined with gated updates can keep the fixed 3D grid carrying useful spatial details across many time steps without quick loss or blurring.
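
A toy calculation shows why this premise is demanding. In a deliberately simplified scalar model, where a voxel keeps a constant fraction `forget_gate` of its content at every step and nothing new is written to it, the content after n steps is scaled by `forget_gate ** n`. The model and both function names are illustrative assumptions; real gates are input-dependent and vary per voxel.

```python
import math

def retained_fraction(forget_gate, n_steps):
    """Share of the original content left after n_steps of pure decay."""
    return forget_gate ** n_steps

def half_life_steps(forget_gate):
    """Steps until half of the original content remains (0 < forget_gate < 1)."""
    return math.log(0.5) / math.log(forget_gate)

for f in (0.9, 0.99, 0.999):
    print(f"forget gate {f}: half-life about {half_life_steps(f):.0f} steps, "
          f"retained after 1000 steps {retained_fraction(f, 1000):.2e}")
```

Under this model, holding detail across thousands of steps requires the gate to sit very close to 1 wherever content should persist, which is the behaviour the long-horizon experiments would need to demonstrate.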

What would settle it

Measure whether models using the module maintain higher accuracy than baseline Transformers on long video sequences that require tracking the same objects through repeated occlusions; equal or worse performance would falsify the claim of useful persistent spatial state.

Figures

Figures reproduced from arXiv: 2605.27686 by Antonio Torralba, Daniel Karl I. Weidele, Kabir Swain, Mauro Martino, Sijie Han.

**Figure 1.** Tensor Memory: a fixed-size 3D voxel state with continuous writes, local recurrent updates, and continuous reads. (1) The memory is a pair of volumes $h_t, c_t \in \mathbb{R}^{C \times D \times H \times W}$ (hidden and cell states); each of the $D \times H \times W$ voxels stores a $C$-dimensional feature vector, and the spatial layout is fixed and constant in input length. (2) Each step $t$ emits a write package $(\mu^{(t)}_{\text{write}}, \text{content}^{(t)}, \sigma^{(t)})$ that deposits a G…

**Figure 2.** Tensor Memory module, end-to-end per chunk. (1) The chunk $t$ consists of $K$ token embeddings $x^{(t)}_1, \dots, x^{(t)}_K$. (2) We form two summaries of the chunk: a read query $x^{(t)}_{\text{read}} = x^{(t)}_1$ (first token) and a write summary $x^{(t)}_{\text{write}} = W_{wp}\,\vec{x}^{(t)}_{1:K}$ (a learned projection over the whole chunk). (3) A shared coordinate head with tied weights produces the read coordinate $\mu^{(t)}_{\text{read}}$ and the write coordina…

**Figure 3.** Tensor Memory in action, illustrated on three input chunks. (1) The input is partitioned into chunks; here, each chunk is a small group of patches over a single frame, drawn in a distinct colour. (2) For each chunk $i$, the write head predicts a continuous 3D coordinate $\mu^{(i)}_{\text{write}} \in [-1, 1]^3$ in the voxel memory; sphere radius is proportional to the predicted Gaussian write spread $\sigma^{(i)}$. (3) Before each wri…
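
The per-chunk steps visible in the Figure 2 caption (a read query from the first token, a write summary from a projection over the whole chunk, and one coordinate head shared by both) can be sketched as follows. The caption is truncated, so the sketch covers only those visible steps. The tanh squashing into $[-1, 1]^3$, the flattening of the chunk before projection, the sizes, and the random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_model = 16, 32                          # hypothetical chunk length and token width

# Hypothetical, randomly initialised parameters (these would be learned).
W_wp = rng.normal(scale=0.05, size=(d_model, K * d_model))  # write projection over the chunk
W_coord = rng.normal(scale=0.05, size=(3, d_model))         # shared coordinate head
b_coord = np.zeros(3)

def coordinate_head(x):
    """Map a d_model vector to a 3D coordinate via tanh (an assumed squashing)."""
    return np.tanh(W_coord @ x + b_coord)

def chunk_summaries(chunk):
    """chunk: (K, d_model). Returns (x_read, x_write)."""
    x_read = chunk[0]                        # read query: first token of the chunk
    x_write = W_wp @ chunk.reshape(-1)       # write summary: projection of the whole chunk
    return x_read, x_write

chunk = rng.normal(size=(K, d_model))
x_read, x_write = chunk_summaries(chunk)
mu_read = coordinate_head(x_read)            # the same head and weights serve both paths
mu_write = coordinate_head(x_write)
print(mu_read.shape, mu_write.shape)         # (3,) (3,)
```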

Original abstract

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tensor Memory gives Transformers a fixed-size 3D voxel recurrent state with Gaussian soft writes and gated updates, which by construction keeps capacity constant and adds spatial bias, though the abstract supplies no numbers to show whether it actually retains useful information over long sequences.

The paper's core move is to attach a fixed-size 3D memory tensor to Transformer blocks. Tokens predict continuous 3D locations, write content as Gaussian-weighted volumes into a voxel grid, update via local interaction and gated recurrence, then read back with continuous sampling and residual fusion. This keeps the state size independent of sequence length while retaining an explicit spatial layout.

The design is new in its specific combination of continuous location prediction, differentiable Gaussian soft writes, and gated fusion on a 3D grid. It directly targets the two problems stated in the abstract: growing memory with length and missing persistent spatial state. The claim that capacity decouples from input length holds by construction, as the stress-test note observes.

What is missing is any concrete evidence. The abstract mentions evaluations on language, image, and video benchmarks plus a toy diagnostic suite, but supplies no accuracy numbers, ablations, or comparisons against KV caching or other recurrent baselines. Without those, it is impossible to judge whether the soft-write operator and gated dynamics actually prevent rapid degradation or loss of detail across long horizons, which is the weakest assumption.

The local interaction operator and continuous sampling are described at a high level only. If the full paper contains controlled experiments that isolate when the persistent state helps, especially on occlusion or long-video tasks, that would strengthen the case. As presented, the work is an architecture proposal rather than a validated result.

This is aimed at people working on video transformers or long-horizon vision models who want a spatial memory primitive that does not grow with tokens. It deserves peer review because it offers a concrete module for a known limitation and integrates without other changes, even if the experiments will need close scrutiny.

Referee Report

0 major / 2 minor

Summary. The paper introduces Tensor Memory, a module augmenting Transformer blocks with a fixed-size recurrent 3D memory tensor. Tokens perform differentiable soft writes that deposit content as Gaussian-weighted volumes around predicted continuous 3D locations into a voxel grid; the memory is updated via an efficient local interaction operator and gated recurrent dynamics; tokens read context via continuous sampling with gated residual fusion. The fixed tensor size decouples state capacity from input length while retaining a spatial inductive bias. The module is evaluated on language, image, and video benchmarks plus a controlled toy diagnostic suite for isolating persistent-state benefits, and integrates with standard training pipelines without other architectural changes.

Significance. The core architectural property—that a constant-size 3D tensor guarantees capacity independent of sequence length—is achieved by construction and directly addresses the stated limitations of growing KV caches and absent explicit spatial state. The toy diagnostic suite is a positive design choice for testing when persistent state matters. If the empirical results on the benchmarks confirm stable long-horizon retention without rapid degradation, the module would be a practical, attachable addition for video and long-context tasks.

minor comments (2)

[Abstract] The abstract states that the module 'can be attached to or removed from existing blocks without other architectural changes,' but does not specify the exact insertion points or any required hyper-parameter retuning; a short paragraph in §3 or §4 clarifying this would improve reproducibility.
[Abstract] Notation for the soft-write operator, local interaction, and gated dynamics is introduced in the abstract without symbols; defining them with consistent symbols in the methods section would aid readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of the fixed-size 3D memory property, and recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity

The paper's core statement that a fixed-size 3D memory tensor decouples capacity from sequence length follows directly from the explicit design choice of constant tensor size and is presented as a descriptive property of the module rather than a derived prediction or theorem. No equations, fitted parameters, self-citations, or uniqueness claims are invoked in the abstract or described architecture to support the claim; the spatial bias is likewise supplied by the voxel grid and continuous read/write operations by construction. The derivation chain is self-contained with no reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the module description implies unstated choices such as grid resolution and Gaussian width that would need to be treated as design decisions in any implementation.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AURA: Action-Gated Memory for Robot Policies at Constant VRAM
cs.AI 2026-06 unverdicted novelty 7.0

AURA-Mem uses an action-gated recurrent memory trained on closed-loop action error to deliver constant 4,224-byte state and 5-9x fewer writes than baselines while matching base policy success on LIBERO-Long.

