pith. sign in

arxiv: 2605.27686 · v1 · pith:5JQ7EVSVnew · submitted 2026-05-26 · 💻 cs.CV · cs.AI

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Pith reviewed 2026-06-29 18:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords recurrent memorytransformer augmentationvideo understandingspatial statefixed-size memorylong-horizon reasoningvoxel gridsoft write
0
0 comments X

The pith

A fixed-size 3D memory tensor lets Transformers keep constant state capacity across arbitrarily long image or video sequences while retaining spatial structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tensor Memory as a lightweight add-on to Transformer blocks. Tokens write into a voxel grid through a soft Gaussian deposit around a predicted location, the grid updates via local interactions and gated recurrence, and tokens read back through continuous sampling with residual fusion. Because the tensor size never grows with sequence length, the approach separates memory capacity from input duration. A sympathetic reader would care because standard attention and KV caches scale memory with length and lose explicit spatial persistence, which limits long video reasoning and occlusion handling. The module integrates into existing pipelines without other changes and is tested on language, image, video, and diagnostic tasks.

Core claim

Tensor Memory augments Transformer blocks with a constant-size recurrent 3D memory tensor whose content is written by differentiable soft deposits of Gaussian volumes at predicted continuous 3D locations, updated by local interaction operators and gated recurrent dynamics, and read by continuous sampling with gated residual fusion; the fixed size therefore decouples state capacity from sequence length while preserving spatial inductive bias.

What carries the argument

The fixed-size recurrent 3D memory tensor that receives soft Gaussian-weighted writes at predicted continuous locations, applies local interaction and gated recurrence, and supplies context via continuous reads with residual fusion.

If this is right

  • Memory footprint stays constant even when input sequences become arbitrarily long.
  • The module can be inserted into or removed from standard Transformer blocks without altering other architecture or training code.
  • Spatial inductive bias is retained because writes and reads operate on a 3D grid rather than a flat token list.
  • The same module works for language, static images, and video without task-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The continuous 3D location prediction may allow the memory to represent objects that move smoothly between discrete voxel centers, something a fixed grid without soft writes would lose.
  • Because reads use gated residual fusion, the memory can act as an optional spatial prior rather than a mandatory replacement for attention.
  • The design could be tested on tasks that require explicit 3D world modeling, such as multi-view reconstruction from video, even though the paper evaluates only 2D benchmarks.

Load-bearing premise

That the soft Gaussian write combined with gated updates can keep the fixed 3D grid carrying useful spatial details across many time steps without quick loss or blurring.

What would settle it

Measure whether models using the module maintain higher accuracy than baseline Transformers on long video sequences that require tracking the same objects through repeated occlusions; equal or worse performance would falsify the claim of useful persistent spatial state.

Figures

Figures reproduced from arXiv: 2605.27686 by Antonio Torralba, Daniel Karl I. Weidele, Kabir Swain, Mauro Martino, Sijie Han.

Figure 1
Figure 1. Figure 1: Tensor Memory: a fixed-size 3D voxel state with continuous writes, local recurrent updates, and continuous reads. (1) The memory is a pair of volumes ht, ct ∈ R C×D×H×W (hidden and cell states); each of the D×H×W voxels stores a C-dimensional feature vector and the spatial layout is fixed and constant in input length. (2) Each step t emits a write package (µ (t) write, content(t) , σ(t) ) that deposits a G… view at source ↗
Figure 2
Figure 2. Figure 2: Tensor Memory module, end-to-end per chunk. (1) The chunk t consists of K token embeddings x (t) 1 , . . . , x (t) K . (2) We form two summaries of the chunk: a read query x (t) read = x (t) 1 (first token) and a write summary x (t) write = Wwp ⃗x(t) 1:K (a learned projection over the whole chunk). (3) A shared coordinate head with tied weights produces the read coordinate µ (t) read and the write coordina… view at source ↗
Figure 3
Figure 3. Figure 3: Tensor Memory in action, illustrated on three input chunks. (1) The input is partitioned into chunks; here, each chunk is a small group of patches over a single frame, drawn in a distinct colour. (2) For each chunk i, the write head predicts a continuous 3D coordinate µ (i) write ∈ [−1, 1]3 in the voxel memory; sphere radius is proportional to the predicted Gaussian write spread σ (i) . (3) Before each wri… view at source ↗
read the original abstract

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces Tensor Memory, a module augmenting Transformer blocks with a fixed-size recurrent 3D memory tensor. Tokens perform differentiable soft writes that deposit content as Gaussian-weighted volumes around predicted continuous 3D locations into a voxel grid; the memory is updated via an efficient local interaction operator and gated recurrent dynamics; tokens read context via continuous sampling with gated residual fusion. The fixed tensor size decouples state capacity from input length while retaining a spatial inductive bias. The module is evaluated on language, image, and video benchmarks plus a controlled toy diagnostic suite for isolating persistent-state benefits, and integrates with standard training pipelines without other architectural changes.

Significance. The core architectural property—that a constant-size 3D tensor guarantees capacity independent of sequence length—is achieved by construction and directly addresses the stated limitations of growing KV caches and absent explicit spatial state. The toy diagnostic suite is a positive design choice for testing when persistent state matters. If the empirical results on the benchmarks confirm stable long-horizon retention without rapid degradation, the module would be a practical, attachable addition for video and long-context tasks.

minor comments (2)
  1. [Abstract] The abstract states that the module 'can be attached to or removed from existing blocks without other architectural changes,' but does not specify the exact insertion points or any required hyper-parameter retuning; a short paragraph in §3 or §4 clarifying this would improve reproducibility.
  2. [Abstract] Notation for the soft-write operator, local interaction, and gated dynamics is introduced in the abstract without symbols; defining them with consistent symbols in the methods section would aid readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of the fixed-size 3D memory property, and recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core statement that a fixed-size 3D memory tensor decouples capacity from sequence length follows directly from the explicit design choice of constant tensor size and is presented as a descriptive property of the module rather than a derived prediction or theorem. No equations, fitted parameters, self-citations, or uniqueness claims are invoked in the abstract or described architecture to support the claim; the spatial bias is likewise supplied by the voxel grid and continuous read/write operations by construction. The derivation chain is self-contained with no reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the module description implies unstated choices such as grid resolution and Gaussian width that would need to be treated as design decisions in any implementation.

pith-pipeline@v0.9.1-grok · 5726 in / 1216 out tokens · 21433 ms · 2026-06-29T18:06:39.073175+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AURA: Action-Gated Memory for Robot Policies at Constant VRAM

    cs.AI 2026-06 unverdicted novelty 7.0

    AURA-Mem uses an action-gated recurrent memory trained on closed-loop action error to deliver constant 4,224-byte state and 5-9x fewer writes than baselines while matching base policy success on LIBERO-Long.

Reference graph

Works this paper leans on

40 extracted references · 39 canonical work pages · cited by 1 Pith paper · 29 internal anchors

  1. [1]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y ., Lebr´on, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head check- points.arXiv preprint arXiv:2305.13245,

  2. [2]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Long- former: The long-document transformer.arXiv preprint arXiv:2004.05150,

  3. [3]

    Is space-time attention all you need for video understanding?arXiv preprint arXiv:2102.05095,

    Bertasius, G., Wang, H., and Torresani, L. Is space-time attention all you need for video understanding?arXiv preprint arXiv:2102.05095,

  4. [4]

    Improving language models by retrieving from trillions of tokens

    Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., van den Driessche, G., Lespiau, J.-B., Damoc, B., Clark, A., et al. Improving language models 8 Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers by retrieving from trillions of tokens.arXiv preprint arXiv:2112.04426,

  5. [5]

    Bulatov, A., Kuratov, Y ., and Burtsev, M. S. Recurrent memory transformer.arXiv preprint arXiv:2207.06881,

  6. [6]

    Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

    Cheng, X., Zeng, W., Dai, D., Chen, Q., Wang, B., Xie, Z., Huang, K., Yu, X., Hao, Z., Li, Y ., Zhang, H., Zhang, H., Zhao, D., and Liang, W. Conditional memory via scalable lookup: A new axis of sparsity for large language models. Technical Report arXiv:2601.07372, arXiv,

  7. [7]

    Rethinking Attention with Performers

    Choromanski, K., Likhosherstov, V ., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., and Weller, A. Rethinking attention with performers.arXiv preprint arXiv:2009.14794,

  8. [8]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Dai, Z., Yang, Z., Yang, Y ., Carbonell, J., Le, Q. V ., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. Technical Report arXiv:1901.02860, arXiv,

  9. [9]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Dao, T., Fu, D. Y ., Ermon, S., Rudra, A., and R´e, C. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.arXiv preprint arXiv:2205.14135,

  10. [10]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

  11. [11]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    Fedus, W., Zoph, B., and Shazeer, N. Switch transform- ers: Scaling to trillion parameter models with simple and efficient sparsity.arXiv preprint arXiv:2101.03961,

  12. [12]

    Neural Turing Machines

    Graves, A., Wayne, G., and Danihelka, I. Neural turing ma- chines. Technical Report arXiv:1410.5401, arXiv,

  13. [13]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752,

  14. [14]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861,

  15. [15]

    Huynh, T., Maire, M., and Walter, M. R. Multigrid neural memory.arXiv preprint arXiv:1906.05948,

  16. [16]

    Perceiver IO: A General Architecture for Structured Inputs & Outputs

    Jaegle, A., Borgeaud, S., Alayrac, J.-B., Doersch, C., Ionescu, C., Ding, D., Koppula, S., Zoran, D., Brock, A., Shelhamer, E., H´enaff, O., Botvinick, M., Zisserman, A., Vinyals, O., and Carreira, J. Perceiver io: A general architecture for structured inputs & outputs. Technical Report arXiv:2107.14795, arXiv,

  17. [17]

    Trel- lis: Learning to compress key-value memory in attention models.arXiv preprint arXiv:2512.23852,

    Karami, M., Behrouz, A., Kacham, P., and Mirrokni, V . Trel- lis: Learning to compress key-value memory in attention models.arXiv preprint arXiv:2512.23852,

  18. [18]

    Generalization through memorization: Nearest neighbor language models.arXiv preprint arXiv:1911.00172,

    Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization through memorization: Nearest neighbor language models.arXiv preprint arXiv:1911.00172,

  19. [19]

    Reformer: The Efficient Transformer

    Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer.arXiv preprint arXiv:2001.04451,

  20. [20]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Kwon, W., Li, Z., Zhuang, C., Sheng, Y ., Zheng, L., Yu, C., Gonzalez, J. E., and Stoica, I. Efficient memory management for large language model serving with page- dattention.arXiv preprint arXiv:2309.06180,

  21. [21]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V ., Goyal, N., K¨uttler, H., Lewis, M., Yih, W.-t., Stoy- anov, V ., and Riedel, S. Retrieval-augmented genera- tion for knowledge-intensive nlp tasks.arXiv preprint arXiv:2005.11401,

  22. [22]

    Liu, A. et al. Deepseek-v3 technical report. Technical Report arXiv:2412.19437, arXiv,

  23. [23]

    Video swin transformer.arXiv preprint arXiv:2106.13230,

    Liu, Z., Ning, J., Cao, Y ., Wei, Y ., Zhang, Z., Lin, S., and Hu, H. Video swin transformer.arXiv preprint arXiv:2106.13230,

  24. [24]

    and Seghouani, N

    Malhotra, A. and Seghouani, N. Neural field turing ma- chine: A differentiable spatial computer.arXiv preprint arXiv:2509.03370,

  25. [25]

    Pointer Sentinel Mixture Models

    URL https: //arxiv.org/abs/1609.07843. Munkhdalai, T., Faruqui, M., and Gopal, S. Leave no con- text behind: Efficient infinite context transformers with infini-attention.arXiv preprint arXiv:2404.07143,

  26. [26]

    Neural Map: Structured Memory for Deep Reinforcement Learning

    Parisotto, E. and Salakhutdinov, R. Neural map: Struc- tured memory for deep reinforcement learning. Technical Report arXiv:1702.08360, arXiv,

  27. [27]

    Compressive Transformers for Long-Range Sequence Modelling

    Rae, J. W., Potapenko, A., Jayakumar, S. M., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. Technical Report arXiv:1911.05507, arXiv,

  28. [28]

    Fast Transformer Decoding: One Write-Head is All You Need

    Shazeer, N. Fast transformer decoding: One write-head is all you need.arXiv preprint arXiv:1911.02150,

  29. [29]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., and Hinton, G. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,

  30. [30]

    Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting

    Shi, X., Chen, Z., Wang, H., Yeung, D.-Y ., Wong, W.-k., and Woo, W.-c. Convolutional lstm network: A machine learning approach for precipitation nowcasting.arXiv preprint arXiv:1506.04214,

  31. [31]

    DeepVoxels: Learning Persistent 3D Feature Embeddings

    Sitzmann, V ., Thies, J., Heide, F., Nießner, M., Wetzstein, G., and Zollh¨ofer, M. Deepvoxels: Learning persistent 3d feature embeddings. Technical Report arXiv:1812.01024, arXiv,

  32. [32]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Soomro, K., Zamir, A. R., and Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402,

  33. [33]

    End-To-End Memory Networks

    Sukhbaatar, S., Weston, J., Fergus, R., et al. End-to-end memory networks.arXiv preprint arXiv:1503.08895,

  34. [34]

    Training data-efficient image trans- formers & distillation through attention.arXiv preprint arXiv:2012.12877,

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J ´egou, H. Training data-efficient image trans- formers & distillation through attention.arXiv preprint arXiv:2012.12877,

  35. [35]

    The caltech-ucsd birds-200-2011 dataset

    Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The caltech-ucsd birds-200-2011 dataset. Technical report, California Institute of Technology,

  36. [36]

    Linformer: Self-Attention with Linear Complexity

    Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768,

  37. [37]

    Efficient Streaming Language Models with Attention Sinks

    Xiao, G., Tian, Y ., Chen, B., Han, S., and Lewis, M. Ef- ficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,

  38. [38]

    Big Bird: Transformers for Longer Sequences

    Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Onta˜n´on, S., Pham, P., Ravula, A., Wang, Q., Yang, L., and Ahmed, A. Big bird: Transformers for longer sequences.arXiv preprint arXiv:2007.14062,

  39. [39]

    Neural slam: Learning to explore with external memory

    Zhang, J., Tai, L., Liu, M., Boedecker, J., and Burgard, W. Neural slam: Learning to explore with external memory. arXiv preprint arXiv:1706.09520,

  40. [40]

    Zou, Z. et al. M3: 3d-spatial multimodal memory.arXiv preprint arXiv:2503.16413,