arxiv: 2605.31033 · v1 · pith:YGC4TW5Qnew · submitted 2026-05-29 · 💻 cs.CV

SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation

Pith reviewed 2026-06-28 22:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming video generation · object-centric memory · key-value memory · video diffusion · long-form video synthesis · semantic slots · entity persistence · dynamic consistency

The pith

Decomposing the transformer's key-value manifold into discrete semantic slots enables entity-level persistence for streaming long-video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Streaming video generation models typically organize historical context as raw frames, chunk segments, or unclustered tokens, which leads to identity drift and semantic inconsistency when entities leave the frame or prompts change. SlotMemory addresses this by shifting the memory abstraction from temporal occurrence to semantic representation through decomposition of the key-value manifold into discrete reusable slots. These slots serve as routing addresses to index and store high-fidelity key-value tokens, supporting entity-level persistence and prompt-aware retrieval over long sequences. On 60-second interactive narratives with the Wan2.1-T2V-1.3B backbone, the method reaches a quality score of 81.61 and a 22.8 percent relative gain in dynamic consistency compared to prior streaming baselines. The work concludes that structured semantic representation, rather than raw temporal capacity, forms the essential primitive for persistent long-form video synthesis.

Core claim

By decomposing the transformer's key-value manifold into discrete, reusable semantic slots and utilizing these slots as routing addresses to index and store high-fidelity key-value tokens, SlotMemory enables entity-level persistence and prompt-aware retrieval across long horizons in streaming video diffusion, yielding state-of-the-art quality of 81.61 and a 22.8 percent relative improvement in dynamic consistency on 60-second interactive narratives.

What carries the argument

SlotMemory, the object-centric Key-Value memory mechanism that decomposes the transformer's key-value manifold into discrete reusable semantic slots used as routing addresses for token storage and retrieval.
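The idea of using slots as routing addresses can be illustrated with a toy store. This is a minimal sketch under stated assumptions, not the paper's implementation: the class name `SlotKVBank`, the fixed slot prototype vectors, and nearest-prototype routing by cosine similarity are all hypothetical choices made for illustration, since the summary above does not specify how slots are formed or matched.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SlotKVBank:
    """Toy slot-addressed key-value store (illustrative only)."""

    def __init__(self, slot_prototypes):
        # Assumption: slots are given as fixed prototype vectors.
        self.slots = slot_prototypes
        self.bank = defaultdict(list)  # slot id -> list of (key, value) pairs

    def route(self, vec):
        # The routing address is the index of the most similar slot prototype.
        return max(range(len(self.slots)), key=lambda i: cosine(vec, self.slots[i]))

    def write(self, key, value):
        # Store the token under the slot its key routes to.
        slot_id = self.route(key)
        self.bank[slot_id].append((key, value))
        return slot_id

    def read(self, query):
        # Retrieve only the tokens stored under the query's slot.
        return list(self.bank[self.route(query)])
```

In this sketch a token is retrieved because it shares a slot with the query, not because it was written recently, which is the contrast with temporal-centric memory that the summary draws.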

If this is right

  • Entity-level persistence holds across frames where objects exit and re-enter the scene.
  • Prompt-aware retrieval occurs automatically during interactive prompt transitions without extra mechanisms.
  • Dynamic consistency improves by 22.8 percent relative to the strongest existing streaming baseline.
  • Overall quality reaches 81.61 on 60-second narratives using the Wan2.1-T2V-1.3B backbone.
  • Structured semantic representation outperforms raw temporal capacity as the core requirement for long-form synthesis.
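For readers checking the arithmetic, a relative gain is conventionally computed as below. The symbols $s_{\text{method}}$ and $s_{\text{baseline}}$ are introduced here for illustration, and the page does not state the baseline's dynamic consistency score, so no absolute values are filled in. The reading assumes a metric where higher is better.

```latex
\text{relative gain} = \frac{s_{\text{method}} - s_{\text{baseline}}}{s_{\text{baseline}}},
\qquad
\text{relative gain} = 0.228 \;\Longleftrightarrow\; s_{\text{method}} = 1.228\, s_{\text{baseline}}
```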

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The slot decomposition may extend to other transformer-based generative tasks that require tracking distinct entities over time.
  • If slots remain stable under varying diffusion noise schedules, the approach could reduce the need for explicit object tracking modules in video pipelines.
  • Prompt transitions might become more reliable in multi-turn interactive settings because retrieval is indexed by semantic identity rather than recency.

Load-bearing premise

Decomposing the transformer's key-value manifold into discrete reusable semantic slots will produce entity-level persistence and prompt-aware retrieval without introducing new forms of inconsistency or requiring additional supervision.

What would settle it

If 60-second interactive video generations produced with SlotMemory still exhibit measurable identity drift or semantic inconsistency matching the levels of temporal-centric baselines, the claim that semantic slots suffice for entity-level persistence would be falsified.
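One way such a check could be run is sketched below. It is a hypothetical drift measure, not the paper's metric: it assumes per-frame subject embeddings are already available from some feature extractor (not specified here) and scores drift as one minus the mean cosine similarity to the first frame.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def identity_drift(embeddings):
    """Return 1 - mean cosine similarity of later frames to the first frame.

    `embeddings` is a list of per-frame subject vectors (an assumed input).
    0.0 means the subject embedding never changes direction; larger is worse.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two frames to measure drift")
    ref = embeddings[0]
    sims = [cosine(ref, e) for e in embeddings[1:]]
    return 1.0 - sum(sims) / len(sims)
```

Computing this score on the same prompts for SlotMemory and for a temporal-centric baseline would give the comparison the falsification test calls for: similar drift levels would count against the claim.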

Figures

Figures reproduced from arXiv: 2605.31033 by Hui Li, Jiahao Cui, Jingdong Wang, Lei Zhou, Siyu Zhu, Weijia Dou.

Figure 1: Teaser examples of long-form subject persistence generated by SlotMemory. Each row shows six keyframes from one sequence, covering scene transitions, action changes, and temporary occlusion. Across three different narratives (cat, rider, dog), the main subject identity and key visual anchors remain consistent over time.

Figure 2: Pipeline of SlotMemory in streaming long-video generation. For each chunk, the model retrieves slot-indexed long-term KV memory, denoises the current chunk using both local and retrieved context, writes new slot-conditioned memory items from current transformer states, and updates the memory bank under a fixed budget. At prompt switches, prompt-aware retrieval retains reusable entities while allowing the g…

Figure 3: Qualitative comparison on an interactive Mars mission script. Each row is one method and each column is a key moment in the six-stage narrative (surface exploration, rover approach, collaborative scan/drill, sample handling, and return toward base). Our method (bottom row) maintains more stable subject identity, astronaut-rover spatial relations, and action readability across the full sequence.

Figure 4: Qualitative ablation comparison on a multi-stage “birthday candle” narrative. Columns show keyframes at 12s, 24s, 36s, 48s, and 53s from the same prompt (dim kitchen, boy, cake/candle, and mother joining later). Rows correspond to progressive settings: w/o Slot Compression, + Slot Module, + Contrastive Loss, and + Reconstruction Loss. As components are added, object-count stability, character interaction c…

Figure 5: Segment-wise CLIP comparison across memory designs and bank sizes. We compare Frame Sink, NAM, and SlotMemory with different bank sizes (𝑏 ∈ {3, 6, 9, 12}). SlotMemory consistently outperforms Frame Sink and NAM across all 10-second segments, and larger banks improve mid/late-segment CLIP with diminishing returns beyond moderate capacity.

Figure 6: Failure Cases of SlotMemory: Attribute Leakage and Scene Regression Artifacts

Figure 7: Qualitative comparison on an interactive script

Figure 8: More qualitative results on an interactive script
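
The per-chunk cycle described in the Figure 2 caption (retrieve, denoise, write, update under a fixed budget) can be sketched as a loop. This is an illustrative skeleton only: `route` and `denoise` are caller-supplied stand-ins rather than the paper's components, and the eviction rule (drop the oldest token from the currently largest slot) is an assumption, since the caption does not say how the budget is enforced.

```python
from collections import defaultdict, deque


def stream_chunks(chunks, route, denoise, budget):
    """Run a retrieve -> denoise -> write -> update cycle over a list of chunks.

    route(token) -> slot id and denoise(chunk, context) -> list of tokens are
    hypothetical callables supplied by the caller.
    """
    bank = defaultdict(deque)  # slot id -> stored tokens, oldest first
    outputs = []
    for chunk in chunks:
        # 1. Retrieve: gather stored tokens from the slots this chunk touches.
        slots = {route(tok) for tok in chunk}
        context = [t for s in sorted(slots) for t in bank[s]]
        # 2. Denoise the current chunk with local plus retrieved context.
        new_tokens = denoise(chunk, context)
        outputs.append(new_tokens)
        # 3. Write: file each new token under its slot.
        for tok in new_tokens:
            bank[route(tok)].append(tok)
        # 4. Update: enforce the fixed budget (assumed policy: drop the oldest
        #    token from the currently largest slot until the bank fits).
        while sum(len(q) for q in bank.values()) > budget:
            largest = max(bank, key=lambda s: len(bank[s]))
            bank[largest].popleft()
    return outputs, {s: list(q) for s, q in bank.items() if q}
```

Evicting from the largest slot keeps small slots alive, so an entity that has been off screen for a while is not pushed out merely because newer tokens arrived. Whether SlotMemory uses a comparable policy is not stated on this page.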
Original abstract

Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic inconsistency when entities exit the frame or during interactive prompt transitions. To address these limitations, we propose SlotMemory, an object-centric Key-Value memory mechanism for streaming video diffusion. Our approach shifts the memory abstraction from "when" an event occurred to "what" is being represented by decomposing the transformer's key-value manifold into discrete, reusable semantic slots. By utilizing these slots as routing addresses to index and store high-fidelity key-value tokens, we enable entity-level persistence and prompt-aware retrieval across long horizons. Evaluated on 60-second interactive narratives using the Wan2.1-T2V-1.3B backbone, SlotMemory achieves a state-of-the-art quality score of 81.61 and a 22.8 percent relative improvement in dynamic consistency over the strongest existing streaming baseline. Our results demonstrate that structured semantic representation, rather than raw temporal capacity, is the essential primitive for persistent long-form video synthesis. Our codes and checkpoints are available at https://tj12323.github.io/SlotMemory/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SlotMemory, an object-centric KV memory for streaming video diffusion models. It decomposes the transformer's key-value manifold into discrete reusable semantic slots to enable entity-level persistence and prompt-aware retrieval, addressing identity drift in long videos. Using the Wan2.1-T2V-1.3B backbone on 60-second interactive narratives, it reports a quality score of 81.61 and a 22.8% relative improvement in dynamic consistency over the strongest streaming baseline, concluding that structured semantic representation, rather than raw temporal capacity, is essential for persistent long-form video synthesis. Code and checkpoints are released.

Significance. If the quantitative claims hold under rigorous controls, the work provides evidence that shifting memory abstraction from temporal to semantic slots can improve consistency in streaming generation without extra supervision. The public release of code and checkpoints strengthens reproducibility and allows direct verification of the slot routing and storage mechanisms.

major comments (2)
  1. [Experiments] Experiments section: the reported 22.8% dynamic consistency gain and 81.61 quality score lack a complete description of the evaluation protocol, including exact metric definitions, baseline implementations, number of runs, and statistical significance testing. Without these, it is impossible to determine whether the gains arise from the slot decomposition itself or from unstated differences in capacity, auxiliary losses, or post-processing.
  2. [Method] Method section: the slot formation process, routing addresses, and training objective are described at a high level but do not specify whether slot assignment relies on any learned components beyond the base diffusion loss. This leaves open the possibility that the comparison to temporal-centric baselines is confounded by implicit supervision or capacity increases.
minor comments (1)
  1. [Abstract] The abstract states 'state-of-the-art quality score of 81.61' without naming the underlying metric or its range.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript's clarity and rigor.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the reported 22.8% dynamic consistency gain and 81.61 quality score lack a complete description of the evaluation protocol, including exact metric definitions, baseline implementations, number of runs, and statistical significance testing. Without these, it is impossible to determine whether the gains arise from the slot decomposition itself or from unstated differences in capacity, auxiliary losses, or post-processing.

    Authors: We agree that the evaluation protocol requires fuller specification. In the revised manuscript we will expand the Experiments section with: (i) precise definitions and computation details for the quality score and dynamic consistency metric, (ii) exact baseline implementations including any capacity or auxiliary-loss matching, (iii) the number of independent runs (five), and (iv) statistical significance results (paired t-tests with p-values). These additions will confirm that reported gains derive from the slot decomposition rather than confounding factors. revision: yes

  2. Referee: [Method] Method section: the slot formation process, routing addresses, and training objective are described at a high level but do not specify whether slot assignment relies on any learned components beyond the base diffusion loss. This leaves open the possibility that the comparison to temporal-centric baselines is confounded by implicit supervision or capacity increases.

    Authors: Slot assignment and routing in SlotMemory emerge directly from the base diffusion loss and the transformer's attention mechanism; no auxiliary learned components or supervision are introduced. To remove ambiguity we will revise the Method section with explicit statements, additional equations, and pseudocode confirming that slot formation uses only the standard diffusion objective and that baseline comparisons control for capacity. This will demonstrate the absence of confounding factors. revision: yes
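
The paired t-test over five runs mentioned in the first response could be computed along the following lines. This is a generic sketch using only the Python standard library, not the authors' evaluation code. The function name is hypothetical and no scores from the paper are assumed. The constant 2.776 is the standard two-sided 5% critical value of Student's t with 4 degrees of freedom, which is the relevant threshold for five pairs.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i] (e.g. same seed for both methods)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Two-sided 5% critical value of Student's t with 4 degrees of freedom (n = 5 pairs).
T_CRIT_DF4 = 2.776


def significant_at_5_percent(a, b):
    """True if |t| exceeds the df = 4 critical value; only valid for 5 pairs."""
    if len(a) != 5:
        raise ValueError("T_CRIT_DF4 applies to exactly 5 pairs")
    return abs(paired_t_statistic(a, b)) > T_CRIT_DF4
```

Pairing matters here: each run of the method should be matched with the baseline run that shares its seed and prompt set, so that the test compares differences rather than two independent pools of scores.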

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The provided abstract and context describe a proposed architectural mechanism (object-centric KV slots for streaming video) and report empirical metrics on a backbone model, but contain no equations, fitted parameters presented as predictions, self-citations invoked as load-bearing uniqueness theorems, or ansatzes smuggled via prior work. No derivation chain is exhibited that reduces any claimed result to its inputs by construction. The central claim rests on the empirical evaluation rather than a self-referential definition or renamed known result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that semantic slot decomposition is feasible and beneficial inside existing diffusion transformers.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DIM-WAM: World-Action Modeling with Diverse Historical Event Memory

    cs.RO 2026-06 unverdicted novelty 6.0

    DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.
