
arxiv: 2605.30519 · v1 · pith:47CORIUGnew · submitted 2026-05-28 · 💻 cs.CV

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

Pith reviewed 2026-06-29 07:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video generation · autoregressive video models · KV cache retrieval · sparse attention · memory efficiency · video synthesis · dynamic degree · temporal consistency

The pith

OmniMem performs sparse retrieval over the full historical KV cache to generate longer videos without the detail loss from truncation or compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive video generation builds videos chunk by chunk but must repeatedly consult an expanding record of past computations stored in a KV cache. Existing solutions either discard older entries or fold them into a compressed form, both of which remove explicit access to details that may matter later. OmniMem keeps the entire cache and instead selects only a sparse subset of relevant past entries for each new chunk. Three targeted mechanisms counter the tendency of sparse selection to favor recent blocks and to produce overly large memory buffers. On long-video benchmarks the approach raises measured dynamic degree by 52.3 percent relative to strong baselines while holding consistency and memory footprint comparable.

Core claim

OmniMem is an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. Adaptive Window Exclusion removes local-window blocks from selection candidates once sufficient long-range history exists. Query-Shared KV Selection reduces cross-query diversity. Per-Head Scattered KV Access lets each attention head retrieve non-contiguous KV blocks according to its own pattern, avoiding union explosion in the selected buffer.
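
To make the first mechanism concrete, here is a minimal Python sketch of how window exclusion could interact with Top-K block selection. It is an illustration, not the authors' implementation: the function name `select_blocks`, the `min_history` threshold, and the single scalar score per block are all assumptions, since the summary above does not say how scores or the exclusion condition are computed.

```python
def select_blocks(scores, window, k, min_history):
    """Pick up to k historical KV block indices by score.

    scores:      one relevance score per past block, oldest first (assumed input).
    window:      number of most recent blocks already covered by the local window.
    min_history: hypothetical threshold; local-window blocks are excluded from the
                 candidates only when at least this many older blocks exist.
    """
    n = len(scores)
    long_range = max(n - window, 0)          # blocks older than the local window
    if long_range >= min_history:
        candidates = range(long_range)       # exclusion active: drop local-window blocks
    else:
        candidates = range(n)                # too little history: keep every block
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy scores in which the two most recent blocks score highest (local bias).
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.95, 0.99]
with_exclusion = select_blocks(scores, window=2, k=2, min_history=4)
without_exclusion = select_blocks(scores, window=2, k=2, min_history=10)
```

In this toy input the plain Top-K spends its whole budget on blocks 5 and 6, which the local window already covers, while the variant with exclusion returns blocks 1 and 4 from the long-range history.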

What carries the argument

Sparse KV retrieval over the full historical cache, implemented through Adaptive Window Exclusion, Query-Shared KV Selection, and Per-Head Scattered KV Access.
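
The effect of sharing one selection across a group of queries can be sketched as follows. The pooling rule (a plain mean of per-query scores) and the names `shared_topk` and `independent_union` are assumptions for illustration only; the source says only that Query-Shared KV Selection reduces cross-query diversity.

```python
def independent_union(group_scores, k):
    """Each query picks its own top-k blocks; return the union of all picks."""
    chosen = set()
    for q in group_scores:
        ranked = sorted(range(len(q)), key=lambda b: q[b], reverse=True)
        chosen.update(ranked[:k])
    return sorted(chosen)


def shared_topk(group_scores, k):
    """Pool scores over the query group (assumed: mean) and pick one shared top-k."""
    n_blocks = len(group_scores[0])
    pooled = [sum(q[b] for q in group_scores) / len(group_scores)
              for b in range(n_blocks)]
    ranked = sorted(range(n_blocks), key=lambda b: pooled[b], reverse=True)
    return sorted(ranked[:k])


# Three queries scoring five historical blocks (made-up numbers).
group_scores = [
    [0.9, 0.1, 0.8, 0.2, 0.0],
    [0.1, 0.9, 0.7, 0.0, 0.2],
    [0.2, 0.0, 0.6, 0.9, 0.1],
]
separate = independent_union(group_scores, k=2)
shared = shared_topk(group_scores, k=2)
```

With independent picks the three queries together touch four distinct blocks, whereas the shared selection touches exactly k = 2, which is the kind of reduction in cross-query diversity the text describes.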

If this is right

  • Longer video sequences become feasible at a fixed memory budget because the full explicit history remains available.
  • Dynamic degree improves by 52.3 percent while consistency metrics stay strong.
  • Memory usage remains comparable to truncation or compression baselines.
  • Each attention head can follow its own non-contiguous retrieval pattern without expanding the selected buffer size.
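
The last point can be made concrete with a small counting sketch. It assumes a simplified model in which a naive implementation materializes the union of every head's selected blocks into one buffer, while scattered access lets each head gather only its own blocks by index. The names `union_buffer_blocks` and `scattered_gather` and the toy cache layout are hypothetical.

```python
def union_buffer_blocks(per_head):
    """Number of blocks a shared buffer must hold to cover every head's selection."""
    return len(set().union(*per_head))


def scattered_gather(kv_cache, per_head):
    """Gather, for each head, only the blocks that head selected.

    kv_cache[h][b] is the payload of block b for head h (toy stand-in for KV tensors).
    """
    return [[kv_cache[h][b] for b in blocks] for h, blocks in enumerate(per_head)]


# Four heads, each selecting k = 2 non-contiguous blocks with no overlap.
per_head = [[0, 5], [1, 6], [2, 7], [3, 8]]
kv_cache = [[f"h{h}b{b}" for b in range(9)] for h in range(4)]

union_size = union_buffer_blocks(per_head)
gathered = scattered_gather(kv_cache, per_head)
per_head_sizes = [len(g) for g in gathered]
```

Here the union buffer grows to 8 blocks even though every head needs only 2, and the gap widens as heads disagree more. Under the same assumptions, per-head gathering keeps each head's working set at k blocks.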

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse-retrieval pattern could be tested on autoregressive models for long text or audio to check whether explicit memory access outperforms compression there as well.
  • If retrieval accuracy holds, training runs could avoid the need to lengthen context windows solely to capture distant dependencies.
  • Per-head scattered access suggests that hardware kernels optimized for irregular sparse loads may become performance-critical for scaling this style of generation.

Load-bearing premise

The three sparse-selection techniques can reliably locate and fetch the query-relevant historical details that truncation or compression would otherwise discard.

What would settle it

A controlled video sequence in which an early event required for later consistency is never selected by the retrieval mechanism, producing measurable drops in temporal coherence or dynamic degree.
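
A check of this kind could be scripted against logged selections. The sketch below assumes a hypothetical log format (one list of selected block indices per generation step) and hypothetical helper names; it only measures whether a given early block was ever retrieved, not whether the generated video stays coherent.

```python
def hit_rate(selection_log, target_block):
    """Fraction of generation steps whose selection includes target_block."""
    if not selection_log:
        return 0.0
    hits = sum(1 for step in selection_log if target_block in step)
    return hits / len(selection_log)


def never_retrieved(selection_log, target_block):
    """True when target_block is absent from every step's selection."""
    return all(target_block not in step for step in selection_log)


# Made-up log: block indices selected at three consecutive chunks.
selection_log = [[0, 1], [1, 3], [2, 3]]
rate_block_3 = hit_rate(selection_log, 3)
missed_block_4 = never_retrieved(selection_log, 4)
```

A block that carries the early event and for which `never_retrieved` holds across the whole run would be the failure case described above, provided the coherence metrics drop on the same sequence.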

Figures

Figures reproduced from arXiv: 2605.30519 by Lin Zhao, Pu Zhao, Yanzhi Wang, Yifan Gong, Yushu Wu.

Figure 1: OmniMem preserves object identity while maintaining rich motion in long video generation. SWA shows object drift, and Sink-SWA produces repetitive motion.
Figure 2: Overview of OmniMem. Current-chunk queries attend to recent KV blocks, pooled historical KV blocks, and retrieved full-resolution KV blocks through sliding-window, compression, and selection attention, respectively. The right panels summarize the key retrieval and access designs: filtering near-window candidates before Top-K selection, sharing Top-K selection within query groups, and accessing per-head blo…
Figure 3: Local bias and Union Explosion in selection attention. (a) Top-K selection focuses near the current chunk without AWE, and shifts to long-range blocks with AWE. (b) Different query chunks in one head select different blocks. (c) Different heads also select different regions. (b) and (c) together cause Union Explosion. Note that each chunk contains a number of tokens (e.g., 4-5K).
Figure 4: Qualitative comparison on long-video generation. Red boxes highlight repetitive frames where LongLive [18] collapses back to early content. Full videos and additional results are provided in the supplementary material.
Figure 5: Memory access scalability. Naive Sparse reduces latency but loses the memory benefit due to Union Explosion. OmniMem maintains memory usage nearly constant while remaining efficient.

… common KV selection. Sharing the selection across all 12 heads significantly degrades all metrics, and even a moderate size of G_h = 3 still leaves a clear gap to per-head selection. This indicates that different attention head…
Original abstract

Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% and preserves strong consistency over strong baselines, while maintaining comparable memory usage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes OmniMem, an explicit full-range memory retrieval framework for autoregressive chunk-based long video generation. It introduces three sparse KV retrieval mechanisms—Adaptive Window Exclusion (to counter local bias when long-range history is available), Query-Shared KV Selection (to reduce cross-query diversity), and Per-Head Scattered KV Access (to avoid union explosion by allowing per-head non-contiguous block selection)—that together enable query-relevant historical KV access without truncation or compression. Experiments report a 52.3% gain in Dynamic Degree over strong baselines while preserving consistency and comparable memory usage.

Significance. If the empirical results and the claim that the three mechanisms recover relevant historical details without information loss hold under scrutiny, the work would meaningfully advance scalable AR video generation by retaining explicit long-range access at manageable cost. This addresses a core scaling bottleneck and could influence subsequent memory-efficient video and multimodal generation systems.

major comments (2)
  1. [Abstract] The central claim of a 52.3% Dynamic Degree improvement is presented without any description of the baseline methods, dataset, number of videos or frames evaluated, variance across runs, or statistical significance; this single quantitative result is load-bearing for the paper's contribution and cannot be assessed from the given information.
  2. [Abstract / Method] The manuscript's core assumption—that Adaptive Window Exclusion, Query-Shared KV Selection, and Per-Head Scattered KV Access together surface query-relevant historical KV without the loss incurred by truncation or compression—requires explicit supporting evidence (e.g., ablation tables isolating each component, attention visualizations, or retrieval-precision metrics) to be load-bearing; the abstract alone does not supply this verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for stronger verification of our core claims. We address each major comment below and will revise the manuscript to improve self-containment and evidence presentation.

  1. Referee: [Abstract] The central claim of a 52.3% Dynamic Degree improvement is presented without any description of the baseline methods, dataset, number of videos or frames evaluated, variance across runs, or statistical significance; this single quantitative result is load-bearing for the paper's contribution and cannot be assessed from the given information.

    Authors: We agree the abstract should be more self-contained for the load-bearing quantitative claim. The experimental section details the baselines (strong AR video generation methods with KV cache management), the long-video dataset, evaluation scale (multiple videos with extended frame counts), and reports averaged results with variance. In revision we will expand the abstract to concisely include these elements (baselines, dataset, scale, and note on averaging/variance) while preserving length limits. revision: yes

  2. Referee: [Abstract / Method] The manuscript's core assumption—that Adaptive Window Exclusion, Query-Shared KV Selection, and Per-Head Scattered KV Access together surface query-relevant historical KV without the loss incurred by truncation or compression—requires explicit supporting evidence (e.g., ablation tables isolating each component, attention visualizations, or retrieval-precision metrics) to be load-bearing; the abstract alone does not supply this verification.

    Authors: The overall experimental results (52.3% Dynamic Degree gain with preserved consistency and comparable memory) provide empirical support for the combined mechanisms recovering relevant history without truncation/compression loss. Component-wise ablations, attention visualizations, and retrieval analysis appear in the experiments and supplementary sections. To make this verification more explicit and tied to the abstract claim, we will add or highlight ablation tables isolating each mechanism, plus attention/retrieval-precision figures in the main paper during revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The manuscript describes an explicit sparse-retrieval framework (Adaptive Window Exclusion, Query-Shared KV Selection, Per-Head Scattered KV Access) whose design choices are stated directly and whose performance is reported solely as empirical outcomes on long-video generation benchmarks. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the supplied text. The central claim therefore rests on observable experimental deltas rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities identifiable. No equations or modeling assumptions stated beyond the high-level problem description.

pith-pipeline@v0.9.1-grok · 5738 in / 1005 out tokens · 25834 ms · 2026-06-29T07:54:59.348268+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 28 canonical work pages · 17 internal anchors

  1. [1]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  2. [2]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  3. [3]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

  4. [4]

    Fastcar: Cache attentive replay for fast auto-regressive video generation on the edge

    Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Pu Zhao, Jun Lin, and Jiuxiang Gu. Fastcar: Cache attentive replay for fast auto-regressive video generation on the edge. In The Fourteenth International Conference on Learning Representations, 2026

  5. [5]

    Draftattention: Fast video diffusion via low-resolution attention guidance

    Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. Draftattention: Fast video diffusion via low-resolution attention guidance. arXiv preprint arXiv:2505.14708, 2025

  6. [6]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  7. [7]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  8. [8]

    S2dit: Sandwich diffusion transformer for mobile streaming video generation

    Lin Zhao, Yushu Wu, Aleksei Lebedev, Dishani Lahiri, Meng Dong, Arpit Sahni, Michael Vasilkovsky, Hao Chen, Ju Hu, Aliaksandr Siarohin, et al. S2dit: Sandwich diffusion transformer for mobile streaming video generation. arXiv preprint arXiv:2601.12719, 2026

  9. [9]

    Loong: Generating minute-level long videos with autoregressive language models

    Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024

  10. [11]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

  11. [12]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025

  12. [13]

    Gamegen-x: Interactive open-world game video generation

    Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. In International Conference on Learning Representations, 2025

  13. [14]

    Causal diffusion transformers for generative modeling

    Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, and Haoqi Fan. Causal diffusion transformers for generative modeling. arXiv preprint arXiv:2412.12095, 2024

  14. [15]

    Taming diffusion transformer for efficient mobile video generation in seconds

    Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, et al. Taming diffusion transformer for efficient mobile video generation in seconds. arXiv preprint arXiv:2507.13343, 2025

  15. [16]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

  16. [17]

    Genie: Generative interactive environments

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Si...

  17. [18]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

  18. [19]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

  19. [20]

    Memflow: Flowing adaptive memory for consistent and efficient long video narratives

    Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699, 2025

  20. [21]

    Videossm: Autoregressive long video generation with hybrid state-space memory

    Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, et al. Videossm: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025

  21. [22]

    TinyHistory: Lightweight Video History Embeddings via Two-Stage Context Learning

    Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression. arXiv preprint arXiv:2512.23851, 2025

  22. [23]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025

  23. [24]

    ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W

    Sand. ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan, Jie Wang, Jiexin Sheng, Min Cui, Min Hu, Ming Yan, Shuchen...

  24. [25]

    Skyreels-v2: Infinite-length film generative model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model, 2025

  25. [26]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

  26. [27]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

  27. [28]

    Mode seeking meets mean seeking for fast long video generation

    Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, and Arash Vahdat. Mode seeking meets mean seeking for fast long video generation. In arXiv, 2026

  28. [29]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  29. [30]

    SnapKV: LLM Knows What You are Looking for Before Generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024

  30. [31]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37, 2024

  31. [32]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  32. [33]

    Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

    Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant videogen: Auto-regressive long video generation via 2-bit kv-cache quantization. arXiv preprint arXiv:2602.02958, 2026

  33. [34]

    BIFE: Better Interaction, Fewer Errors for Minute-Long Video Generation

    Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, and Bohan Zhuang. Blockvid: Block diffusion for high-quality and consistent minute-long video generation. arXiv preprint arXiv:2511.22973, 2025

  34. [35]

    Efficient autoregressive video diffusion with dummy head

    Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, and Yan Lu. Efficient autoregressive video diffusion with dummy head. arXiv preprint arXiv:2601.20499, 2026

  35. [36]

    Past- and future-informed kv cache policy with salience estimation in autoregressive video diffusion

    Xu Yang et al. Past- and future-informed kv cache policy with salience estimation in autoregressive video diffusion. arXiv preprint arXiv:2601.21896, 2026

  36. [37]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  37. [38]

    Wan: Open and advanced large-scale video generative models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  38. [39]

    Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models

    Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems, 37:65618–65642, 2024

  39. [40]

    VBench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelli...

  40. [41]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  41. [42]

    Autoregressive Video Generation without Vector Quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024

  42. [43]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

Limitation and Broader Impact

OmniMem is evaluated on a single open-sourced DiT backbone, Wan2.1-T2V-1.3B, aligned with recent works. This controlled setting helps isolate the effect ...