
arxiv: 2605.18733 · v1 · pith:ZZB3NV3Wnew · submitted 2026-05-18 · 💻 cs.CV

Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

Pith reviewed 2026-05-20 11:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords narrative video generation · identity consistency · training-free framework · entity tracking · long video synthesis · LLM entity extraction · VLM verification · video benchmark

The pith

A training-free memory system assigns global IDs to story entities and verifies them via vision models to prevent identity drift in long videos with shifting prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets long-term inconsistency in autoregressive video generation, where evolving prompts cause identity drift, character duplication, and attribute loss. It proposes IAMFlow, a training-free framework in which an LLM extracts entities from each prompt and assigns them unique global IDs, and a VLM asynchronously verifies their attributes on generated frames. This explicit identity tracking replaces reliance on implicit attention signals or frame compression. The work also presents NarraStream-Bench, a new evaluation set of 324 multi-prompt scripts, and keeps inference fast through asynchronous verification, adaptive prompt transition, and model quantization. Experiments report the best overall score on NarraStream-Bench, 2.56 points above the strongest baseline, with a 1.39× speedup over the most efficient baseline in the 60-second multi-prompt setting.

Core claim

IAMFlow explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. An LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching.

What carries the argument

IAMFlow's identity-aware memory, which stores each entity under a unique global ID assigned by an LLM and refines its attributes through VLM verification on rendered frames.
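A minimal sketch of what such a memory could look like, assuming each mention arrives as a name plus a list of visual attributes. The class names, the `resolve` and `refine` methods, and the attribute-overlap rule are illustrative assumptions, not the paper's implementation: the paper delegates matching to an LLM and refinement to a VLM, and both are replaced here by plain set operations so the example runs offline.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    """One persistent character tracked under a global ID."""
    gid: int
    aliases: set[str] = field(default_factory=set)
    attributes: set[str] = field(default_factory=set)


class IdentityMemory:
    """Hypothetical registry that maps mentions to global IDs."""

    def __init__(self) -> None:
        self.entities: dict[int, Entity] = {}
        self._next_gid = 0

    def _match(self, attributes: set[str]) -> int | None:
        # Stand-in for the LLM matching step: pick the stored entity that
        # shares the most attributes, or None when nothing overlaps.
        best_gid, best_overlap = None, 0
        for gid, ent in self.entities.items():
            overlap = len(ent.attributes & attributes)
            if overlap > best_overlap:
                best_gid, best_overlap = gid, overlap
        return best_gid

    def resolve(self, name: str, attributes: list[str]) -> int:
        """Return the global ID for a mention, creating one if needed."""
        attrs = set(attributes)
        gid = self._match(attrs)
        if gid is None:
            gid = self._next_gid
            self._next_gid += 1
            self.entities[gid] = Entity(gid)
        ent = self.entities[gid]
        ent.aliases.add(name)
        ent.attributes |= attrs
        return gid

    def refine(self, gid: int, observed: list[str]) -> None:
        # Stand-in for VLM verification: merge attributes read off a frame.
        self.entities[gid].attributes |= set(observed)


memory = IdentityMemory()
first = memory.resolve("woman", ["red coat", "short black hair"])
second = memory.resolve("protagonist", ["red coat"])
third = memory.resolve("man", ["grey suit"])
memory.refine(first, ["silver earrings"])
```

Here "woman" and "protagonist" resolve to the same ID because they share an attribute, while "man" receives a new one. A real matcher would also need to handle conflicting attributes, which this overlap rule ignores.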

If this is right

  • Outperforms the strongest baseline by 2.56 points on NarraStream-Bench overall.
  • Achieves a 1.39× speedup over the most efficient baseline in the 60-second multi-prompt setting.
  • Handles shifting entity references in evolving prompts without identity drift or attribute loss.
  • Keeps generation practical via asynchronous visual verification, adaptive prompt transition, and model quantization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The ID-assignment approach could transfer to other time-series generation tasks such as consistent character animation or multi-shot image sequences.
  • Removing the need for domain-specific training allows the technique to pair with any improving base video model without retraining.
  • Scaling tests on scripts with many simultaneous entities or durations beyond one minute would reveal whether verification errors grow with complexity.

Load-bearing premise

The approach assumes that off-the-shelf LLMs can reliably extract entities with visual attributes from evolving prompts and assign unique global IDs, and that VLMs can asynchronously verify and refine those attributes from rendered frames without accumulating errors that break long-term consistency.

What would settle it

Generate a 60-second sequence from a script that repeatedly refers to the same character with new names and conflicting descriptions; check whether the output frames maintain one consistent appearance or show duplicated or switched identities.
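One way to score the bookkeeping half of such a probe is sketched below, assuming the system under test exposes the global ID it assigned to each mention. The script, the hand-written `truth` labels, and the `identity_errors` helper are all hypothetical. The check covers ID assignment only; whether the rendered frames actually look consistent still needs visual inspection.

```python
from collections import defaultdict

# Hypothetical probe script: one character referred to under changing names.
# "truth" is a hand-written label that the system under test never sees.
script = [
    {"prompt": "A woman in a red coat enters a cafe.", "truth": "A"},
    {"prompt": "Clara orders coffee at the counter.", "truth": "A"},
    {"prompt": "The protagonist, now in a blue coat, sits down.", "truth": "A"},
    {"prompt": "Another woman waves from the door.", "truth": "B"},
]


def identity_errors(truth_labels, assigned_ids):
    """Count duplicated and merged identities in a list of ID assignments."""
    ids_per_truth = defaultdict(set)
    truths_per_id = defaultdict(set)
    for truth, gid in zip(truth_labels, assigned_ids):
        ids_per_truth[truth].add(gid)
        truths_per_id[gid].add(truth)
    # Duplicated: one true character was split across several IDs.
    duplicated = sum(1 for ids in ids_per_truth.values() if len(ids) > 1)
    # Merged: one ID was shared by several true characters.
    merged = sum(1 for truths in truths_per_id.values() if len(truths) > 1)
    return {"duplicated": duplicated, "merged": merged}


truth_labels = [step["truth"] for step in script]
clean = identity_errors(truth_labels, [0, 0, 0, 1])
drifted = identity_errors(truth_labels, [0, 0, 2, 0])
```

The first assignment keeps one ID per character and yields no errors. The second splits character A across two IDs and folds character B into A's ID, so it counts one duplication and one merge.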

Original abstract

Autoregressive video generation has improved rapidly in visual fidelity and interactivity, but it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames using predefined strategies or retrieve keyframes based on coarse implicit attention signals, both of which fail to handle evolving prompts with shifting entity references, leading to identity drift, character duplication, and attribute loss. To address this, we propose IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. Specifically, an LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. To keep the proposed framework computationally practical, we design a systematic inference acceleration pipeline, including asynchronous visual verification, adaptive prompt transition, and model quantization, which achieves faster generation than existing baselines. Furthermore, we introduce NarraStream-Bench, a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates both traditional metrics and multimodal large language model-based assessments. Extensive experiments show that IAMFlow, despite being training-free, achieves the best overall performance on NarraStream-Bench, outperforming the strongest baseline by 2.56 points, while achieving a 1.39× speedup over the most efficient baseline in the 60-second multi-prompt setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes IAMFlow, a training-free identity-aware memory framework for autoregressive narrative long video generation. It uses an LLM to extract entities with visual attributes from evolving multi-prompt scripts and assign unique global IDs, paired with asynchronous VLM verification and refinement of attributes from rendered frames to maintain long-term identity consistency. The framework incorporates an inference acceleration pipeline (asynchronous verification, adaptive prompt transition, quantization) and introduces NarraStream-Bench, a new benchmark with 324 multi-prompt scripts across six dimensions and a three-dimensional evaluation protocol. Experiments claim IAMFlow achieves the best overall score on the benchmark (outperforming the strongest baseline by 2.56 points) and a 1.39× speedup in the 60-second multi-prompt setting.

Significance. If the central claims hold under rigorous validation, this work offers a practical training-free alternative to implicit attention or keyframe compression for mitigating identity drift, duplication, and attribute loss in long video generation. The explicit entity tracking via off-the-shelf LLMs/VLMs and the new NarraStream-Bench with multimodal LLM-based assessment represent potentially useful contributions to the field of controllable video synthesis.

major comments (3)
  1. Methods (entity extraction and ID assignment): The central claim that explicit LLM-based global ID assignment and VLM refinement replace implicit similarity matching to prevent drift relies on the untested assumption that off-the-shelf LLMs reliably handle shifting entity references across prompt transitions without duplication or misassignment. No error-rate measurements, failure-case analysis, or ablation on prompt ambiguity are reported, making it impossible to attribute the 2.56-point gain specifically to the identity-aware memory.
  2. Experiments and evaluation protocol: The reported 2.56-point overall improvement and 1.39× speedup on NarraStream-Bench are presented without per-dimension breakdowns, statistical significance tests, or analysis of cases where entity references evolve rapidly in the 60-second setting. This leaves open whether the gains are robust or concentrated in easier subsets of the 324 scripts.
  3. Acceleration pipeline: The description of how asynchronous VLM verification, adaptive prompt transition, and quantization interact without introducing additional latency or consistency errors is high-level; a concrete timing breakdown or pseudocode would be needed to substantiate the speedup claim as load-bearing for practicality.
minor comments (2)
  1. The abstract and introduction could more clearly distinguish the proposed explicit tracking from prior keyframe-retrieval or memory-compression baselines with a direct comparison table.
  2. Notation for global IDs and attribute refinement steps would benefit from a small diagram or algorithm box to improve readability.
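The significance analysis requested in major comment 2 could start from something as simple as a paired t statistic over per-script scores. The sketch below assumes both methods are scored on the same scripts. The `paired_t` helper and the five scores are made up for illustration and are not results from the paper.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (mean difference, t statistic, degrees of freedom)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    # This is zero, and the division below fails, if all differences are equal.
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err, len(diffs) - 1


# Made-up scores for five scripts; NOT results from the paper.
method = [71.0, 68.0, 75.0, 70.0, 66.0]
baseline = [69.0, 67.0, 71.0, 69.0, 64.0]
mean_diff, t_stat, dof = paired_t(method, baseline)
```

The t statistic would then be compared against a Student t distribution with `dof` degrees of freedom. Running the same computation within each benchmark dimension would show whether the gain is spread evenly or concentrated in a few subsets.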

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: Methods (entity extraction and ID assignment): The central claim that explicit LLM-based global ID assignment and VLM refinement replace implicit similarity matching to prevent drift relies on the untested assumption that off-the-shelf LLMs reliably handle shifting entity references across prompt transitions without duplication or misassignment. No error-rate measurements, failure-case analysis, or ablation on prompt ambiguity are reported, making it impossible to attribute the 2.56-point gain specifically to the identity-aware memory.

    Authors: We agree that additional analysis would help substantiate the reliability of the LLM-based entity extraction and ID assignment. While the VLM verification step is designed to detect and correct misassignments or duplications by refining attributes from rendered frames, we did not report quantitative error rates or specific ablations on prompt ambiguity in the original submission. The performance gains on NarraStream-Bench, which features scripts with shifting references across its six dimensions, provide indirect evidence of the framework's effectiveness. In the revision, we will include failure-case analysis and an ablation study examining the impact of prompt ambiguity on ID assignment accuracy. revision: yes

  2. Referee: Experiments and evaluation protocol: The reported 2.56-point overall improvement and 1.39× speedup on NarraStream-Bench are presented without per-dimension breakdowns, statistical significance tests, or analysis of cases where entity references evolve rapidly in the 60-second setting. This leaves open whether the gains are robust or concentrated in easier subsets of the 324 scripts.

    Authors: We acknowledge the value of per-dimension breakdowns and statistical tests for demonstrating robustness. The benchmark was constructed to include a variety of narrative complexities, including rapid entity evolution in the multi-prompt setting. However, the original manuscript focused on overall scores. We will revise to include per-dimension results, statistical significance analysis (e.g., paired t-tests or similar), and a dedicated discussion of performance on subsets with rapidly evolving references. revision: yes

  3. Referee: Acceleration pipeline: The description of how asynchronous VLM verification, adaptive prompt transition, and quantization interact without introducing additional latency or consistency errors is high-level; a concrete timing breakdown or pseudocode would be needed to substantiate the speedup claim as load-bearing for practicality.

    Authors: We will enhance the description of the inference acceleration pipeline in the revised manuscript. Specifically, we will add a timing breakdown table showing the latency contributions of each component (asynchronous verification, adaptive transitions, and quantization) and include pseudocode illustrating their integration to ensure no additional consistency errors are introduced while achieving the reported speedup. revision: yes
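As a rough illustration of the scheduling question raised in point 3, the sketch below overlaps verification of one chunk with generation of the next using a background thread. It assumes chunk-wise generation and a one-chunk verification lag. `generate_chunk` and `verify_chunk` are hypothetical placeholders for the video model and the VLM, and the paper's actual scheduling is not specified in this summary.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_chunk(index, memory):
    # Placeholder for the video model: records which attributes it could see.
    return {"index": index, "seen_attributes": dict(memory)}


def verify_chunk(frame):
    # Placeholder for the VLM call: returns attribute updates for the memory.
    return {"entity_0": f"verified@{frame['index']}"}


def run(num_chunks):
    memory = {}
    frames = []
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for index in range(num_chunks):
            frame = generate_chunk(index, memory)
            frames.append(frame)
            # Collect the previous chunk's verification; it ran in the
            # background while the current chunk was being generated.
            if pending is not None:
                memory.update(pending.result())
            pending = pool.submit(verify_chunk, frame)
        # Drain the last verification so the final memory is complete.
        if pending is not None:
            memory.update(pending.result())
    return frames, memory


frames, memory = run(3)
```

Under this ordering, chunk 2 is generated with the verified attributes of chunk 0 but not yet those of chunk 1. A timing breakdown would need to show that this lag does not let attribute errors propagate across chunks.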

Circularity Check

0 steps flagged

No significant circularity: pipeline built on external LLMs/VLMs

Full rationale

The paper describes IAMFlow as a training-free framework that composes off-the-shelf LLMs for entity extraction and global ID assignment with VLMs for asynchronous attribute refinement from rendered frames. No equations, fitted parameters, or self-citations are presented that reduce the reported performance gains (e.g., 2.56-point improvement on NarraStream-Bench) to quantities defined by the method itself. The central claims rest on the reliability of external models rather than any self-definitional loop or renamed empirical pattern, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested reliability of pretrained LLMs and VLMs for entity extraction and verification in the context of video generation prompts and frames.

axioms (2)
  • domain assumption: Off-the-shelf LLMs can accurately extract entities with visual attributes from evolving prompts and assign unique global IDs without error.
    This step is required to build the identity-aware memory that replaces implicit attention mechanisms.
  • domain assumption: VLMs can asynchronously verify and refine entity attributes from generated frames reliably enough to prevent drift over long sequences.
    This verification is the mechanism that enables explicit tracking and is assumed to work without additional training.

pith-pipeline@v0.9.0 · 5821 in / 1474 out tokens · 43252 ms · 2026-05-20T11:19:06.067954+00:00 · methodology



Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 13 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 5, 7

  2. [2]

    Seraena: Wip pytorch code for stably training single-step, mode-dropping, deterministic autoencoders.https://github.com/madebyollin/seraena, 2024

    Ollin Boer Bohan. Seraena: Wip pytorch code for stably training single-step, mode-dropping, deterministic autoencoders.https://github.com/madebyollin/seraena, 2024. 6

  3. [3]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021. 27, 28

  4. [4]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024. 1, 3

  5. [5]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng 11 Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025. 3

  6. [6]

    Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

    Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026. 3

  7. [7]

    Ivebench: Modern benchmark suite for instruction-guided video editing assessment.arXiv preprint arXiv:2510.11647, 2025

    Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, and Shuicheng Yan. Ivebench: Modern benchmark suite for instruction-guided video editing assessment.arXiv preprint arXiv:2510.11647, 2025. 28

  8. [8]

    Lightx2v: Light video generation inference framework.https://github.com/ModelTC/ LightX2V, 2025

    LightX2V Contributors. Lightx2v: Light video generation inference framework.https://github.com/ModelTC/ LightX2V, 2025. 6

  9. [9]

    Self-forcing++: Towards minute-scale high-quality video generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. InThe Fourteenth International Conference on Learning Representations, 2026. 3

  10. [10]

    One-minute video generation with test-time training

    Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17702–17711, 2025. 3

  11. [11]

    Autoregressive video generation without vector quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InThe Thirteenth International Conference on Learning Representations, 2025. 1, 3

  12. [12]

    A survey on long-video storytelling generation: architectures, consistency, and cinematic quality

    Mohamed Elmoghany, Ryan Rossi, Seunghyun Yoon, Subhojyoti Mukherjee, Eslam Mohamed Bakr, Puneet Mathur, Gang Wu, Viet Dac Lai, Nedim Lipka, Ruiyi Zhang, et al. A survey on long-video storytelling generation: architectures, consistency, and cinematic quality. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7023–7035, 2025. 1

  13. [13]

    Inflvg: Reinforce inference-time consistent long video generation with grpo.arXiv preprint arXiv:2505.17574, 2025

    Xueji Fang, Liyuan Ma, Zhiyang Chen, Mingyuan Zhou, and Guo-jun Qi. Inflvg: Reinforce inference-time consistent long video generation with grpo.arXiv preprint arXiv:2505.17574, 2025. 3

  14. [14]

    Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245, 2025

    Xiaokun Feng, Haiming Yu, Meiqi Wu, Shiyu Hu, Jintao Chen, Chen Zhu, Jiahong Wu, Xiangxiang Chu, and Kaiqi Huang. Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245, 2025. 3, 7

  15. [15]

    Long-Context Autoregressive Video Modeling with Next-Frame Prediction

    Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025. 3

  16. [16]

    Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

    Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024. 7

  17. [17]

    Streamingt2v: Consistent, dynamic, and extendable long video generation from text

    Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577,

  18. [18]

    Slowfast-vgen: Slow-fast learning for action-driven long video generation

    Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, et al. Slowfast-vgen: Slow-fast learning for action-driven long video generation. InThe Thirteenth International Conference on Learning Representations, 2025. 3

  19. [19]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025. 3, 7, 9, 29

  20. [20]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818,

  21. [21]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 3, 7, 8 12

  22. [22]

    Memflow: Flowing adaptive memory for consistent and efficient long video narratives.arXiv preprint arXiv:2512.14699, 2025

    Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives.arXiv preprint arXiv:2512.14699, 2025. 1, 2, 3, 4, 5, 7, 8, 9

  23. [23]

    Lovic: Efficient long video generation with context compression.arXiv preprint arXiv:2507.12952, 2025

    Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, and Wangmeng Zuo. Lovic: Efficient long video generation with context compression.arXiv preprint arXiv:2507.12952, 2025. 1, 3

  24. [24]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong MU, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. InThe Thirteenth International Conference on Learning Representations, 2025. 3

  25. [25]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. 6, 7

  26. [26]

    Amt: All-pairs multi-field transforms for efficient frame interpolation

    Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. Amt: All-pairs multi-field transforms for efficient frame interpolation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9801–9810, 2023. 27

  27. [27]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025. 3, 7, 9, 29

  28. [28]

    Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025. 3

  29. [29]

    FP8 Formats for Deep Learning

    Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022. 6

  30. [30]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. 1

  31. [31]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 27

  32. [32]

    The proof and measurement of association between two things.The American Journal of Psychology, 15(1):72–101, 1904

    C Spearman. The proof and measurement of association between two things.The American Journal of Psychology, 15(1):72–101, 1904. 10

  33. [33]

    T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025. 7

  34. [34]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean conference on computer vision, pages 402–419. Springer, 2020. 27, 28

  35. [35]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025. 3

  36. [36]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 1

  37. [37]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

  38. [38]

    Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

    Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8428–8437, 2025. 28

  39. [39]

    Moviebench: A hierarchical movie level dataset for long video generation

    Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, and Mike Zheng Shou. Moviebench: A hierarchical movie level dataset for long video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28984–28994, 2025. 3

  40. [40]

    Pack and force your memory: Long-form and consistent video generation.arXiv preprint arXiv:2510.01784, 2025

    Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, and Xuming He. Pack and force your memory: Long-form and consistent video generation.arXiv preprint arXiv:2510.01784, 2025. 3 13

  41. [41]

    Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

    Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant videogen: Auto-regressive long video generation via 2-bit kv-cache quantization. arXiv preprint arXiv:2602.02958, 2026. 5

  42. [42]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023. 1

  43. [43]

    Progressive autoregressive video diffusion models

    Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6322–6332, 2025. 3

  44. [44]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 5, 7

  45. [45]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Ying- cong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622,

  46. [46]

    1, 3, 4, 5, 7, 8, 9, 29

  47. [47]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025. 1, 3

  48. [48]

    Deep forcing: Training-free long video generation with deep sink and participative compression

    Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training- free long video generation with deep sink and participative compression.arXiv preprint arXiv:2512.05081, 2025. 1, 7, 9, 29

  49. [49]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025. 1, 3

  50. [50]

    Videossm: Autoregressive long video generation with hybrid state-space memory.arXiv preprint arXiv:2512.04519, 2025

    Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, et al. Videossm: Autoregressive long video generation with hybrid state-space memory.arXiv preprint arXiv:2512.04519, 2025. 3

  51. [51]

    Packing input frame context in next-frame prediction models for video generation,

    Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation.CoRR, abs/2504.12626, 2025. 1, 3

  52. [52]

    Pretraining frame preservation in autoregressive video memory compression.arXiv preprint arXiv:2512.23851, 2025

    Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression.arXiv preprint arXiv:2512.23851, 2025. 3

  53. [53]

    Blockvid: Block diffusion for high-quality and consistent minute-long video generation

    Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, and Bohan Zhuang. Blockvid: Block diffusion for high-quality and consistent minute-long video generation. arXiv preprint arXiv:2511.22973,

  54. [54]

    Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, WANG HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In The Twelfth International Conference on Learning Representations, 2024. 29

  55. [55]

    "entities": ONLY human/person characters (man, woman, protagonist, etc.)
    - Extract ONLY visual/physical attributes: hair, clothing, accessories, body type, age, skin, facial features
    - DO NOT extract behavioral states (walking, nodding, reading, sitting) or emotions (quiet, contemplative, happy)
    - Keep entity names short
    OUTPUT FORMAT (JSON object only, n...

  56. [56]

    Words like "protagonist", "main character", "he", "she" usually refer to previously introduced characters

  57. [57]

    Matching clothing or appearance attributes indicates the same person

  58. [58]

    Words like "another", "other", "new", "different" indicate a NEW person - return null
    OUTPUT FORMAT (JSON only, no explanation): {"matched_id": <number or null>}
    [User message]
    New character description: "<entity>: <attr1>, <attr2>, ..."
    Existing characters: ID <gid>: <alias1>/<alias2>: <attr1>, <attr2>, ...
    Does the new character match any existing one? ...

  59. [59]

    Whether the segment carries a key transition in the before-after relationship. If so, assign a higher weight

  60. [60]

    Whether the segment contains clear, verifiable, and non-substitutable action execution, character interaction, object transfer, state change, or task progress. If so, assign a higher weight

  61. [61]

    Whether the segment is more likely to reveal whether the model truly understands and follows the prompt. If so, assign a higher weight

  62. [62]

    Whether the segment is only repetition, setup, or closure. If so, assign a lower weight

  63. [63]

    When semantic conditions are similar, prioritize event development and climax stages, respecting narrative structure

  64. [64]

    The weights should be discriminative.
    Requirements:
    - The output length must match the number of input segments.
    - Each score must be an integer from 1 to 100.
    - Do not evaluate video quality; analyze only the prompts themselves.
    - Output JSON only. Do not output explanations, Markdown, or any extra text.
    Output format: {"segment_importance":[int,int,int,...

  65. [65]

    `segment_scores` must be an array of length {num_segments}. The i-th element corresponds to the execution score of the i-th prompt segment

  66. [66]

    All scores must be integers from 1 to 100. 1 means completely inconsistent, and 100 means completely consistent

  67. [67]

    By default, score according to "the key frames of this segment" and "the prompt of this segment". Do not add points based on beginning, middle, or ending position, narrative structure, or segment importance

  68. [68]

    If a segment prompt mainly describes continuation, maintenance, confirmation, observation, or still being in some state, you may use the state established in the previous segment to judge whether the state remains valid. Do not mechanically assign a low score only because the action magnitude in this segment is small

  69. [69]

    Scoring must be based on directly observable evidence:
    - Whether the explicit action occurs
    - Whether character interaction occurs
    - Whether object change, object handoff, or state transition occurs
    - Whether the segment prompt is executed at the correct time
    - If the segment is a continuation/maintenance prompt, whether it clearly preserves the previousl...

  70. [70]

    If only the characters, scene, or general atmosphere match, but the key action, interaction, object ownership, handoff receiving relation, or state change is unclear, `segment_score` should not exceed 50. For multi-person scenes, as long as character identity, interaction target, or object ownership is clearly uncertain, the score should not be high even ...

  71. [71]

    If there is almost no direct evidence supporting the segment prompt, `segment_score` should be in 1-20. If there is only weakly related, generic, or substitutable evidence, `segment_score` should usually be in 20-50, not high

  72. [72]

    Adjacent segments may have the same score, and large jumps are allowed. Do not output an increasing or decreasing score sequence merely to make it look smooth. However, do not automatically assign an extremely low score to a "continuation holds" segment only because later-segment action is weaker

  73. [73]

    `overall_score` indicates whether the whole video completes the key events in the correct temporal order and maintains later state consistency. If the key action, main interaction, object handoff, or state transition clearly occurs and later segments do not obviously violate the prompt, a medium-high score is allowed even if later segments have weaker ...

  74. [74]

    Assign 100 only when character identity, key action, object relation, and timing are all very clear and almost unambiguous.
    Output format: {"segment_scores":[int,int,...],"overall_score":int}

    Planner and judge output schemas. For metric aggregation, the LLM planner sees only the six prompt texts and returns {segment_importance: [w_1, ..., w_6]}, w_i ∈ {1...
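The planner and judge output formats quoted above can be checked mechanically before any scores are combined. The following is a minimal Python sketch, not the authors' implementation. The function names `parse_scores` and `weighted_segment_score` and the example replies are hypothetical. The aggregation rule is truncated in this extraction, so the importance-weighted mean used here is an assumption made only for illustration.

```python
import json


def parse_scores(raw_reply, key, expected_len):
    """Return data[key] from a JSON reply after checking length and the 1-100 integer range."""
    data = json.loads(raw_reply)
    scores = data[key]
    if not isinstance(scores, list) or len(scores) != expected_len:
        raise ValueError(f"{key} must be a list of length {expected_len}")
    for value in scores:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 100:
            raise ValueError(f"{key} entries must be integers from 1 to 100, got {value!r}")
    return scores


def weighted_segment_score(importance, segment_scores):
    """ASSUMPTION: importance-weighted mean of per-segment scores (not confirmed by the source)."""
    if len(importance) != len(segment_scores):
        raise ValueError("importance and segment_scores must have the same length")
    return sum(w * s for w, s in zip(importance, segment_scores)) / sum(importance)


# Made-up replies for a six-segment script (illustrative values only).
planner_reply = '{"segment_importance": [20, 80, 100, 60, 40, 10]}'
judge_reply = '{"segment_scores": [90, 70, 50, 80, 100, 100], "overall_score": 72}'

weights = parse_scores(planner_reply, "segment_importance", 6)
scores = parse_scores(judge_reply, "segment_scores", 6)
aggregate = weighted_segment_score(weights, scores)
print(round(aggregate, 2))
```

Validating both replies against the same length and range constraints keeps a malformed planner or judge output from silently skewing the aggregate. Any other weighting scheme could be swapped into `weighted_segment_score` without touching the parsing step.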