Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

Dingkang Liang; Jiangning Zhang; Jinzhuo Liu; Ran Yi; Wencan Jiang; Yabiao Wang; Yong Liu; Zhucun Xue

arxiv: 2605.18733 · v1 · pith:ZZB3NV3Wnew · submitted 2026-05-18 · 💻 cs.CV

Jinzhuo Liu, Jiangning Zhang, Wencan Jiang, Yabiao Wang, Dingkang Liang, Zhucun Xue, Ran Yi, Yong Liu

Pith reviewed 2026-05-20 11:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords narrative video generation · identity consistency · training-free framework · entity tracking · long video synthesis · LLM entity extraction · VLM verification · video benchmark

The pith

A training-free memory system assigns global IDs to story entities and verifies them via vision models to prevent identity drift in long videos with shifting prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets long-term inconsistency in autoregressive video generation, where evolving prompts cause identity drift, character duplication, and attribute loss. It proposes IAMFlow, a training-free framework that uses an LLM to extract entities from each prompt and assign them unique global IDs, then uses a VLM to verify and refine their attributes asynchronously on generated frames. This explicit identity tracking replaces reliance on implicit attention signals or frame compression. The work also presents NarraStream-Bench, a new evaluation set of 324 multi-prompt scripts, and an acceleration pipeline built on asynchronous verification, adaptive prompt transition, and model quantization. Experiments report the best overall score on NarraStream-Bench, 2.56 points above the strongest baseline, together with a 1.39× speedup over the most efficient baseline in the 60-second multi-prompt setting.

Core claim

IAMFlow explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. An LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching.

What carries the argument

IAMFlow identity-aware memory, which stores entities under unique global IDs extracted by LLM and refined by VLM verification on frames.
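
To make the idea of an identity-aware memory concrete, here is a minimal sketch of a registry that stores entities under global IDs. It is not the authors' implementation: the class names `Entity` and `IdentityMemory`, the `same_as` argument (a stand-in for the LLM's decision that two mentions refer to the same entity), and the rule that frame-based observations overwrite prompt-based attributes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Entity:
    """One persistent story entity, keyed by a global ID."""
    global_id: int
    names: Set[str] = field(default_factory=set)              # every alias seen so far
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. {"coat": "red"}


class IdentityMemory:
    """Hypothetical registry; an LLM would decide which mentions co-refer."""

    def __init__(self) -> None:
        self._entities: Dict[int, Entity] = {}
        self._alias_to_id: Dict[str, int] = {}
        self._next_id = 0

    def register(self, name: str, attributes: Dict[str, str],
                 same_as: Optional[str] = None) -> int:
        """Add a mention; `same_as` stands in for the LLM's coreference decision."""
        key = name.lower()
        anchor = same_as.lower() if same_as else key
        if anchor in self._alias_to_id:
            gid = self._alias_to_id[anchor]   # known entity: reuse its ID
        else:
            gid = self._next_id               # new entity: mint a fresh ID
            self._next_id += 1
            self._entities[gid] = Entity(gid)
        entity = self._entities[gid]
        entity.names.add(key)
        self._alias_to_id[key] = gid
        # Assumed policy: prompt attributes only fill gaps, never overwrite.
        for attr, value in attributes.items():
            entity.attributes.setdefault(attr, value)
        return gid

    def refine(self, gid: int, observed: Dict[str, str]) -> None:
        """Assumed policy: values read off rendered frames overwrite stored ones."""
        self._entities[gid].attributes.update(observed)

    def get(self, gid: int) -> Entity:
        return self._entities[gid]


memory = IdentityMemory()
a = memory.register("Mara", {"coat": "red", "hair": "black"})
b = memory.register("the detective", {"hat": "grey"}, same_as="Mara")
c = memory.register("Jon", {"coat": "blue"})
memory.refine(a, {"coat": "dark red"})
```

In this toy run "Mara" and "the detective" share one ID while "Jon" gets another, which is the behaviour the pith describes as replacing similarity-based matching with explicit tracking.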

If this is right

Outperforms the strongest baseline by 2.56 points on NarraStream-Bench overall.
Achieves a 1.39× speedup over the most efficient baseline in the 60-second multi-prompt setting.
Handles shifting entity references in evolving prompts without identity drift or attribute loss.
Keeps generation practical via asynchronous visual verification, adaptive prompt transition, and model quantization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The ID-assignment approach could transfer to other time-series generation tasks such as consistent character animation or multi-shot image sequences.
Removing the need for domain-specific training allows the technique to pair with any improving base video model without retraining.
Scaling tests on scripts with many simultaneous entities or durations beyond one minute would reveal whether verification errors grow with complexity.

Load-bearing premise

The approach assumes that off-the-shelf LLMs can reliably extract entities with visual attributes from evolving prompts and assign unique global IDs, and that VLMs can asynchronously verify and refine those attributes from rendered frames without accumulating errors that break long-term consistency.

What would settle it

Generate a 60-second sequence from a script that repeatedly refers to the same character with new names and conflicting descriptions; check whether the output frames maintain one consistent appearance or show duplicated or switched identities.
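
One way such a check could be automated is sketched below, under strong assumptions: that some image encoder yields one appearance embedding per video segment for the character in question, and that a fixed cosine-similarity threshold separates "same look" from "different look". The function names, the toy 3-D vectors, and the 0.8 threshold are all hypothetical; the paper does not describe this procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def drifting_segments(embeddings, threshold=0.8):
    """Return indices of segments whose embedding strays from the first segment."""
    reference = embeddings[0]
    return [i for i, e in enumerate(embeddings[1:], start=1)
            if cosine(reference, e) < threshold]


# Toy vectors standing in for per-segment character embeddings (invented values).
segments = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],   # a segment where the character looks different
    [1.0, 0.05, 0.0],
]
flagged = drifting_segments(segments)
```

With these toy inputs only the third segment is flagged. A real test would still need a human or VLM judgment to tell a switched identity from a legitimate change of pose or lighting.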

Original abstract

Autoregressive video generation has improved rapidly in visual fidelity and interactivity, but it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames using predefined strategies or retrieve keyframes based on coarse implicit attention signals, both of which fail to handle evolving prompts with shifting entity references, leading to identity drift, character duplication, and attribute loss. To address this, we propose IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. Specifically, an LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. To keep the proposed framework computationally practical, we design a systematic inference acceleration pipeline, including asynchronous visual verification, adaptive prompt transition, and model quantization, which achieves faster generation than existing baselines. Furthermore, we introduce NarraStream-Bench, a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates both traditional metrics and multimodal large language model-based assessments. Extensive experiments show that IAMFlow, despite being training-free, achieves the best overall performance on NarraStream-Bench, outperforming the strongest baseline by 2.56 points, while achieving a 1.39$\times$ speedup over the most efficient baseline in the 60-second multi-prompt setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

IAMFlow gives a training-free pipeline for explicit entity tracking in long narrative videos via LLM global IDs and async VLM checks, plus a new benchmark, but the gains rest on unmeasured reliability of those off-the-shelf models.

IAMFlow is a training-free framework that keeps identities consistent in long narrative video generation: an LLM assigns global IDs to entities extracted from prompts, and a VLM asynchronously verifies and refines their attributes from the generated frames. The paper lays out the problem clearly. Existing methods rely on compressed history or implicit attention, and both struggle with shifting references in evolving prompts. The explicit modeling is a clear step forward, and NarraStream-Bench, with 324 scripts and a three-dimensional evaluation protocol, adds a useful tool for the field. The reported results show IAMFlow outperforming the strongest baseline by 2.56 points overall and running 1.39 times faster in the 60-second setting, which suggests the acceleration pipeline (asynchronous verification, adaptive transitions, quantization) is effective.

The weaker part is the validation of the core components. The approach assumes reliable entity extraction and ID assignment by the LLM and error-free refinement by the VLM, but the paper gives no measurements of misassignment rates or of drift over multiple prompt transitions, so it is difficult to gauge how robust the method is beyond the tested cases. Prompt ambiguity in particular could be a real failure mode if the experiments do not stress it.

Readers working on video generation for entertainment or simulation will find this relevant, especially those looking for methods that do not require retraining large models. The pipeline and benchmark have enough substance to warrant serious referee attention rather than a desk reject. I would recommend sending this to peer review, with suggestions to include more failure-case analysis and comparisons against other long-video consistency techniques.

Referee Report

3 major / 2 minor

Summary. The paper proposes IAMFlow, a training-free identity-aware memory framework for autoregressive narrative long video generation. It uses an LLM to extract entities with visual attributes from evolving multi-prompt scripts and assign unique global IDs, paired with asynchronous VLM verification and refinement of attributes from rendered frames to maintain long-term identity consistency. The framework incorporates an inference acceleration pipeline (asynchronous verification, adaptive transitions, quantization) and introduces NarraStream-Bench, a new benchmark with 324 multi-prompt scripts across six dimensions and a three-dimensional evaluation protocol. Experiments claim IAMFlow achieves the best overall score on the benchmark (outperforming the strongest baseline by 2.56 points) and a 1.39× speedup over the most efficient baseline in the 60-second multi-prompt setting.

Significance. If the central claims hold under rigorous validation, this work offers a practical training-free alternative to implicit attention or keyframe compression for mitigating identity drift, duplication, and attribute loss in long video generation. The explicit entity tracking via off-the-shelf LLMs/VLMs and the new NarraStream-Bench with multimodal LLM-based assessment represent potentially useful contributions to the field of controllable video synthesis.

major comments (3)

Methods (entity extraction and ID assignment): The central claim that explicit LLM-based global ID assignment and VLM refinement replace implicit similarity matching to prevent drift relies on the untested assumption that off-the-shelf LLMs reliably handle shifting entity references across prompt transitions without duplication or misassignment. No error-rate measurements, failure-case analysis, or ablation on prompt ambiguity are reported, making it impossible to attribute the 2.56-point gain specifically to the identity-aware memory.
Experiments and evaluation protocol: The reported 2.56-point overall improvement and 1.39× speedup on NarraStream-Bench are presented without per-dimension breakdowns, statistical significance tests, or analysis of cases where entity references evolve rapidly in the 60-second setting. This leaves open whether the gains are robust or concentrated in easier subsets of the 324 scripts.
Acceleration pipeline: The description of how asynchronous VLM verification, adaptive prompt transition, and quantization interact without introducing additional latency or consistency errors is high-level; a concrete timing breakdown or pseudocode would be needed to substantiate the speedup claim as load-bearing for practicality.
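
The third comment asks for pseudocode. As a rough illustration of the general pattern at stake (verification running in the background while generation continues), the sketch below uses Python's standard `concurrent.futures`. It is not the authors' pipeline: `generate_chunk` and `verify_chunk` are stubs, the memory is a plain dictionary, and adaptive prompt transition and quantization are left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_chunk(index, memory):
    """Stub generator: returns a fake 'frame' recording the memory it was given."""
    return {"chunk": index, "conditioning": dict(memory)}


def verify_chunk(frame):
    """Stub verifier: pretends to read one attribute off the rendered frame."""
    return {f"attr_from_chunk_{frame['chunk']}": "verified"}


def run(num_chunks):
    memory = {}
    pending = []
    frames = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i in range(num_chunks):
            # Fold in verification results that have already finished; never wait.
            still_running = []
            for future in pending:
                if future.done():
                    memory.update(future.result())
                else:
                    still_running.append(future)
            pending = still_running

            frame = generate_chunk(i, memory)
            frames.append(frame)
            # Verification of this chunk overlaps with generation of the next one.
            pending.append(pool.submit(verify_chunk, frame))

        # End of script: drain whatever is still in flight.
        for future in pending:
            memory.update(future.result())
    return frames, memory


frames, memory = run(4)
```

The trade-off this pattern makes is the one the referee points at: a refinement may arrive one or more chunks late, so a chunk can be generated from slightly stale attributes. Which chunk first sees a given refinement depends on timing, which is why a concrete latency breakdown matters for the speedup claim.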

minor comments (2)

The abstract and introduction could more clearly distinguish the proposed explicit tracking from prior keyframe-retrieval or memory-compression baselines with a direct comparison table.
Notation for global IDs and attribute refinement steps would benefit from a small diagram or algorithm box to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.

Referee: Methods (entity extraction and ID assignment): The central claim that explicit LLM-based global ID assignment and VLM refinement replace implicit similarity matching to prevent drift relies on the untested assumption that off-the-shelf LLMs reliably handle shifting entity references across prompt transitions without duplication or misassignment. No error-rate measurements, failure-case analysis, or ablation on prompt ambiguity are reported, making it impossible to attribute the 2.56-point gain specifically to the identity-aware memory.

Authors: We agree that additional analysis would help substantiate the reliability of the LLM-based entity extraction and ID assignment. While the VLM verification step is designed to detect and correct misassignments or duplications by refining attributes from rendered frames, we did not report quantitative error rates or specific ablations on prompt ambiguity in the original submission. The performance gains on NarraStream-Bench, which features scripts with shifting references across its six dimensions, provide indirect evidence of the framework's effectiveness. In the revision, we will include failure-case analysis and an ablation study examining the impact of prompt ambiguity on ID assignment accuracy. revision: yes
Referee: Experiments and evaluation protocol: The reported 2.56-point overall improvement and 1.39× speedup on NarraStream-Bench are presented without per-dimension breakdowns, statistical significance tests, or analysis of cases where entity references evolve rapidly in the 60-second setting. This leaves open whether the gains are robust or concentrated in easier subsets of the 324 scripts.

Authors: We acknowledge the value of per-dimension breakdowns and statistical tests for demonstrating robustness. The benchmark was constructed to include a variety of narrative complexities, including rapid entity evolution in the multi-prompt setting. However, the original manuscript focused on overall scores. We will revise to include per-dimension results, statistical significance analysis (e.g., paired t-tests or similar), and a dedicated discussion of performance on subsets with rapidly evolving references. revision: yes
Referee: Acceleration pipeline: The description of how asynchronous VLM verification, adaptive prompt transition, and quantization interact without introducing additional latency or consistency errors is high-level; a concrete timing breakdown or pseudocode would be needed to substantiate the speedup claim as load-bearing for practicality.

Authors: We will enhance the description of the inference acceleration pipeline in the revised manuscript. Specifically, we will add a timing breakdown table showing the latency contributions of each component (asynchronous verification, adaptive transitions, and quantization) and include pseudocode illustrating their integration to ensure no additional consistency errors are introduced while achieving the reported speedup. revision: yes
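
The second response mentions "paired t-tests or similar". For readers unfamiliar with what a paired per-script comparison involves, here is a small standard-library sketch that computes a paired t statistic and an exact sign-flip permutation p-value. The score lists are invented placeholders, not results from the paper, and the function names are hypothetical.

```python
import itertools
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic of the per-item differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def sign_flip_p_value(a, b):
    """Exact two-sided paired permutation test (enumerates all 2**n sign flips)."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-script scores, invented for illustration only.
method_scores = [71.0, 68.5, 74.0, 69.0, 72.5, 70.0]
baseline_scores = [68.0, 67.0, 70.5, 69.5, 69.0, 68.0]

t_stat = paired_t_statistic(method_scores, baseline_scores)
p_value = sign_flip_p_value(method_scores, baseline_scores)
```

Exhaustive enumeration is only practical for a handful of items. With 324 scripts one would sample random sign flips instead, or use a library routine for the paired t-test.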

Circularity Check

0 steps flagged

No significant circularity: pipeline built on external LLMs/VLMs

The paper describes IAMFlow as a training-free framework that composes off-the-shelf LLMs for entity extraction and global ID assignment with VLMs for asynchronous attribute refinement from rendered frames. It presents no equations, fitted parameters, or self-citations that reduce the reported performance gains (e.g., the 2.56-point improvement on NarraStream-Bench) to quantities defined by the method itself. The central claims rest on the reliability of external models rather than on any self-definitional loop or renamed empirical pattern, so the argument does not depend on circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested reliability of pretrained LLMs and VLMs for entity extraction and verification in the context of video generation prompts and frames.

axioms (2)

domain assumption: Off-the-shelf LLMs can accurately extract entities with visual attributes from evolving prompts and assign unique global IDs without error.
This step is required to build the identity-aware memory that replaces implicit attention mechanisms.
domain assumption: VLMs can asynchronously verify and refine entity attributes from generated frames reliably enough to prevent drift over long sequences.
This verification is the mechanism that enables explicit tracking and is assumed to work without additional training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 13 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Seraena: Wip pytorch code for stably training single-step, mode-dropping, deterministic autoencoders.https://github.com/madebyollin/seraena, 2024

Ollin Boer Bohan. Seraena: Wip pytorch code for stably training single-step, mode-dropping, deterministic autoencoders.https://github.com/madebyollin/seraena, 2024. 6

work page 2024
[3]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021. 27, 28

work page 2021
[4]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024. 1, 3

work page 2024
[5]

SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng 11 Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026

Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context.arXiv preprint arXiv:2602.06028, 2026. 3

work page arXiv 2026
[7]

Ivebench: Modern benchmark suite for instruction-guided video editing assessment.arXiv preprint arXiv:2510.11647, 2025

Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, and Shuicheng Yan. Ivebench: Modern benchmark suite for instruction-guided video editing assessment.arXiv preprint arXiv:2510.11647, 2025. 28

work page arXiv 2025
[8]

Lightx2v: Light video generation inference framework.https://github.com/ModelTC/ LightX2V, 2025

LightX2V Contributors. Lightx2v: Light video generation inference framework.https://github.com/ModelTC/ LightX2V, 2025. 6

work page 2025
[9]

Self-forcing++: Towards minute-scale high-quality video generation

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. InThe Fourteenth International Conference on Learning Representations, 2026. 3

work page 2026
[10]

One-minute video generation with test-time training

Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, and Xiaolong Wang. One-minute video generation with test-time training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17702–17711, 2025. 3

work page 2025
[11]

Autoregressive video generation without vector quantization

Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. InThe Thirteenth International Conference on Learning Representations, 2025. 1, 3

work page 2025
[12]

A survey on long-video storytelling generation: architectures, consistency, and cinematic quality

Mohamed Elmoghany, Ryan Rossi, Seunghyun Yoon, Subhojyoti Mukherjee, Eslam Mohamed Bakr, Puneet Mathur, Gang Wu, Viet Dac Lai, Nedim Lipka, Ruiyi Zhang, et al. A survey on long-video storytelling generation: architectures, consistency, and cinematic quality. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7023–7035, 2025. 1

work page 2025
[13]

Inflvg: Reinforce inference-time consistent long video generation with grpo.arXiv preprint arXiv:2505.17574, 2025

Xueji Fang, Liyuan Ma, Zhiyang Chen, Mingyuan Zhou, and Guo-jun Qi. Inflvg: Reinforce inference-time consistent long video generation with grpo.arXiv preprint arXiv:2505.17574, 2025. 3

work page arXiv 2025
[14]

Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245, 2025

Xiaokun Feng, Haiming Yu, Meiqi Wu, Shiyu Hu, Jintao Chen, Chen Zhu, Jiahong Wu, Xiangxiang Chu, and Kaiqi Huang. Narrlv: Towards a comprehensive narrative-centric evaluation for long video generation.arXiv preprint arXiv:2507.11245, 2025. 3, 7

work page arXiv 2025
[15]

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024. 7

work page 2024
[17]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577,

work page
[18]

Slowfast-vgen: Slow-fast learning for action-driven long video generation

Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, et al. Slowfast-vgen: Slow-fast learning for action-driven long video generation. InThe Thirteenth International Conference on Learning Representations, 2025. 3

work page 2025
[19]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025. 3, 7, 9, 29

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818,

work page
[21]

Vbench++: Comprehensive and versatile benchmark suite for video generative models

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 3, 7, 8 12

work page 2025
[22]

Memflow: Flowing adaptive memory for consistent and efficient long video narratives.arXiv preprint arXiv:2512.14699, 2025

Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. Memflow: Flowing adaptive memory for consistent and efficient long video narratives.arXiv preprint arXiv:2512.14699, 2025. 1, 2, 3, 4, 5, 7, 8, 9

work page arXiv 2025
[23]

Lovic: Efficient long video generation with context compression.arXiv preprint arXiv:2507.12952, 2025

Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, and Wangmeng Zuo. Lovic: Efficient long video generation with context compression.arXiv preprint arXiv:2507.12952, 2025. 1, 3

work page arXiv 2025
[24]

Pyramidal flow matching for efficient video generative modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong MU, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. InThe Thirteenth International Conference on Learning Representations, 2025. 3

work page 2025
[25]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. 6, 7

work page 2023
[26]

Amt: All-pairs multi-field transforms for efficient frame interpolation

Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. Amt: All-pairs multi-field transforms for efficient frame interpolation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9801–9810, 2023. 27

work page 2023
[27]

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025. 3, 7, 9, 29

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning.arXiv preprint arXiv:2209.05433, 2022. 6

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. 1

work page 2023
[31]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 27

work page 2021
[32]

The proof and measurement of association between two things.The American Journal of Psychology, 15(1):72–101, 1904

C Spearman. The proof and measurement of association between two things.The American Journal of Psychology, 15(1):72–101, 1904. 10

work page 1904
[33]

T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025. 7

work page 2025
[34]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean conference on computer vision, pages 402–419. Springer, 2020. 27, 28

work page 2020
[35]

MAGI-1: Autoregressive Video Generation at Scale

Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 1

work page 2017
[37]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8428–8437, 2025. 28

work page 2025
[39]

Moviebench: A hierarchical movie level dataset for long video generation

Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, and Mike Zheng Shou. Moviebench: A hierarchical movie level dataset for long video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28984–28994, 2025. 3

work page 2025
[40]

Pack and force your memory: Long-form and consistent video generation.arXiv preprint arXiv:2510.01784, 2025

Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, and Xuming He. Pack and force your memory: Long-form and consistent video generation.arXiv preprint arXiv:2510.01784, 2025. 3 13

work page arXiv 2025
[41]

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant videogen: Auto-regressive long video generation via 2-bit kv-cache quantization. arXiv preprint arXiv:2602.02958, 2026. 5

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[43] Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6322–6332, 2025.

[44] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[45] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622.

[47] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025.

[48] Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025.

[49] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025.

[50] Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, et al. Videossm: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025.

[51] Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. CoRR, abs/2504.12626, 2025.

[52] Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression. arXiv preprint arXiv:2512.23851, 2025.

[53] Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, and Bohan Zhuang. Blockvid: Block diffusion for high-quality and consistent minute-long video generation. arXiv preprint arXiv:2511.22973.

[54] Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, WANG HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In The Twelfth International Conference on Learning Representations, 2024.

Appendix Contents
A Identity-Aware Memory Implement...

[55] "entities": ONLY human/person characters (man, woman, protagonist, etc.)
- Extract ONLY visual/physical attributes: hair, clothing, accessories, body type, age, skin, facial features
- DO NOT extract behavioral states (walking, nodding, reading, sitting) or emotions (quiet, contemplative, happy)
- Keep entity names short
OUTPUT FORMAT (JSON object only, n...

[56] Words like "protagonist", "main character", "he", "she" usually refer to previously introduced characters

[57] Matching clothing or appearance attributes indicates the same person

[58] Words like "another", "other", "new", "different" indicate a NEW person - return null
OUTPUT FORMAT (JSON only, no explanation): {"matched_id": <number or null>}
[User message]
New character description: "<entity>: <attr1>, <attr2>, ..."
Existing characters:
ID <gid>: <alias1>/<alias2>: <attr1>, <attr2>, ...
Does the new character match any existing one? ...

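The matching prompt above returns a single JSON field, `matched_id`, which is either an existing global ID or null. A minimal sketch of how such a reply might be consumed is given below. The function names, the dictionary-based memory, the attribute-merging step, and the fallback of treating malformed replies as "new person" are all assumptions made for illustration; the source only specifies the output format.

```python
import json
from typing import Optional


def parse_matched_id(reply: str, known_ids: set[int]) -> Optional[int]:
    """Return the matched global ID, or None when the reply signals a new person."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return None  # assumption: a malformed reply is treated as "no match"
    if not isinstance(payload, dict):
        return None
    matched = payload.get("matched_id")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(matched, bool) or not isinstance(matched, int):
        return None
    # An ID that is not in memory cannot be a valid match
    return matched if matched in known_ids else None


def resolve_global_id(reply: str, memory: dict[int, list[str]], attrs: list[str]) -> int:
    """Reuse the matched ID or allocate a fresh one, then merge attributes (hypothetical policy)."""
    gid = parse_matched_id(reply, set(memory))
    if gid is None:
        gid = max(memory, default=0) + 1  # next unused global ID
        memory[gid] = []
    for attr in attrs:
        if attr not in memory[gid]:
            memory[gid].append(attr)
    return gid
```

Under these assumptions, a reply of `{"matched_id": null}` always allocates a new ID, which mirrors the prompt rule that words such as "another" or "new" indicate a new person.
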
[59] Whether the segment carries a key transition in the before-after relationship. If so, assign a higher weight

[60] Whether the segment contains clear, verifiable, and non-substitutable action execution, character interaction, object transfer, state change, or task progress. If so, assign a higher weight

[61] Whether the segment is more likely to reveal whether the model truly understands and follows the prompt. If so, assign a higher weight

[62] Whether the segment is only repetition, setup, or closure. If so, assign a lower weight

[63] When semantic conditions are similar, prioritize event development and climax stages, respecting narrative structure

[64] The weights should be discriminative.
Requirements:
- The output length must match the number of input segments.
- Each score must be an integer from 1 to 100.
- Do not evaluate video quality; analyze only the prompts themselves.
- Output JSON only. Do not output explanations, Markdown, or any extra text.
Output format: {"segment_importance":[int,int,int,...

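The planner requirements above (array length equal to the number of segments, integer weights from 1 to 100, JSON only) can be checked mechanically before the weights are used. The following is a minimal validation sketch; the function name `parse_segment_importance` and the choice to raise `ValueError` on any violation are assumptions, not part of the source.

```python
import json


def parse_segment_importance(reply: str, num_segments: int) -> list[int]:
    """Parse the planner reply and enforce the stated length and range requirements."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError("planner reply is not valid JSON") from exc
    if not isinstance(payload, dict) or "segment_importance" not in payload:
        raise ValueError("missing 'segment_importance' field")
    weights = payload["segment_importance"]
    if not isinstance(weights, list) or len(weights) != num_segments:
        raise ValueError("output length must match the number of input segments")
    for w in weights:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(w, bool) or not isinstance(w, int) or not 1 <= w <= 100:
            raise ValueError("each weight must be an integer from 1 to 100")
    return weights
```
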
[65] `segment_scores` must be an array of length {num_segments}. The i-th element corresponds to the execution score of the i-th prompt segment

[66] All scores must be integers from 1 to 100. 1 means completely inconsistent, and 100 means completely consistent

[67] By default, score according to "the key frames of this segment" and "the prompt of this segment". Do not add points based on beginning, middle, or ending position, narrative structure, or segment importance

[68] If a segment prompt mainly describes continuation, maintenance, confirmation, observation, or still being in some state, you may use the state established in the previous segment to judge whether the state remains valid. Do not mechanically assign a low score only because the action magnitude in this segment is small

[69] Scoring must be based on directly observable evidence:
- Whether the explicit action occurs
- Whether character interaction occurs
- Whether object change, object handoff, or state transition occurs
- Whether the segment prompt is executed at the correct time
- If the segment is a continuation/maintenance prompt, whether it clearly preserves the previousl...

[70] If only the characters, scene, or general atmosphere match, but the key action, interaction, object ownership, handoff receiving relation, or state change is unclear, `segment_score` should not exceed 50. For multi-person scenes, as long as character identity, interaction target, or object ownership is clearly uncertain, the score should not be high even ...

[71] If there is almost no direct evidence supporting the segment prompt, `segment_score` should be in 1-20. If there is only weakly related, generic, or substitutable evidence, `segment_score` should usually be in 20-50, not high

[72] Adjacent segments may have the same score, and large jumps are allowed. Do not output an increasing or decreasing score sequence merely to make it look smooth. However, do not automatically assign an extremely low score to a "continuation holds" segment only because later-segment action is weaker

[73] `overall_score` indicates whether the whole video completes the key events in the correct temporal order and maintains later state consistency. If the key action, main interaction, object handoff, or state transition clearly occurs and later segments do not obviously violate the prompt, a medium-high score is allowed even if later segments have weaker ...

[74] Assign 100 only when character identity, key action, object relation, and timing are all very clear and almost unambiguous.
Output format: {"segment_scores":[int,int,...],"overall_score":int}

Planner and judge output schemas. For metric aggregation, the LLM planner sees only the six prompt texts and returns {segment_importance: [w_1, ..., w_6]}, w_i ∈ {1...

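The schema paragraph above pairs the planner's per-segment weights with the judge's per-segment scores, but the aggregation formula itself is cut off in this text. As an illustration only, the sketch below assumes a normalized importance-weighted mean of `segment_scores`; the function name and the choice of a weighted mean are assumptions and may differ from the paper's actual metric.

```python
def weighted_segment_score(segment_scores: list[int], segment_importance: list[int]) -> float:
    """Importance-weighted mean of judge scores (assumed aggregation, not confirmed by the source)."""
    if not segment_scores:
        raise ValueError("at least one segment is required")
    if len(segment_scores) != len(segment_importance):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(segment_importance)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each segment contributes in proportion to its planner-assigned importance
    weighted_sum = sum(s * w for s, w in zip(segment_scores, segment_importance))
    return weighted_sum / total_weight
```

With equal weights this reduces to a plain average, so the planner weights only change the result when some segments are judged more important than others.
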


[37] [37]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8428–8437, 2025. 28

work page 2025

[39] [39]

Moviebench: A hierarchical movie level dataset for long video generation

Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, and Mike Zheng Shou. Moviebench: A hierarchical movie level dataset for long video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28984–28994, 2025. 3

work page 2025

[40] [40]

Pack and force your memory: Long-form and consistent video generation.arXiv preprint arXiv:2510.01784, 2025

Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, and Xuming He. Pack and force your memory: Long-form and consistent video generation.arXiv preprint arXiv:2510.01784, 2025. 3 13

work page arXiv 2025

[41] [41]

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, et al. Quant videogen: Auto-regressive long video generation via 2-bit kv-cache quantization. arXiv preprint arXiv:2602.02958, 2026. 5

work page internal anchor Pith review Pith/arXiv arXiv 2026

[42] [42]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Progressive autoregressive video diffusion models

Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. Progressive autoregressive video diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6322–6332, 2025. 3

work page 2025

[44] [44]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Ying- cong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622,

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

1, 3, 4, 5, 7, 8, 9, 29

work page

[47] [47]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025. 1, 3

work page 2025

[48] [48]

Deep forcing: Training-free long video generation with deep sink and participative compression

Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training- free long video generation with deep sink and participative compression.arXiv preprint arXiv:2512.05081, 2025. 1, 7, 9, 29

work page arXiv 2025

[49] [49]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025. 1, 3

work page 2025

[50] [50]

Videossm: Autoregressive long video generation with hybrid state-space memory.arXiv preprint arXiv:2512.04519, 2025

Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, et al. Videossm: Autoregressive long video generation with hybrid state-space memory.arXiv preprint arXiv:2512.04519, 2025. 3

work page arXiv 2025

[51] [51]

Packing input frame context in next-frame prediction models for video generation,

Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation.CoRR, abs/2504.12626, 2025. 1, 3

work page arXiv 2025

[52] [52]

Pretraining frame preservation in autoregressive video memory compression.arXiv preprint arXiv:2512.23851, 2025

Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation in autoregressive video memory compression.arXiv preprint arXiv:2512.23851, 2025. 3

work page arXiv 2025

[53] [53]

Blockvid: Block diffusion for high-quality and consistent minute-long video generation.arXiv preprint arXiv:2511.22973,

Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, and Bohan Zhuang. Blockvid: Block diffusion for high-quality and consistent minute-long video generation.arXiv preprint arXiv:2511.22973,

work page arXiv

[54] [54]

another”, “new

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, WANG HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. InThe Twelfth International Conference on Learning Representations, 2024. 29 14 Appendix Contents A Identity-Aware Memory Implement...

work page 2024

[55] [55]

entities

"entities": ONLY human/person characters (man, woman, protagonist, etc.) - Extract ONLY visual/physical attributes: hair, clothing, accessories, body type, age, skin, facial features - DO NOT extract behavioral states (walking, nodding, reading, sitting) or emotions (quiet, contemplative, happy) - Keep entity names short OUTPUT FORMAT (JSON object only, n...

work page

[56] [56]

protagonist

Words like "protagonist", "main character", "he", "she" usually refer to previously introduced characters

work page

[57] [57]

Matching clothing or appearance attributes indicates the same person

work page

[58] [58]

another",

Words like "another", "other", "new", "different" indicate a NEW person - return null OUTPUT FORMAT (JSON only, no explanation): {"matched_id": <number or null>} [User message] New character description: "<entity>: <attr1>, <attr2>, ..." Existing characters: ID <gid>: <alias1>/<alias2>: <attr1>, <attr2>, ... Does the new character match any existing one? ...

work page

[59] [59]

If so, assign a higher weight

Whether the segment carries a key transition in the before-after relationship. If so, assign a higher weight

work page

[60] [60]

If so, assign a higher weight

Whether the segment contains clear, verifiable, and non-substitutable action execution, character interaction, object transfer, state change, or task progress. If so, assign a higher weight

work page

[61] [61]

If so, assign a higher weight

Whether the segment is more likely to reveal whether the model truly understands and follows the prompt. If so, assign a higher weight

work page

[62] [62]

If so, assign a lower weight

Whether the segment is only repetition, setup, or closure. If so, assign a lower weight

work page

[63] [63]

When semantic conditions are similar, prioritize event development and climax stages, respecting narrative structure

work page

[64] [64]

segment_importance

The weights should be discriminative. Requirements: - The output length must match the number of input segments. - Each score must be an integer from 1 to 100. - Do not evaluate video quality; analyze only the prompts themselves. - Output JSON only. Do not output explanations, Markdown, or any extra text. Output format: {"segment_importance":[int,int,int,...

work page

[65] [65]

The i-th element corresponds to the execution score of the i-th prompt segment

‘segment_scores‘ must be an array of length {num_segments}. The i-th element corresponds to the execution score of the i-th prompt segment

work page

[66] [66]

1 means completely inconsistent, and 100 means completely consistent

All scores must be integers from 1 to 100. 1 means completely inconsistent, and 100 means completely consistent

work page

[67] [67]

the key frames of this segment

By default, score according to "the key frames of this segment" and "the prompt of this segment". Do not add points based on beginning, middle, or ending position, narrative structure, or segment importance

[68]

If a segment prompt mainly describes continuation, maintenance, confirmation, observation, or still being in some state, you may use the state established in the previous segment to judge whether the state remains valid. Do not mechanically assign a low score only because the action magnitude in this segment is small

[69]

Scoring must be based on directly observable evidence:
- Whether the explicit action occurs
- Whether character interaction occurs
- Whether object change, object handoff, or state transition occurs
- Whether the segment prompt is executed at the correct time
- If the segment is a continuation/maintenance prompt, whether it clearly preserves the previousl...

[70]

If only the characters, scene, or general atmosphere match, but the key action, interaction, object ownership, handoff receiving relation, or state change is unclear, `segment_score` should not exceed 50. For multi-person scenes, as long as character identity, interaction target, or object ownership is clearly uncertain, the score should not be high even ...

[71]

If there is almost no direct evidence supporting the segment prompt, `segment_score` should be in 1-20. If there is only weakly related, generic, or substitutable evidence, `segment_score` should usually be in 20-50, not high

[72]

Adjacent segments may have the same score, and large jumps are allowed. Do not output an increasing or decreasing score sequence merely to make it look smooth. However, do not automatically assign an extremely low score to a "continuation holds" segment only because later-segment action is weaker

[73]

`overall_score` indicates whether the whole video completes the key events in the correct temporal order and maintains later state consistency. If the key action, main interaction, object handoff, or state transition clearly occurs and later segments do not obviously violate the prompt, a medium-high score is allowed even if later segments have weaker ...

[74]

Assign 100 only when character identity, key action, object relation, and timing are all very clear and almost unambiguous. Output format: {"segment_scores":[int,int,...],"overall_score":int}

Planner and judge output schemas. For metric aggregation, the LLM planner sees only the six prompt texts and returns {segment_importance: [w_1, ..., w_6]}, w_i ∈ {1...
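To make the two schemas concrete, the sketch below validates a judge reply and then combines per-segment scores with planner weights. The aggregation rule is not visible in this excerpt, so the normalized weighted mean used here is an assumption for illustration, not the paper's confirmed metric; the names `parse_judge_output` and `weighted_segment_score` are likewise hypothetical.

```python
import json


def _is_score(x) -> bool:
    # Scores are integers from 1 to 100; bool is excluded explicitly.
    return isinstance(x, int) and not isinstance(x, bool) and 1 <= x <= 100


def parse_judge_output(raw: str, num_segments: int) -> tuple[list[int], int]:
    """Validate {"segment_scores":[...],"overall_score":int} (hypothetical helper)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = data.get("segment_scores")
    overall = data.get("overall_score")
    if not isinstance(scores, list) or len(scores) != num_segments:
        raise ValueError("segment_scores must have one entry per segment")
    if not all(_is_score(s) for s in scores) or not _is_score(overall):
        raise ValueError("all scores must be integers from 1 to 100")
    return scores, overall


def weighted_segment_score(scores: list[int], weights: list[int]) -> float:
    """ASSUMPTION: importance-weighted mean, sum(w_i * s_i) / sum(w_i)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must align one-to-one")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

Under this assumed rule, a segment with weight 3 counts three times as much as a segment with weight 1, so `weighted_segment_score([80, 40], [3, 1])` evaluates to 70.0.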
