Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

Haoran Li; Haoyang Huang; Jie Huang; Junhao Zhuang; Nan Duan; Qiang Xu; Shiyi Zhang; Songchun Zhang; Weiyang Jin; Yaowei Li

arXiv: 2606.04527 · v1 · pith:HTLMC3JQnew · submitted 2026-06-03 · 💻 cs.MM · cs.CV · cs.GR

Yuxuan Bian, Zeyue Xue, Songchun Zhang, Shiyi Zhang, Weiyang Jin, Yaowei Li, Junhao Zhuang, Haoran Li, Jie Huang, Haoyang Huang, Nan Duan, Qiang Xu

Pith reviewed 2026-06-28 03:23 UTC · model grok-4.3

classification 💻 cs.MM · cs.CV · cs.GR

keywords infinite video generation · autoregressive video · evolving memory · memory queries · video diffusion transformers · real-time video synthesis · relative positional encoding

The pith

Echo-Infinity replaces fixed memory caches with learnable queries that evolve by attention and gating to support constant-cost infinite video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an autoregressive framework that maintains a fixed-size evolving memory for video diffusion transformers. Instead of predefined KV-cache rules or fixed compression ratios, it uses learnable Memory Queries that get updated through attention and gating whenever older frames leave the local window. These queries are trained end-to-end with the model, allowing arbitrary-length history to be compressed at constant computational cost regardless of video duration. A separate Unified Relative RoPE Recipe anchors sink frames at position zero and caps the newest frame ID at the model's pretrained maximum, removing train-test extrapolation gaps. The result is reported state-of-the-art quality on both short and long clips plus the first claimed 24-hour real-time rollouts exceeding 1.3 million frames.
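To make the mechanism concrete, here is a minimal sketch of one way a fixed-size set of memory queries could absorb an evicted frame through attention and a gate. The abstract does not give the update equations, so everything below is an assumption: the function name `update_memory`, the scalar sigmoid gate, the use of evicted tokens as both keys and values, and the omission of learned projections. It illustrates only that the memory keeps the same size, and therefore the same update cost, however many frames have been evicted.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update_memory(memory, evicted_tokens, gate_bias=0.0):
    """Hypothetical update rule (not the paper's equations).

    Each memory query attends over the tokens of the evicted frame, then
    a sigmoid gate blends the attended summary into the old query.
    Learned projections are omitted for brevity.
    """
    dim = len(memory[0])
    scale = 1.0 / math.sqrt(dim)
    new_memory = []
    for q in memory:
        # Attention of this query over the evicted tokens (keys == values here).
        weights = softmax([dot(q, k) * scale for k in evicted_tokens])
        summary = [sum(w * v[i] for w, v in zip(weights, evicted_tokens))
                   for i in range(dim)]
        # Scalar gate in (0, 1): how much of the new summary to write.
        gate = 1.0 / (1.0 + math.exp(-(dot(q, summary) * scale + gate_bias)))
        new_memory.append([(1.0 - gate) * qi + gate * si
                           for qi, si in zip(q, summary)])
    return new_memory

# Usage: the memory stays 2 x 4 no matter how many frames are evicted.
memory = [[0.1, 0.0, -0.2, 0.3], [0.0, 0.5, 0.1, -0.1]]
for step in range(1000):
    # Synthetic "evicted frame" made of 3 tokens of dimension 4.
    frame = [[math.sin(step + j + i) for i in range(4)] for j in range(3)]
    memory = update_memory(memory, frame)
```

Because each update is a gated convex blend, the sketch cannot grow without bound, but it says nothing about what information survives; that is the open question the review raises.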

Core claim

Echo-Infinity shows that learnable Memory Queries, updated by attention and a gating mechanism on evicted frames, can serve as an evolving memory that abstracts and compresses any-length history at constant cost while remaining jointly optimized with the video DiTs; the same queries also function as a reusable generation prior, and the accompanying Unified Relative RoPE Recipe eliminates the finite RoPE window constraint that previously limited autoregressive length.

What carries the argument

Learnable Memory Queries, updated by attention and gating whenever a frame is evicted from the local window, form an evolving memory that replaces handcrafted KV-cache schedules.

If this is right

Video generation cost stays constant even as length grows to millions of frames, enabling real-time 24-hour rollouts.
The queries improve quality even when used solely as an initial state without further updates.
The Unified Relative RoPE Recipe removes the need for inference-time RoPE adaptation and closes the train-test length gap.
The same memory mechanism supports both short-clip and long-video regimes at state-of-the-art quality.
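As a sanity check on the scale of these claims, the arithmetic below derives the average frame rate implied by "more than 1.3 million frames in 24 hours" and contrasts a context that grows with history against a fixed one. The frame rate is inferred from those two numbers, not quoted from the paper, and the token counts passed to the helper functions are illustrative placeholders rather than the paper's settings.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def implied_fps(frames, seconds=SECONDS_PER_DAY):
    """Average frames per second needed to emit `frames` in `seconds`."""
    return frames / seconds

def full_history_tokens(frames, tokens_per_frame):
    """Context size if every past frame were kept (grows linearly)."""
    return frames * tokens_per_frame

def fixed_context_tokens(sink_frames, window_frames, memory_tokens,
                         tokens_per_frame):
    """Context size with sink frames, a local window, and a fixed memory."""
    return (sink_frames + window_frames) * tokens_per_frame + memory_tokens

print(round(implied_fps(1_300_000), 2))  # 15.05
```

With the placeholder values `tokens_per_frame=100`, one sink frame, an 8-frame window, and 64 memory tokens, the fixed context holds 964 tokens at every step, while a full history of 1.3 million frames would hold 130 million.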

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the queries truly preserve critical history, the approach could transfer to other autoregressive modalities that currently rely on sliding-window caches.
Constant-cost memory suggests the same architecture could support live interactive video generation where new frames arrive continuously.
The method implies that explicit compression schedules may be unnecessary once memory management is learned jointly with the generator.

Load-bearing premise

The Memory Queries can abstract and compress arbitrary-length history without critical information loss while staying optimized end-to-end with the DiTs.

What would settle it

A controlled rollout in which generation quality measurably declines after roughly 10,000 frames when using only the learned queries versus retaining full uncompressed history.
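One way such a test could be scored, sketched under assumptions: given a per-frame quality score for the memory-only rollout and for a full-history reference, compare their mean gap before and after a chosen frame index. The function name `drift_gap`, the choice of quality metric, and the 10,000-frame split are placeholders; the paper does not specify this protocol.

```python
from statistics import mean

def drift_gap(memory_scores, reference_scores, split=10_000):
    """Mean (reference - memory) quality gap before and after `split`.

    Hypothetical scoring helper: a late gap that is clearly larger than
    the early gap would indicate that the compressed memory degrades.
    """
    if len(memory_scores) != len(reference_scores):
        raise ValueError("rollouts must have equal length")
    if not 0 < split < len(memory_scores):
        raise ValueError("split must fall strictly inside the rollout")
    gaps = [r - m for m, r in zip(memory_scores, reference_scores)]
    return mean(gaps[:split]), mean(gaps[split:])

# Synthetic example: the memory-only rollout matches the reference for
# 10,000 frames, then loses quality linearly.
reference = [1.0] * 15_000
memory_only = [1.0] * 10_000 + [1.0 - 1e-5 * i for i in range(5_000)]
early, late = drift_gap(memory_only, reference)
```

In practice the full-history reference becomes expensive well before millions of frames, so a real study would likely cap the horizon or subsample frames.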

Original abstract

We present Echo-Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and ignorance of autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce the Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (>1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Learnable Memory Queries plus the RoPE anchoring trick are the actual new pieces, but the 1.3M-frame rollout claim rests on evidence that is still thin.

Echo-Infinity replaces hand-crafted KV-cache schedules and heuristic compression with learnable Memory Queries that get updated by attention plus gating when frames leave the local window. The Unified Relative RoPE Recipe anchors sink frames at position 0 and keeps the newest frame id inside the pretrained temporal range. Both moves are presented as direct responses to the usual problems of information loss and train-test mismatch in autoregressive video diffusion.
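The anchoring rule can be sketched as a small position-id assignment. The abstract states only two constraints: sink frames start at id 0, and the newest frame id never exceeds the pretrained maximum. Everything else here is an assumption made for illustration, including the function name `temporal_rope_ids`, the contiguous window ids that end at the capped newest id, and the decision to leave the Memory Queries out of the id assignment.

```python
def temporal_rope_ids(num_generated, num_sink, window, max_id):
    """Return (sink_ids, window_ids) for the frames currently in context.

    Hypothetical reading of the recipe: sink frames keep ids 0..num_sink-1,
    and local-window frames get contiguous ids ending at the newest frame's
    id, which is capped at `max_id` (the pretrained maximum temporal id).
    """
    if max_id < num_sink + window - 1:
        raise ValueError("max_id too small to hold sink plus window")
    newest_abs = num_generated - 1        # absolute index of the newest frame
    newest_id = min(newest_abs, max_id)   # cap at the pretrained maximum
    sink_ids = list(range(min(num_sink, num_generated)))
    # Frames in the local window, oldest first, excluding sink frames.
    in_window = min(window, max(0, num_generated - num_sink))
    window_ids = [newest_id - (in_window - 1) + i for i in range(in_window)]
    return sink_ids, window_ids

# Early in the rollout the ids are simply the absolute frame indices.
print(temporal_rope_ids(5, num_sink=2, window=4, max_id=20))
# Deep into the rollout the ids stay inside the pretrained range.
print(temporal_rope_ids(1_300_000, num_sink=2, window=4, max_id=20))
```

Under this reading, the ids seen at frame 1.3 million are ids the model already saw during training, which is the sense in which the extrapolation gap would be closed.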

The paper reports state-of-the-art numbers on standard long- and short-video benchmarks and shows real-time generation running for 24 hours. If those numbers hold up under scrutiny, the approach gives a concrete way to keep memory cost constant while still letting the model improve from the evolving queries.

The main open question is whether the queries actually retain what matters once the rollout exceeds the training clip length. Everything is optimized end-to-end on finite videos, so any gradual loss of detail or compounding drift at 1.3 million frames would not have been directly penalized during training. The abstract does not include ablations that measure retained information or error growth as a function of length, which leaves the extreme-scale claim harder to assess.

The work is aimed at groups already building autoregressive video models who need practical memory management. It is coherent enough and addresses a clear engineering bottleneck, so it should go to peer review even though the longest-rollout results will probably need extra validation.

Referee Report

1 major / 1 minor

Summary. The manuscript presents Echo-Infinity, an autoregressive framework for real-time infinite video generation. It replaces handcrafted KV-cache or heuristic compression with learnable Memory Queries that are updated via attention and gating upon frame eviction from the local window; these queries are optimized end-to-end with video DiTs to support arbitrary compression ratios at constant cost independent of sequence length. A Unified Relative RoPE Recipe is introduced that anchors sink frames at id 0 and caps the newest frame id at the pretrained maximum, eliminating the train-test RoPE extrapolation gap. The paper claims state-of-the-art results on long and short video generation tasks together with the first demonstration of 24-hour (>1.3 M frame) real-time rollouts.

Significance. If the central empirical claims hold, the work would be significant for enabling practical infinite video synthesis by addressing memory scaling and positional-encoding limits in diffusion transformers. The end-to-end optimization of an evolving memory prior and the explicit RoPE anchoring constitute concrete technical contributions that could generalize beyond the reported setting.

major comments (1)

[Abstract] Central claim paragraph: the assertion that Memory Queries 'abstract and compress arbitrary-length history without critical information loss' while remaining end-to-end optimized is load-bearing for the 1.3M-frame rollout result, yet the provided text supplies no direct measurement (e.g., reconstruction fidelity, an information-retention metric, or an ablation at extreme lengths) showing that the learned compression avoids compounding errors when training occurs only on finite clips.

minor comments (1)

The abstract refers to 'Unified Relative RoPE Recipe' and 'RoPE fix' without stating the precise anchoring rule or the exact modification to the rotary embedding computation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the central claim in the abstract. We address the point below and will strengthen the manuscript accordingly.

Referee: [Abstract] Central claim paragraph: the assertion that Memory Queries 'abstract and compress arbitrary-length history without critical information loss' while remaining end-to-end optimized is load-bearing for the 1.3M-frame rollout result, yet the provided text supplies no direct measurement (e.g., reconstruction fidelity, an information-retention metric, or an ablation at extreme lengths) showing that the learned compression avoids compounding errors when training occurs only on finite clips.

Authors: We agree that the manuscript would benefit from explicit quantitative support for the information-retention properties of the Memory Queries at lengths far beyond the training distribution. The current evidence rests on end-to-end optimization together with the observed stability of the 24-hour rollout; however, these do not constitute the direct reconstruction-fidelity or error-accumulation metrics requested. In the revised version we will add (i) a proxy reconstruction task that measures how faithfully the evolving memory reconstructs held-out frames after eviction at increasing horizons and (ii) an ablation plotting per-frame generation error accumulation up to the longest feasible sequence length permitted by available compute. These additions will be placed in the experiments section and referenced from the abstract.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; architectural proposal is self-contained.

The paper proposes learnable Memory Queries updated via attention/gating and a Unified Relative RoPE Recipe as design choices, optimized end-to-end with DiTs and validated through empirical evaluation on video generation tasks. No mathematical derivation chain exists that reduces a claimed result (e.g., arbitrary-length compression without loss) to fitted inputs or self-citations by construction. No self-definitional steps, fitted predictions renamed as results, or load-bearing self-citations are present in the provided text. Performance on long rollouts is presented as an experimental outcome rather than a tautological equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level description of Memory Queries.

invented entities (1)

Memory Query · no independent evidence
purpose: Dynamically filter, abstract, and compress history at constant cost
Introduced as the core learnable component updated by attention and gating




[1] [1]

ReCamMaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. ReCamMaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

2025

[2] [2]

Video-As-Prompt: Unified semantic control for video generation.arXiv preprint arXiv:2510.20888, 2025

Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, and Qiang Xu. Video-As-Prompt: Unified semantic control for video generation.arXiv preprint arXiv:2510.20888, 2025

arXiv 2025

[3] [3]

MotionCraft: Crafting whole-body motion with plug-and-play multimodal controls

Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, and Qiang Xu. MotionCraft: Crafting whole-body motion with plug-and-play multimodal controls. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 1880–1888, 2025

2025

[4] [4]

VideoPainter: Any-length video inpainting and editing with plug-and-play context control

Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, and Qiang Xu. VideoPainter: Any-length video inpainting and editing with plug-and-play context control. InProceedings of the Special Interest Group on Computer Graphics and Interactive TechniquesConference Papers, pages 1–12, 2025

2025

[5] [5]

Medusa: Simple llm inference acceleration framework with multiple decoding heads.arXiv preprint arXiv:2401.10774, 2024

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads.arXiv preprint arXiv:2401.10774, 2024

Pith/arXiv arXiv 2024

[6] Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. In NeurIPS, 2024.

[7] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025.

[8] Jintao Chen, Chengyu Bai, Junjun Hu, Xinda Xue, and Mu Xu. Grounded forcing: Bridging time-independent semantics and proximal dynamics in autoregressive video synthesis. arXiv preprint arXiv:2604.06939, 2026.

[9] Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, and Wenhu Chen. Context forcing: Consistent autoregressive video generation with long context. arXiv preprint arXiv:2602.06028, 2026.

[10] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[11] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025.

[12] Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024.

[13] John DE Gabrieli. Cognitive neuroscience of human memory. Annual review of psychology, 49(1):87–115, 1998.

[14] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

[15] Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. LTX-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026.

[16] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021.

[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[18] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

[19] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

[20] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

[21] Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. MemFlow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699, 2025.

[22] Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, and Wangmeng Zuo. LoViC: Efficient long video generation with context compression. arXiv preprint arXiv:2507.12952, 2025.

[23] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024.

[24] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.

[25] Youngrae Kim, Qixin Hu, C.-C. Jay Kuo, and Peter A. Beerel. MemRoPE: Training-free infinite video generation via evolving memory tokens. arXiv preprint arXiv:2603.12513, 2026.

[26] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[27] Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. In ICLR, 2026.

[28] Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, and Kaipeng Zhang. PackForcing: Short video training suffices for long video sampling and long context inference. arXiv preprint arXiv:2603.25730, 2026.

[29] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

[30] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. MovieGen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.

[31] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[32] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.

[33] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[34] Wenhao Wang and Yi Yang. VidProM: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems, 37:65618–65642, 2024.

[35] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[36] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[37] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Ying-Cong Chen, Yao Lu, Song Han, and Yukang Chen. LongLive: Real-time interactive long video generation. In ICLR, 2026.

[38] Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, and Peng-Tao Jiang. Anchor forcing: Anchor memory and tri-region RoPE for interactive streaming video diffusion. arXiv preprint arXiv:2603.13405, 2026.

[39] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[40] Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-RoPE: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649, 2025.

[41] Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, and Seungryong Kim. Deep forcing: Training-free long video generation with deep sink and participative compression. arXiv preprint arXiv:2512.05081, 2025.

[42] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024.

[43] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024.

[44] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025.

[45] Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, et al. VideoSSM: Autoregressive long video generation with hybrid state-space memory. arXiv preprint arXiv:2512.04519, 2025.

[46] Lvmin Zhang and Maneesh Agrawala. Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626, 2025.

[47] Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. Pretraining frame preservation for lightweight autoregressive video history embedding. arXiv preprint arXiv:2512.23851, 2025.

[48] Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, and Ioannis Patras. Relax forcing: Relaxed kv-memory for consistent long video generation. arXiv preprint arXiv:2603.21366, 2026.

[49] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.

[50] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026.

[51] Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, and Yansong Tang. Memorize-and-generate: Towards long-term consistency in real-time video generation. arXiv preprint arXiv:2512.18741, 2025.

In this appendix, we first provide additional qualitative visualizations of Echo-Infinity on 30s, 240s, and 60s interactive long video generation (§A). We then detail...