Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
Pith reviewed 2026-06-28 03:23 UTC · model grok-4.3
The pith
Echo-Infinity replaces fixed memory caches with learnable queries that evolve by attention and gating to support constant-cost infinite video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Echo-Infinity shows that learnable Memory Queries, updated by attention and a gating mechanism on evicted frames, can serve as an evolving memory that abstracts and compresses any-length history at constant cost while remaining jointly optimized with the video DiTs; the same queries also function as a reusable generation prior, and the accompanying Unified Relative RoPE Recipe eliminates the finite RoPE window constraint that previously limited autoregressive length.
What carries the argument
Learnable Memory Queries, updated by attention and gating when frames are evicted from the local window, form an evolving memory that replaces handcrafted KV-cache schedules.
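To make the mechanism concrete, here is a minimal NumPy sketch of one gated attention update applied when a frame leaves the local window. It is an illustration under stated assumptions, not the paper's implementation: the function name `update_memory`, the projection matrices, the sigmoid gate and all shapes are hypothetical, since this page does not reproduce the paper's equations.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def update_memory(memory, evicted, w_q, w_k, w_v, w_g):
    """One hypothetical gated update of the memory queries.

    memory:  (M, D) current memory queries
    evicted: (T, D) tokens of the frame leaving the local window
    w_q, w_k, w_v: (D, D) projections; w_g: (2 * D, D) gate projection
    """
    d = memory.shape[1]
    q = memory @ w_q
    k = evicted @ w_k
    v = evicted @ w_v
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (M, T): memory attends to evicted tokens
    read = attn @ v                                  # (M, D): content read from the evicted frame
    gate_in = np.concatenate([memory, read], axis=1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_g)))    # (M, D): sigmoid gate in (0, 1)
    # Assumed update rule: blend old memory with the new read.
    return gate * read + (1.0 - gate) * memory

# Toy rollout with random (untrained) weights: memory size never changes.
rng = np.random.default_rng(0)
M, T, D = 4, 16, 8
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
w_g = rng.normal(scale=0.1, size=(2 * D, D))
memory = rng.normal(size=(M, D))
for _ in range(1000):
    memory = update_memory(memory, rng.normal(size=(T, D)), w_q, w_k, w_v, w_g)
```

In this sketch each update costs the same regardless of how many frames came before, because only the fixed-size memory and the single evicted frame take part in it. That is the constant-cost property the claim relies on.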
If this is right
- Per-frame generation cost stays constant even as length grows past a million frames, enabling real-time 24-hour rollouts.
- The queries improve quality even when used solely as an initial state without further updates.
- The Unified Relative RoPE Recipe removes the need for inference-time RoPE adaptation and closes the train-test length gap.
- The same memory mechanism supports both short-clip and long-video regimes at state-of-the-art quality.
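For a sense of scale, the abstract's figure of more than 1.3 M frames over 24 hours implies a lower bound on the sustained frame rate. The short calculation below shows that bound. The helper name `implied_min_fps` is illustrative, and the actual frame rate used by the authors is not stated on this page.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def implied_min_fps(total_frames, duration_s=SECONDS_PER_DAY):
    """Average frames per second needed to emit total_frames in duration_s."""
    return total_frames / duration_s

fps = implied_min_fps(1_300_000)
print(f"{fps:.2f} frames per second")  # prints "15.05 frames per second"
```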
Where Pith is reading between the lines
- If the queries truly preserve critical history, the approach could transfer to other autoregressive modalities that currently rely on sliding-window caches.
- Constant-cost memory suggests the same architecture could support live interactive video generation where new frames arrive continuously.
- The method implies that explicit compression schedules may be unnecessary once memory management is learned jointly with the generator.
Load-bearing premise
The Memory Queries can abstract and compress arbitrary-length history without critical information loss while staying optimized end-to-end with the DiTs.
What would settle it
A controlled rollout in which generation quality measurably declines after roughly 10,000 frames when using only the learned queries versus retaining full uncompressed history.
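One way such a comparison could be scored is sketched below. It assumes a per-frame quality score is already available for both rollouts. The function name `quality_gap`, the averaging window and the scores in the demo are hypothetical, and the demo arrays are synthetic placeholders, not measurements.

```python
import numpy as np

def quality_gap(memory_scores, full_history_scores, horizon=10_000, window=1_000):
    """Mean score gap (full history minus memory-only) early and after the horizon."""
    m = np.asarray(memory_scores, dtype=float)
    f = np.asarray(full_history_scores, dtype=float)
    if m.ndim != 1 or m.shape != f.shape:
        raise ValueError("score series must be 1-D and the same length")
    if len(m) < horizon + window:
        raise ValueError("rollout is too short for the requested horizon")
    gap = f - m
    early = gap[:window].mean()                   # gap over the first `window` frames
    late = gap[horizon:horizon + window].mean()   # gap just past the horizon
    return early, late

# Synthetic placeholder series: memory-only quality drops by 0.1 after frame 10,000.
full = np.ones(12_000)
mem = np.ones(12_000)
mem[10_000:] -= 0.1
early, late = quality_gap(mem, full)
```

A late gap clearly larger than the early gap would count against the premise. Comparable gaps would support it, at least up to the tested horizon.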
Original abstract
We present Echo-Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and ignorance of autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce a Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (>1.3 M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
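The RoPE recipe described in the abstract can be read as a rule for assigning temporal position ids to the frames currently in context. The sketch below is one plausible reading, not the authors' code. It assumes the sink frames keep ids starting at 0, the local window is contiguous, and the newest frame's id follows its absolute index until it reaches the pretrained maximum, where it is capped. The function name `temporal_rope_ids` and its parameters are hypothetical, and how the memory queries themselves are positioned is not covered.

```python
def temporal_rope_ids(newest_frame, num_sink, window, max_id):
    """Return (sink_ids, window_ids) for the frames currently in context.

    newest_frame: absolute 0-based index of the newest frame in the rollout
    num_sink:     number of sink frames anchored at the start
    window:       maximum number of recent frames kept in the local window
    max_id:       largest temporal RoPE id seen in pretraining (assumed known)
    """
    if num_sink + window > max_id + 1:
        raise ValueError("sink plus window do not fit in the pretrained id range")
    # Sink frames are anchored so that their ids start from 0.
    sink_ids = list(range(min(num_sink, newest_frame + 1)))
    # The newest id grows with the rollout but never exceeds max_id.
    newest_id = min(newest_frame, max_id)
    # The window ends at the newest id and never overlaps the sink ids.
    first = max(newest_id - window + 1, num_sink)
    window_ids = list(range(first, newest_id + 1))
    return sink_ids, window_ids

print(temporal_rope_ids(5, num_sink=3, window=4, max_id=20))
# ([0, 1, 2], [3, 4, 5])
print(temporal_rope_ids(1_000_000, num_sink=3, window=4, max_id=20))
# ([0, 1, 2], [17, 18, 19, 20])
```

Under this reading, the id pattern is the same at frame 1,000 and at frame 1,000,000, so the model should never see a position id beyond its pretrained range at inference time.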
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Echo-Infinity, an autoregressive framework for real-time infinite video generation. It replaces handcrafted KV-cache or heuristic compression with learnable Memory Queries that are updated via attention and gating upon frame eviction from the local window; these queries are optimized end-to-end with video DiTs to support arbitrary compression ratios at constant cost independent of sequence length. A Unified Relative RoPE Recipe is introduced that anchors sink frames at id 0 and caps the newest frame id at the pretrained maximum, eliminating the train-test RoPE extrapolation gap. The paper claims state-of-the-art results on long and short video generation tasks together with the first demonstration of 24-hour (>1.3 M frame) real-time rollouts.
Significance. If the central empirical claims hold, the work would be significant for enabling practical infinite video synthesis by addressing memory scaling and positional-encoding limits in diffusion transformers. The end-to-end optimization of an evolving memory prior and the explicit RoPE anchoring constitute concrete technical contributions that could generalize beyond the reported setting.
major comments (1)
- [Abstract] Abstract (central claim paragraph): the assertion that Memory Queries 'abstract and compress arbitrary-length history without critical information loss' while remaining end-to-end optimized is load-bearing for the 1.3 M-frame rollout result, yet the provided text supplies no direct measurement (e.g., reconstruction fidelity, information-retention metric, or ablation at extreme lengths) showing that the learned compression avoids compounding errors when training occurs only on finite clips.
minor comments (1)
- The abstract refers to 'Unified Relative RoPE Recipe' and 'RoPE fix' without stating the precise anchoring rule or the exact modification to the rotary embedding computation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the central claim in the abstract. We address the point below and will strengthen the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] Abstract (central claim paragraph): the assertion that Memory Queries 'abstract and compress arbitrary-length history without critical information loss' while remaining end-to-end optimized is load-bearing for the 1.3 M-frame rollout result, yet the provided text supplies no direct measurement (e.g., reconstruction fidelity, information-retention metric, or ablation at extreme lengths) showing that the learned compression avoids compounding errors when training occurs only on finite clips.
Authors: We agree that the manuscript would benefit from explicit quantitative support for the information-retention properties of the Memory Queries at lengths far beyond the training distribution. The current evidence rests on end-to-end optimization together with the observed stability of the 24-hour rollout; however, these do not constitute the direct reconstruction-fidelity or error-accumulation metrics requested. In the revised version we will add (i) a proxy reconstruction task that measures how faithfully the evolving memory reconstructs held-out frames after eviction at increasing horizons and (ii) an ablation plotting per-frame generation error accumulation up to the longest feasible sequence length permitted by available compute. These additions will be placed in the experiments section and referenced from the abstract.
Revision: yes
Circularity Check
No significant circularity; architectural proposal is self-contained.
Full rationale
The paper proposes learnable Memory Queries updated via attention/gating and a Unified Relative RoPE Recipe as design choices, optimized end-to-end with DiTs and validated through empirical evaluation on video generation tasks. No mathematical derivation chain exists that reduces a claimed result (e.g., arbitrary-length compression without loss) to fitted inputs or self-citations by construction. No self-definitional steps, fitted predictions renamed as results, or load-bearing self-citations are present in the provided text. Performance on long rollouts is presented as an experimental outcome rather than a tautological equivalence.
Axiom & Free-Parameter Ledger
invented entities (1)
- Memory Query: no independent evidence