pith. machine review for the scientific record.

arxiv: 2605.04461 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Stream-T1: Test-Time Scaling for Streaming Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:11 UTC · model grok-4.3

classification: 💻 cs.CV

keywords: test-time scaling · streaming video generation · diffusion models · temporal consistency · noise propagation · reward pruning · video synthesis

The pith

Stream-T1 shows that test-time scaling works efficiently for video generation when applied chunk-by-chunk in a streaming setup rather than to full sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current test-time scaling methods for diffusion-based video generation are too costly because they explore many candidates without temporal guidance. It argues that shifting to streaming video generation solves this by using short chunks with few denoising steps, which naturally support scaling while allowing historical context to guide new frames. The authors introduce Stream-T1 with three units that propagate high-quality noise from past chunks, prune candidates using combined short- and long-term rewards, and route evicted memory based on reward feedback to maintain coherence. A reader would care because this approach promises higher-quality videos without retraining expensive models, using only extra compute at inference time.

Core claim

Stream-T1 is a test-time scaling framework for streaming video generation built around three units: Stream-Scaled Noise Propagation refines the current chunk's initial noise using high-quality noise from prior chunks to establish temporal dependency; Stream-Scaled Reward Pruning scores candidates by balancing immediate spatial quality with sliding-window temporal coherence; and Stream-Scaled Memory Sinking routes evicted KV-cache context into update pathways guided by reward signals so past visuals anchor future frames. On 5-second and 30-second benchmarks this yields better temporal consistency, motion smoothness, and frame quality than prior methods.
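
To make the chunk-level control flow concrete, the following is a minimal sketch of how a streaming test-time scaling loop of this shape could be organized. It is an editorial illustration under assumed interfaces, not the paper's implementation: the model methods (init_context, sample_noise, denoise, update_context), the candidate count, and the noise-mixing weight are hypothetical placeholders, and the combined reward passed in as score_candidate is sketched separately after the "If this is right" list below.

```python
import torch

def streaming_tts(model, prompt, score_candidate, num_chunks,
                  num_candidates=4, denoise_steps=4, mix=0.5):
    """Sketch of a chunk-wise test-time scaling loop (hypothetical API).

    Mirrors the three units named above: noise propagated from the best
    prior chunk, reward-based pruning of candidates, and reward-guided
    updates to the retained context.
    """
    memory = model.init_context(prompt)   # assumed KV-cache-like context
    best_noise = None                     # proven high-quality noise from the prior chunk
    video = []

    for _ in range(num_chunks):
        candidates = []
        for _ in range(num_candidates):
            # Noise propagation (sketch): blend fresh noise with the proven
            # noise of the previous chunk instead of sampling from scratch.
            noise = model.sample_noise()
            if best_noise is not None:
                noise = mix * best_noise + (1 - mix) * noise

            chunk = model.denoise(noise, context=memory, steps=denoise_steps)
            reward = score_candidate(chunk, video)
            candidates.append((reward, noise, chunk))

        # Reward pruning (sketch): keep only the best-scoring candidate.
        reward, best_noise, best_chunk = max(candidates, key=lambda c: c[0])
        video.append(best_chunk)

        # Memory sinking (sketch): let the reward modulate how the evicted
        # context is folded back into the retained memory.
        memory = model.update_context(memory, best_chunk, weight=reward)

    return torch.cat(video, dim=0)
```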

What carries the argument

Three Stream-Scaled units (noise propagation from prior chunks, reward pruning across short- and long-term windows, and reward-guided memory sinking) that together turn chunk-level few-step synthesis into an efficient test-time scaling regime.

If this is right

  • Temporal dependency can be injected at test time by reusing proven prior-chunk noise instead of sampling fresh noise for every segment.
  • Candidate selection can trade off local frame aesthetics against global video coherence by combining immediate and sliding-window rewards (one plausible form is sketched after this list).
  • KV-cache memory can be dynamically updated based on reward feedback so evicted context continues to guide later chunks without uniform overwriting.
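
One plausible way to realize the second bullet's trade-off is a convex mixture of an immediate per-chunk score and a sliding-window coherence score, as in the sketch below. The weight lam, the window length, and the two reward callables are illustrative assumptions, not the paper's reward design.

```python
def score_candidate(chunk, history, short_reward=None, long_reward=None,
                    lam=0.6, window=3):
    """Sketch of a combined short/long-term reward (hypothetical form).

    short_reward(chunk)      -> spatial/aesthetic score for the candidate chunk
    long_reward(chunk, past) -> temporal-coherence score against one past chunk
    """
    short_term = short_reward(chunk)      # immediate spatial quality

    recent = history[-window:]            # sliding window over chunks kept so far
    if recent:
        long_term = sum(long_reward(chunk, past) for past in recent) / len(recent)
    else:
        long_term = short_term            # first chunk: nothing to be coherent with yet

    # Convex mixture of local aesthetics and global coherence (lam is assumed).
    return lam * short_term + (1.0 - lam) * long_term
```

Any off-the-shelf image-preference or frame-similarity model could stand in for the two callables; the only point the sketch makes is that pruning can see both time scales at once.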

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chunk-wise scaling pattern could be tested on other autoregressive generation tasks such as long audio or conditional image sequences.
  • If the overhead of the three units stays low, the method might allow longer generated videos without proportional increases in training data or model size.
  • A direct measurement of wall-clock time versus quality gain on videos longer than 30 seconds would clarify whether the streaming advantage scales.

Load-bearing premise

That chunk-level synthesis with few denoising steps is naturally suited to test-time scaling and that the three units can be combined without creating new instabilities or excessive overhead.

What would settle it

A controlled experiment in which the same three units are applied to non-streaming full-video diffusion and produce equal or lower quality at higher total cost than standard test-time methods would falsify the claimed advantage of the streaming shift.

read the original abstract

While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited for TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduced Stream-T1, a pioneering comprehensive TTS framework exclusively tailored for streaming video generation. Specifically, Stream-T1 is composed of three units: (1) Stream -Scaled Noise Propagation, which actively refines the initial latent noise of the generating chunk using historically proven, high-quality previous chunk noise, effectively establishes temporal dependency and utilizing the historical Gaussian prior to guide the current generation; (2) Stream -Scaled Reward Pruning, which comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) Stream-Scaled Memory Sinking, which dynamically routes the context evicted from KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream. Evaluated on both 5s and 30s comprehensive video benchmarks, Stream-T1 demonstrates profound superiority, significantly improving temporal consistency, motion smoothness, and frame-level visual quality.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce Stream-T1, a comprehensive test-time scaling framework for streaming video generation. By focusing on chunk-level synthesis and few denoising steps, it proposes three units: Stream-Scaled Noise Propagation to refine initial latent noise using historical high-quality noise, Stream-Scaled Reward Pruning to balance local spatial aesthetics and global temporal coherence using short-term and sliding-window long-term evaluations, and Stream-Scaled Memory Sinking to route evicted KV-cache context based on reward feedback. The framework is evaluated on 5s and 30s video benchmarks, claiming profound superiority in temporal consistency, motion smoothness, and frame-level visual quality.

Significance. Should the quantitative results support the claims, this could represent a meaningful contribution to efficient test-time scaling in video diffusion models by exploiting the streaming paradigm to achieve better temporal control with lower overhead. The engineering of the three units to address specific bottlenecks in candidate exploration and temporal guidance is a targeted approach.

major comments (2)
  1. [Abstract] The abstract states that Stream-T1 'demonstrates profound superiority' on benchmarks but provides no quantitative numbers, error bars, baseline comparisons, or details on how candidates are generated and scored. This makes the central performance claim unverifiable from the given information and is load-bearing for the paper's main assertion.
  2. [Proposed Units] The integration of the three units is asserted to enable stable low-overhead TTS without new instabilities, but no analysis, ablations, or discussion of potential error accumulation (e.g., in noise propagation over 30s sequences or reward feedback loops) is evident. This is load-bearing for the claim that chunk-level synthesis is intrinsically suited for TTS.
minor comments (1)
  1. [Abstract] Inconsistent hyphenation in 'Stream -Scaled' (space before hyphen in some places); should be standardized to 'Stream-Scaled' throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] The abstract states that Stream-T1 'demonstrates profound superiority' on benchmarks but provides no quantitative numbers, error bars, baseline comparisons, or details on how candidates are generated and scored. This makes the central performance claim unverifiable from the given information and is load-bearing for the paper's main assertion.

    Authors: We agree that the abstract would benefit from greater specificity to make the performance claims more immediately verifiable. In the revised manuscript, we have updated the abstract to include concise quantitative highlights drawn from our experimental results (e.g., relative gains in temporal consistency and visual quality metrics versus baselines), while preserving brevity. Full details on candidate generation, scoring, error bars, and baseline comparisons are already present in Sections 3 and 4; the abstract revision simply surfaces the most salient numbers for readers. revision: yes

  2. Referee: [Proposed Units] The integration of the three units is asserted to enable stable low-overhead TTS without new instabilities, but no analysis, ablations, or discussion of potential error accumulation (e.g., in noise propagation over 30s sequences or reward feedback loops) is evident. This is load-bearing for the claim that chunk-level synthesis is intrinsically suited for TTS.

    Authors: We acknowledge that an explicit analysis of stability and error accumulation would strengthen the argument that chunk-level synthesis is particularly well-suited for test-time scaling. In the revised manuscript we have added a new subsection (with accompanying ablations and figures) that examines noise propagation drift and reward-feedback loop behavior across 5 s and 30 s sequences. The added experiments show that the memory-sinking mechanism prevents measurable accumulation of errors, thereby supporting the original claim without introducing new instabilities. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical engineering framework with no derivation chain

full rationale

The paper introduces Stream-T1 as a test-time scaling framework for streaming video generation, consisting of three descriptive units (noise propagation, reward pruning, memory sinking) motivated by the suitability of chunk-level synthesis. No equations, first-principles derivations, fitted parameters, or predictions are presented that could reduce to inputs by construction. Central claims rest on empirical benchmark evaluations (5s/30s videos) rather than any self-referential logic, self-citation load-bearing, or ansatz smuggling. The contribution is therefore self-contained as an engineering proposal without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no explicit mathematical derivations, fitted constants, or new postulated entities; the contribution is described entirely at the level of algorithmic components and empirical claims.

pith-pipeline@v0.9.0 · 5582 in / 1087 out tokens · 36010 ms · 2026-05-08T18:11:01.404507+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 37 canonical work pages · 16 internal anchors

  1. [1]

    Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms

    Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, et al. Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms.arXiv preprint arXiv:2507.02076, 2025

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

  4. [4]

    Autoregressive Video Generation Without Vector Quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization.arXiv preprintarXiv:2412.14169, 2024

  5. [5]

    Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

    Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, and Long Chen. Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing.arXiv preprint arXiv:2411.16375, 2024

  6. [6]

    Long-context autoregressive video modeling with next-frame prediction

    Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325, 2025

  7. [7]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081): 633–638, 2025

  8. [8]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  9. [9]

    Scaling image and video generation via test-time evolutionary search

    Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Ling Pan. Scaling image and video generation via test-time evolutionary search.arXiv preprint arXiv:2505.17618, 2025

  10. [10]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  11. [11]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  12. [12]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  13. [13]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  14. [14]

    VideoAR: Autoregressive video generation via next-frame & scale prediction

    Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. Videoar: Autoregressive video generation via next-frame & scale prediction.arXiv preprint arXiv:2601.05966, 2026

  15. [15]

    Compositional image synthesis with inference-time scaling

    Minsuk Ji, Sanghyeok Lee, and Namhyuk Ahn. Compositional image synthesis with inference-time scaling. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4441–4445. IEEE, 2026

  16. [16]

    VideoPoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

  17. [17]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  18. [18]

    Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion.arXiv preprint arXiv:2602.07775, 2026

  19. [19]

    Reflect-dit: Inference-time scaling for text-to-image diffusion transformers via in-context reflection

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-dit: Inference-time scaling for text-to-image diffusion transformers via in-context reflection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15657–15668, 2025

  20. [20]

    Arlon: Boosting diffusion transformers with autoregressive models for long video generation.arXiv preprint arXiv:2410.20502, 2024

    Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation.arXiv preprint arXiv:2410.20502, 2024

  21. [21]

    Video-t1: Test-time scaling for video generation

    Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, and Yueqi Duan. Video-t1: Test-time scaling for video generation. InProceedings of the IEEE/CVF international conference on computer vision, pages 18671–18681, 2025

  22. [22]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

  23. [23]

    Rolling Forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

  24. [24]

    Can 1B LLM surpass 405B LLM? Rethinking compute-optimal test-time scaling

    Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling.arXiv preprint arXiv:2502.06703, 2025

  25. [25]

    Reward Forcing: Efficient streaming video generation with rewarded distribution matching distillation

    Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025

  26. [26]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  27. [27]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025

  28. [28]

    Inference-time text-to-video alignment with diffusion latent beam search

    Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta. Inference-time text-to-video alignment with diffusion latent beam search.arXiv preprint arXiv:2501.19252, 2025

  29. [29]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  30. [30]

    Hierarchical spatio-temporal decoupling for text-to-video generation

    Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hierarchical spatio-temporal decoupling for text-to-video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6635–6645, 2024

  31. [31]

    Test-time scaling of diffusion models via noise trajectory search

    Vignav Ramesh and Morteza Mardani. Test-time scaling of diffusion models via noise trajectory search.arXiv preprint arXiv:2506.03164, 2025

  32. [32]

    Next Block Prediction: Video generation via semi-autoregressive modeling

    Shuhuai Ren, Shuming Ma, Xu Sun, and Furu Wei. Next block prediction: Video generation via semi-autoregressive modeling. arXiv preprint arXiv:2502.07737, 2025

  33. [33]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

  34. [34]

    A general framework for inference-time scaling and steering of diffusion models

    Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848, 2025

  35. [35]

    Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

  37. [37]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 3(4):6, 2025

  38. [38]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023

  39. [39]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  40. [40]

    Art-v: Auto-regressive text-to-video generation with diffusion models

    Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, et al. Art-v: Auto-regressive text-to-video generation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024

  41. [41]

    Imagerysearch: Adaptive test-time search for video generation beyond semantic dependency constraints

    Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, and Kaiqi Huang. Imagerysearch: Adaptive test-time search for video generation beyond semantic dependency constraints. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 10700–10708, 2026

  42. [42]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  43. [43]

    Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11269–11277, 2026

  44. [44]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  45. [45]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

  46. [46]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  47. [47]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025

  48. [48]

    VideoMAR: Autoregressive video generation with continuous tokens

    Hu Yu, Biao Gong, Hangjie Yuan, DanDan Zheng, Weilong Chai, Jingdong Chen, Kecheng Zheng, and Feng Zhao. Videomar: Autoregressive video generation with continuous tokens. arXiv preprint arXiv:2506.14168, 2025

  49. [49]

    Lumos-1: On autoregressive video generation with discrete diffusion from a unified model perspective

    Hangjie Yuan, Weihua Chen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, et al. Lumos-1: On autoregressive video generation with discrete diffusion from a unified model perspective. In The Fourteenth International Conference on Learning Representations

  50. [50]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025

  51. [51]

    Show-1: Marrying pixel and latent diffusion models for text-to-video generation

    David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Computer Vision, 133(4):1879–1893, 2025

  52. [52]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. What, how, where, and how well? a survey on test-time scaling in large language models. arXiv preprint arXiv:2503.24235, 2025

  53. [53]

    Learning multi-dimensional human preference for text-to-image generation

    Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024

  54. [54]

    LatSearch: Latent reward-guided search for faster inference-time scaling in video diffusion

    Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang, Jifei Song, Jiankang Deng, and Ioannis Patras. Latsearch: Latent reward-guided search for faster inference-time scaling in video diffusion, 2026. URL https://arxiv.org/abs/2603.14526

  55. [55]

    MagicVideo: Efficient video generation with latent diffusion models

    Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models.arXiv preprint arXiv:2211.11018, 2022

  56. [56]

    Golden noise for diffusion models: A learning framework

    Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17688–17697, 2025

  57. [57]

    From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning

    Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, and Hongsheng Li. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15329–15339, 2025