pith. machine review for the scientific record.

arxiv: 2605.04461 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Stream-T1: Test-Time Scaling for Streaming Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:11 UTC · model grok-4.3

classification: 💻 cs.CV

keywords: test-time scaling · streaming video generation · diffusion models · temporal consistency · noise propagation · reward pruning · video synthesis

The pith

Stream-T1 shows that test-time scaling works efficiently for video generation when applied chunk-by-chunk in a streaming setup rather than to full sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current test-time scaling methods for diffusion-based video generation are too costly because they explore many candidates without temporal guidance. It argues that shifting to streaming video generation solves this by using short chunks with few denoising steps, which naturally support scaling while allowing historical context to guide new frames. The authors introduce Stream-T1 with three units that propagate high-quality noise from past chunks, prune candidates using combined short- and long-term rewards, and route evicted memory based on reward feedback to maintain coherence. A reader would care because this approach promises higher-quality videos without retraining expensive models, using only extra compute at inference time.

Core claim

Stream-T1 is a test-time scaling framework for streaming video generation built around three units: Stream-Scaled Noise Propagation refines the current chunk's initial noise using high-quality noise from prior chunks to establish temporal dependency; Stream-Scaled Reward Pruning scores candidates by balancing immediate spatial quality with sliding-window temporal coherence; and Stream-Scaled Memory Sinking routes evicted KV-cache context into update pathways guided by reward signals so past visuals anchor future frames. On 5-second and 30-second benchmarks this yields better temporal consistency, motion smoothness, and frame quality than prior methods.
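
To make the chunk-level control flow concrete, the following is a minimal sketch of how a streaming test-time scaling loop of this shape could be organized. It is an editorial illustration under assumed interfaces, not the paper's implementation: the model methods (init_context, sample_noise, denoise, update_context), the candidate count, and the noise-mixing weight are hypothetical placeholders, and the combined reward passed in as score_candidate is sketched separately after the "If this is right" list below.

```python
import torch

def streaming_tts(model, prompt, score_candidate, num_chunks,
                  num_candidates=4, denoise_steps=4, mix=0.5):
    """Sketch of a chunk-wise test-time scaling loop (hypothetical API).

    Mirrors the three units named above: noise propagated from the best
    prior chunk, reward-based pruning of candidates, and reward-guided
    updates to the retained context.
    """
    memory = model.init_context(prompt)   # assumed KV-cache-like context
    best_noise = None                     # proven high-quality noise from the prior chunk
    video = []

    for _ in range(num_chunks):
        candidates = []
        for _ in range(num_candidates):
            # Noise propagation (sketch): blend fresh noise with the proven
            # noise of the previous chunk instead of sampling from scratch.
            noise = model.sample_noise()
            if best_noise is not None:
                noise = mix * best_noise + (1 - mix) * noise

            chunk = model.denoise(noise, context=memory, steps=denoise_steps)
            reward = score_candidate(chunk, video)
            candidates.append((reward, noise, chunk))

        # Reward pruning (sketch): keep only the best-scoring candidate.
        reward, best_noise, best_chunk = max(candidates, key=lambda c: c[0])
        video.append(best_chunk)

        # Memory sinking (sketch): let the reward modulate how the evicted
        # context is folded back into the retained memory.
        memory = model.update_context(memory, best_chunk, weight=reward)

    return torch.cat(video, dim=0)
```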

What carries the argument

Three Stream-Scaled units (noise propagation from prior chunks, reward pruning across short- and long-term windows, and reward-guided memory sinking) that together turn chunk-level few-step synthesis into an efficient test-time scaling regime.

If this is right

  • Temporal dependency can be injected at test time by reusing proven prior-chunk noise instead of sampling fresh noise for every segment.
  • Candidate selection can trade off local frame aesthetics against global video coherence by combining immediate and sliding-window rewards (one plausible form is sketched after this list).
  • KV-cache memory can be dynamically updated based on reward feedback so evicted context continues to guide later chunks without uniform overwriting.
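
One plausible way to realize the second bullet's trade-off is a convex mixture of an immediate per-chunk score and a sliding-window coherence score, as in the sketch below. The weight lam, the window length, and the two reward callables are illustrative assumptions, not the paper's reward design.

```python
def score_candidate(chunk, history, short_reward=None, long_reward=None,
                    lam=0.6, window=3):
    """Sketch of a combined short/long-term reward (hypothetical form).

    short_reward(chunk)      -> spatial/aesthetic score for the candidate chunk
    long_reward(chunk, past) -> temporal-coherence score against one past chunk
    """
    short_term = short_reward(chunk)      # immediate spatial quality

    recent = history[-window:]            # sliding window over chunks kept so far
    if recent:
        long_term = sum(long_reward(chunk, past) for past in recent) / len(recent)
    else:
        long_term = short_term            # first chunk: nothing to be coherent with yet

    # Convex mixture of local aesthetics and global coherence (lam is assumed).
    return lam * short_term + (1.0 - lam) * long_term
```

Any off-the-shelf image-preference or frame-similarity model could stand in for the two callables; the only point the sketch makes is that pruning can see both time scales at once.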

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chunk-wise scaling pattern could be tested on other autoregressive generation tasks such as long audio or conditional image sequences.
  • If the overhead of the three units stays low, the method might allow longer generated videos without proportional increases in training data or model size.
  • A direct measurement of wall-clock time versus quality gain on videos longer than 30 seconds would clarify whether the streaming advantage scales.

Load-bearing premise

That chunk-level synthesis with few denoising steps is naturally suited to test-time scaling and that the three units can be combined without creating new instabilities or excessive overhead.

What would settle it

A controlled experiment in which the same three units are applied to non-streaming full-video diffusion and produce equal or lower quality at higher total cost than standard test-time methods would falsify the claimed advantage of the streaming shift.

read the original abstract

While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited for TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduced Stream-T1, a pioneering comprehensive TTS framework exclusively tailored for streaming video generation. Specifically, Stream-T1 is composed of three units: (1) Stream -Scaled Noise Propagation, which actively refines the initial latent noise of the generating chunk using historically proven, high-quality previous chunk noise, effectively establishes temporal dependency and utilizing the historical Gaussian prior to guide the current generation; (2) Stream -Scaled Reward Pruning, which comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) Stream-Scaled Memory Sinking, which dynamically routes the context evicted from KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream. Evaluated on both 5s and 30s comprehensive video benchmarks, Stream-T1 demonstrates profound superiority, significantly improving temporal consistency, motion smoothness, and frame-level visual quality.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce Stream-T1, a comprehensive test-time scaling framework for streaming video generation. By focusing on chunk-level synthesis and few denoising steps, it proposes three units: Stream-Scaled Noise Propagation to refine initial latent noise using historical high-quality noise, Stream-Scaled Reward Pruning to balance local spatial aesthetics and global temporal coherence using short-term and sliding-window long-term evaluations, and Stream-Scaled Memory Sinking to route evicted KV-cache context based on reward feedback. The framework is evaluated on 5s and 30s video benchmarks, claiming profound superiority in temporal consistency, motion smoothness, and frame-level visual quality.

Significance. Should the quantitative results support the claims, this could represent a meaningful contribution to efficient test-time scaling in video diffusion models by exploiting the streaming paradigm to achieve better temporal control with lower overhead. The engineering of the three units to address specific bottlenecks in candidate exploration and temporal guidance is a targeted approach.

major comments (2)
  1. [Abstract] The abstract states that Stream-T1 'demonstrates profound superiority' on benchmarks but provides no quantitative numbers, error bars, baseline comparisons, or details on how candidates are generated and scored. This makes the central performance claim unverifiable from the given information and is load-bearing for the paper's main assertion.
  2. [Proposed Units] The integration of the three units is asserted to enable stable low-overhead TTS without new instabilities, but no analysis, ablations, or discussion of potential error accumulation (e.g., in noise propagation over 30s sequences or reward feedback loops) is evident. This is load-bearing for the claim that chunk-level synthesis is intrinsically suited for TTS.
minor comments (1)
  1. [Abstract] Inconsistent hyphenation in 'Stream -Scaled' (space before hyphen in some places); should be standardized to 'Stream-Scaled' throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] The abstract states that Stream-T1 'demonstrates profound superiority' on benchmarks but provides no quantitative numbers, error bars, baseline comparisons, or details on how candidates are generated and scored. This makes the central performance claim unverifiable from the given information and is load-bearing for the paper's main assertion.

    Authors: We agree that the abstract would benefit from greater specificity to make the performance claims more immediately verifiable. In the revised manuscript, we have updated the abstract to include concise quantitative highlights drawn from our experimental results (e.g., relative gains in temporal consistency and visual quality metrics versus baselines), while preserving brevity. Full details on candidate generation, scoring, error bars, and baseline comparisons are already present in Sections 3 and 4; the abstract revision simply surfaces the most salient numbers for readers. revision: yes

  2. Referee: [Proposed Units] The integration of the three units is asserted to enable stable low-overhead TTS without new instabilities, but no analysis, ablations, or discussion of potential error accumulation (e.g., in noise propagation over 30s sequences or reward feedback loops) is evident. This is load-bearing for the claim that chunk-level synthesis is intrinsically suited for TTS.

    Authors: We acknowledge that an explicit analysis of stability and error accumulation would strengthen the argument that chunk-level synthesis is particularly well-suited for test-time scaling. In the revised manuscript we have added a new subsection (with accompanying ablations and figures) that examines noise propagation drift and reward-feedback loop behavior across 5 s and 30 s sequences. The added experiments show that the memory-sinking mechanism prevents measurable accumulation of errors, thereby supporting the original claim without introducing new instabilities. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical engineering framework with no derivation chain

full rationale

The paper introduces Stream-T1 as a test-time scaling framework for streaming video generation, consisting of three descriptive units (noise propagation, reward pruning, memory sinking) motivated by the suitability of chunk-level synthesis. No equations, first-principles derivations, fitted parameters, or predictions are presented that could reduce to inputs by construction. Central claims rest on empirical benchmark evaluations (5s/30s videos) rather than any self-referential logic, self-citation load-bearing, or ansatz smuggling. The contribution is therefore self-contained as an engineering proposal without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no explicit mathematical derivations, fitted constants, or new postulated entities; the contribution is described entirely at the level of algorithmic components and empirical claims.

pith-pipeline@v0.9.0 · 5582 in / 1087 out tokens · 36010 ms · 2026-05-08T18:11:01.404507+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 37 canonical work pages · 16 internal anchors

  1. [1]

    Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms

    Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, et al. Reasoning on a budget: A survey of adaptive and controllable test-time compute in llms.arXiv preprint arXiv:2507.02076, 2025

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

  4. [4]

    Autoregressive Video Generation Without Vector Quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization.arXiv preprintarXiv:2412.14169, 2024

  5. [5]

    Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

    Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, and Long Chen. Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing.arXiv preprint arXiv:2411.16375, 2024

  6. [6]

    Long-context autoregressive video modeling with next-frame prediction

    Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325, 2025

  7. [7]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081): 633–638, 2025

  8. [8]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  9. [9]

    Scaling image and video generation via test-time evolutionary search

    Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, and Ling Pan. Scaling image and video generation via test-time evolutionary search.arXiv preprint arXiv:2505.17618, 2025

  10. [10]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  11. [11]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  12. [12]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  13. [13]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  14. [14]

    VideoAR: Autoregressive video generation via next-frame & scale prediction

    Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. Videoar: Autoregressive video generation via next-frame & scale prediction.arXiv preprint arXiv:2601.05966, 2026

  15. [15]

    Compositional image synthesis with inference-time scaling

    Minsuk Ji, Sanghyeok Lee, and Namhyuk Ahn. Compositional image synthesis with inference-time scaling. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4441–4445. IEEE, 2026

  16. [16]

    VideoPoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

  17. [17]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  18. [18]

    Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    Haodong Li, Shaoteng Liu, Zhe Lin, and Manmohan Chandraker. Rolling sink: Bridging limited-horizon training and open-ended testing in autoregressive video diffusion.arXiv preprint arXiv:2602.07775, 2026

  19. [19]

    Reflect-dit: Inference-time scaling for text-to-image diffusion transformers via in-context reflection

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-dit: Inference-time scaling for text-to-image diffusion transformers via in-context reflection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15657–15668, 2025

  20. [20]

    Arlon: Boosting diffusion transformers with autoregressive models for long video generation.arXiv preprint arXiv:2410.20502, 2024

    Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation.arXiv preprint arXiv:2410.20502, 2024

  21. [21]

    Video-t1: Test-time scaling for video generation

    Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, and Yueqi Duan. Video-t1: Test-time scaling for video generation. InProceedings of the IEEE/CVF international conference on computer vision, pages 18671–18681, 2025

  22. [22]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

  23. [23]

    Rolling Forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

  24. [24]

    Can 1B LLM surpass 405B LLM? Rethinking compute-optimal test-time scaling

    Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling.arXiv preprint arXiv:2502.06703, 2025

  25. [25]

    Reward Forcing: Efficient streaming video generation with rewarded distribution matching distillation

    Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025

  26. [26]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  27. [27]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20286–20332, 2025

  28. [28]

    Inference-time text-to-video alignment with diffusion latent beam search

    Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta. Inference-time text-to-video alignment with diffusion latent beam search.arXiv preprint arXiv:2501.19252, 2025

  29. [29]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  30. [30]

    Hierarchical spatio-temporal decoupling for text-to-video generation

    Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hierarchical spatio-temporal decoupling for text-to-video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6635–6645, 2024

  31. [31]

    Test-time scaling of diffusion models via noise trajectory search

    Vignav Ramesh and Morteza Mardani. Test-time scaling of diffusion models via noise trajectory search.arXiv preprint arXiv:2506.03164, 2025

  32. [32]

    Next Block Prediction: Video generation via semi-autoregressive modeling

    Shuhuai Ren, Shuming Ma, Xu Sun, and Furu Wei. Next block prediction: Video generation via semi-autoregressive modeling. arXiv preprint arXiv:2502.07737, 2025

  33. [33]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

  34. [34]

    A general framework for inference-time scaling and steering of diffusion models

    Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848, 2025

  35. [35]

    Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

  37. [37]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 3(4):6, 2025

  38. [38]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023

  39. [39]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  40. [40]

    Art-v: Auto-regressive text-to-video generation with diffusion models

    Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, et al. Art-v: Auto-regressive text-to-video generation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024

  41. [41]

    Imagerysearch: Adaptive test-time search for video generation beyond semantic dependency constraints

    Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, and Kaiqi Huang. Imagerysearch: Adaptive test-time search for video generation beyond semantic dependency constraints. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 10700–10708, 2026

  42. [42]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  43. [43]

    Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation

    Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11269–11277, 2026

  44. [44]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

  45. [45]

    LongLive: Real-time Interactive Long Video Generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

  46. [46]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  47. [47]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025

  48. [48]

    VideoMAR: Autoregressive video generation with continuous tokens

    Hu Yu, Biao Gong, Hangjie Yuan, DanDan Zheng, Weilong Chai, Jingdong Chen, Kecheng Zheng, and Feng Zhao. Videomar: Autoregressive video generation with continuous tokens. arXiv preprint arXiv:2506.14168, 2025

  49. [49]

    Lumos-1: On autoregressive video generation with discrete diffusion from a unified model perspective

    Hangjie Yuan, Weihua Chen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, et al. Lumos-1: On autoregressive video generation with discrete diffusion from a unified model perspective. In The Fourteenth International Conference on Learning Representations

  50. [50]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025

  51. [51]

    Show-1: Marrying pixel and latent diffusion models for text-to-video generation

    David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation.International Journal of Computer Vision, 133(4):1879–1893, 2025

  52. [52]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. What, how, where, and how well? a survey on test-time scaling in large language models. arXiv preprint arXiv:2503.24235, 2025

  53. [53]

    Learning multi-dimensional human preference for text-to-image generation

    Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024

  54. [54]

    LatSearch: Latent reward-guided search for faster inference-time scaling in video diffusion

    Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang, Jifei Song, Jiankang Deng, and Ioannis Patras. Latsearch: Latent reward-guided search for faster inference-time scaling in video diffusion, 2026. URL https://arxiv.org/abs/2603.14526

  55. [55]

    MagicVideo: Efficient video generation with latent diffusion models

    Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models.arXiv preprint arXiv:2211.11018, 2022

  56. [56]

    Golden noise for diffusion models: A learning framework

    Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17688–17697, 2025

  57. [57]

    From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning

    Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, and Hongsheng Li. From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15329–15339, 2025