SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

Haoxue Wu; Jie Cao; Jie Ma; Jing Li; Jun Liang; Yang Han; Zhan Peng; Zhida Zhang

arxiv: 2605.27891 · v1 · pith:OTVORMSZnew · submitted 2026-05-27 · 💻 cs.CV · cs.AI

Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

Pith reviewed 2026-06-29 13:44 UTC · model grok-4.3

keywords: video generation · keyframe conditioning · narrative pacing · multi-shot synthesis · cinematic video · generative models · temporal control

The pith

SmartDirector generates videos with controlled narrative pacing and structure by conditioning on multiple provided keyframes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current video generation methods depend on sparse inputs such as text prompts or single start and end frames, which restricts fine control over how a story unfolds over time. SmartDirector addresses this by accepting multiple keyframes to guide both visual content and temporal pacing across shots. It uses a two-stage approach: first generating a low-resolution video from the keyframes, then refining it with high-resolution keyframes for detail. The system is trained on sequences extracted from movies to handle single-shot, multi-shot, and extension tasks, and experiments show better performance than prior approaches.

Core claim

SmartDirector is a framework that conditions video generation models on multiple keyframes to improve narrative quality and temporal pacing control. It consists of Director-Gen, which produces low-resolution videos from the keyframes, and Director-SR, which refines them using high-resolution keyframes as anchors. Training relies on a data pipeline that extracts single-shot and multi-shot sequences from movies, enabling scenarios like single-shot generation, multi-shot narrative synthesis, and video extension. Experiments indicate that this approach substantially outperforms existing state-of-the-art methods.

What carries the argument

The two-stage Director-Gen and Director-SR pipeline that generates and refines video conditioned on multiple keyframes, trained via a movie curation data pipeline.
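
The data flow between the two stages can be sketched as follows. This is an illustrative stub under stated assumptions, not the authors' implementation: `director_gen_stub` and `director_sr_stub` are hypothetical names, linear blending stands in for the generative model, and nearest-neighbour upsampling stands in for the learned refinement. The only parts taken from the paper are the order of the stages and the reuse of the high-resolution keyframes as anchors in the second stage.

```python
from typing import Dict, List

Frame = List[List[float]]  # tiny grayscale image: rows of pixel values


def downsample(frame: Frame, factor: int) -> Frame:
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in frame[::factor]]


def upsample(frame: Frame, factor: int) -> Frame:
    """Nearest-neighbour upsampling: repeat each pixel and each row."""
    return [[p for p in row for _ in range(factor)]
            for row in frame for _ in range(factor)]


def director_gen_stub(keyframes_lr: Dict[int, Frame], num_frames: int) -> List[Frame]:
    """Stand-in for stage 1 (assumption): blend linearly between neighbouring keyframes."""
    idx = sorted(keyframes_lr)
    video = []
    for t in range(num_frames):
        prev = max((i for i in idx if i <= t), default=idx[0])
        nxt = min((i for i in idx if i >= t), default=idx[-1])
        w = 0.0 if nxt == prev else (t - prev) / (nxt - prev)
        a, b = keyframes_lr[prev], keyframes_lr[nxt]
        video.append([[(1 - w) * pa + w * pb for pa, pb in zip(ra, rb)]
                      for ra, rb in zip(a, b)])
    return video


def director_sr_stub(video_lr: List[Frame], keyframes_hr: Dict[int, Frame],
                     factor: int) -> List[Frame]:
    """Stand-in for stage 2 (assumption): upsample, then pin anchor frames to the HR keyframes."""
    video_hr = [upsample(f, factor) for f in video_lr]
    for t, frame in keyframes_hr.items():
        video_hr[t] = [row[:] for row in frame]
    return video_hr


# Keyframe timestamps set the pacing: a fast change (frames 0 to 2), then a slow one (2 to 8).
hr_keys = {
    0: [[0.0] * 4 for _ in range(4)],
    2: [[1.0] * 4 for _ in range(4)],
    8: [[0.0] * 4 for _ in range(4)],
}
lr_keys = {t: downsample(f, 2) for t, f in hr_keys.items()}
low = director_gen_stub(lr_keys, num_frames=9)
high = director_sr_stub(low, hr_keys, factor=2)
```

In this toy version the spacing of the keyframe indices is the only pacing control: moving the middle keyframe from index 2 to index 6 would slow the first transition and speed up the second.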

If this is right

Enables single-shot generation with precise keyframe control.
Supports multi-shot narrative synthesis across different scenes.
Allows extension of existing videos while preserving pacing.
Achieves superior performance over state-of-the-art video generation techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Combining this keyframe method with language models could automate storyboarding from scripts.
The approach may extend to interactive video editing tools where users adjust keyframes in real time.
Longer video generation could benefit if the pacing control scales without accumulating errors.

Load-bearing premise

The data pipeline that curates single-shot and multi-shot sequences from movies provides sufficiently robust and unbiased training data for multi-keyframe conditioning without introducing artifacts in narrative pacing.
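
To make the premise concrete, here is a minimal sketch of the kind of step such a curation pipeline needs: cutting a frame sequence into shots. It uses a plain histogram-difference detector for hard cuts, which is a generic baseline chosen for illustration. The abstract does not say which shot-boundary method the authors use, and the function names, bin count, and threshold below are assumptions.

```python
from typing import List, Tuple

Pixels = List[int]  # flattened grayscale frame, values in 0..255


def histogram(frame: Pixels, bins: int = 8) -> List[float]:
    """Normalised intensity histogram of one frame."""
    counts = [0] * bins
    for p in frame:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def detect_cuts(frames: List[Pixels], threshold: float = 0.5) -> List[int]:
    """Return indices t where frame t starts a new shot (hard cuts only)."""
    cuts = []
    prev = histogram(frames[0])
    for t in range(1, len(frames)):
        cur = histogram(frames[t])
        # Total variation distance between consecutive histograms, in [0, 1].
        dist = 0.5 * sum(abs(a - b) for a, b in zip(prev, cur))
        if dist > threshold:
            cuts.append(t)
        prev = cur
    return cuts


def split_shots(num_frames: int, cuts: List[int]) -> List[Tuple[int, int]]:
    """Turn cut indices into half-open (start, end) shot ranges."""
    bounds = [0] + cuts + [num_frames]
    return list(zip(bounds, bounds[1:]))


# Synthetic example: three dark frames followed by two bright frames.
dark, bright = [10] * 16, [240] * 16
frames = [dark, dark, dark, bright, bright]
cuts = detect_cuts(frames)
shots = split_shots(len(frames), cuts)
```

A detector this simple misses dissolves and fades and can fire on flashes, which is exactly the kind of curation error that could leak into learned pacing.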

What would settle it

If side-by-side comparisons on narrative coherence metrics show no advantage for multi-keyframe conditioning over single-frame baselines, the benefit of the method would be called into question.
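
One way to run such a side-by-side comparison is a paired permutation test over per-clip scores. The sketch below is an assumption about how the check could be set up, not a procedure from the paper: the metric, the function name, and the score values are hypothetical placeholders, not reported results.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_permutation_pvalue(multi: List[float], single: List[float],
                              n_resamples: int = 10000, seed: int = 0) -> Tuple[float, float]:
    """One-sided test of mean(multi - single) > 0 using random sign flips.

    Under the null hypothesis the two methods are exchangeable per clip,
    so each paired difference is equally likely to carry either sign.
    """
    diffs = [m - s for m, s in zip(multi, single)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = mean([d if rng.random() < 0.5 else -d for d in diffs])
        if flipped >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


# Hypothetical per-clip coherence scores (made-up numbers for illustration only).
multi_keyframe = [0.71, 0.68, 0.75, 0.80, 0.66, 0.73, 0.79, 0.70, 0.74, 0.77]
single_frame = [0.65, 0.66, 0.70, 0.72, 0.64, 0.69, 0.71, 0.68, 0.70, 0.72]
gain, p_value = paired_permutation_pvalue(multi_keyframe, single_frame)
```

A large p-value on real scores would be the "no advantage" outcome described above. Pairing by clip matters because scores vary much more between clips than between methods.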

Original abstract

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SmartDirector's two-stage multi-keyframe setup plus movie curation pipeline is the actual addition, but the abstract's outperformance claim has no supporting numbers or ablations to evaluate.

The paper introduces a two-stage model where Director-Gen produces low-resolution video from multiple keyframes and Director-SR then refines it using high-resolution versions of those same frames as anchors. It also describes a data pipeline that extracts single-shot and multi-shot sequences from movies to train for narrative pacing control, and it lists three use cases: single-shot generation, multi-shot synthesis, and video extension.

That combination of explicit multi-keyframe conditioning with a movie-derived training set is the concrete step beyond the sparse text-or-endpoint baselines mentioned. The architecture itself is described in enough detail to be reproducible in principle, and the flexible scenario support is a practical plus for anyone trying to steer longer narrative clips.

The main gap is that the abstract asserts substantial outperformance over SOTA without any metrics, baseline comparisons, or ablation results. The stress-test note correctly flags the curation pipeline as the load-bearing assumption; if shot-boundary detection or pacing consistency checks are not rigorous, the reported gains could simply reflect training distribution artifacts rather than better conditioning. No equations or derivations appear that would let a reader derive the improvement independently.

This is for researchers already working on controllable video generation for film or storytelling applications. A reader who needs a new conditioning trick to try on top of existing diffusion or autoregressive backbones could extract value from the architecture description alone.

I would send it to peer review. The core technical choices are clear enough that referees can check whether the experiments actually isolate the contribution of the keyframes and the data pipeline.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SmartDirector, a two-stage framework for keyframe-conditioned cinematic video generation. Director-Gen produces low-resolution videos from multiple input keyframes to control narrative structure and pacing; Director-SR then refines the output by using high-resolution keyframes as semantic anchors. The method supports single-shot generation, multi-shot narrative synthesis, and video extension. Training relies on a custom data pipeline that extracts single-shot and multi-shot sequences from movies. The abstract asserts that extensive experiments show substantial outperformance over existing state-of-the-art approaches.

Significance. If the empirical claims are substantiated with quantitative metrics, ablations, and controls, the work would address a recognized limitation of current video diffusion models—the lack of precise temporal and narrative control beyond sparse signals such as text or endpoint frames. The two-stage design and explicit support for multi-keyframe conditioning represent a practical engineering contribution. The stated intention to release code is a positive factor for reproducibility.

major comments (2)

Abstract and §3 (Data Pipeline): the central claim that SmartDirector learns genuine narrative pacing control from curated movie sequences rests on the unexamined assumption that the curation process supplies unbiased, pacing-consistent multi-keyframe examples. No description is supplied of keyframe selection criteria, shot-boundary detection method, or any consistency checks; without these details or an ablation isolating the curation step, the reported gains cannot be attributed to the conditioning mechanism rather than training-distribution artifacts.
Abstract: the assertion that SmartDirector 'substantially outperforms existing state-of-the-art approaches' is presented without any quantitative metrics, baseline comparisons, ablation tables, or error analysis. Because the headline claim is empirical, the absence of these results in the provided text renders the central contribution unverifiable.

minor comments (1)

Abstract: the sentence 'We will release the code' should be accompanied by a concrete statement of availability (e.g., GitHub link or supplementary material) to allow reviewers to assess reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to improve clarity and substantiation of claims.

Referee: Abstract and §3 (Data Pipeline): the central claim that SmartDirector learns genuine narrative pacing control from curated movie sequences rests on the unexamined assumption that the curation process supplies unbiased, pacing-consistent multi-keyframe examples. No description is supplied of keyframe selection criteria, shot-boundary detection method, or any consistency checks; without these details or an ablation isolating the curation step, the reported gains cannot be attributed to the conditioning mechanism rather than training-distribution artifacts.

Authors: We agree that §3 provides insufficient detail on the data curation process. In the revised manuscript we will expand the description of the data pipeline to specify the shot-boundary detection algorithm, the exact keyframe selection criteria (including pacing consistency checks), and any filtering steps applied. We will also add an ablation that isolates the contribution of the curated multi-shot sequences versus a simpler random sampling baseline, allowing readers to attribute performance gains more precisely. revision: yes
Referee: Abstract: the assertion that SmartDirector 'substantially outperforms existing state-of-the-art approaches' is presented without any quantitative metrics, baseline comparisons, ablation tables, or error analysis. Because the headline claim is empirical, the absence of these results in the provided text renders the central contribution unverifiable.

Authors: The full manuscript contains quantitative results, baseline comparisons, and ablation studies in the Experiments section. However, the abstract as written does not reference these metrics. We will revise the abstract to include a concise summary of the key quantitative improvements (e.g., FID, FVD, and user-study scores against listed baselines) and will ensure all supporting tables and error analyses are clearly cross-referenced from the abstract. revision: yes
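
For readers unfamiliar with the metrics named in this response: FID and FVD are both Fréchet distances between Gaussians fitted to feature embeddings of real and generated samples, d² = ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). The sketch below computes the special case with diagonal covariances, where the trace term reduces to a per-dimension sum. That simplification, the function names, and the toy feature vectors are assumptions for illustration; the standard metrics use full covariance matrices of network features, and nothing here reproduces the paper's evaluation.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Tuple


def fit_diagonal_gaussian(features: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and sample variance of a set of feature vectors."""
    dims = list(zip(*features))
    return [mean(d) for d in dims], [variance(d) for d in dims]


def frechet_distance_diag(mu1: List[float], var1: List[float],
                          mu2: List[float], var2: List[float]) -> float:
    """Squared Fréchet distance between two Gaussians with diagonal covariance."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2 * sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


# Toy feature vectors (placeholders): the second set is the first shifted by 3 in dimension 0.
real_feats = [[0.0, 0.0], [2.0, 4.0]]
fake_feats = [[3.0, 0.0], [5.0, 4.0]]
distance = frechet_distance_diag(*fit_diagonal_gaussian(real_feats),
                                 *fit_diagonal_gaussian(fake_feats))
```

Lower is better, and identical feature distributions give zero. Because the score depends on the embedding network and the sample size, comparisons are only meaningful when baselines are evaluated under the same protocol.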

Circularity Check

0 steps flagged

No circularity: framework and data pipeline are externally validated by experiments

The paper presents a two-stage generative framework (Director-Gen + Director-SR) and a movie-derived data curation pipeline, with performance claims resting on experimental comparisons to SOTA methods. No equations, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations appear in the abstract or described structure. The central claims are not forced by construction from inputs; they depend on external empirical results and are therefore self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full manuscript would be needed to audit training losses, architectural choices, or data assumptions.
