CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Dacheng Tao; Jason Li; Jiangning Zhang; Lizhuang Ma; Qianyu Zhou; Qingdong He; Teng Hu; Yuheng Chen; Yuji Wang; Zhucun Xue

arxiv: 2606.09639 · v2 · pith:IJ6YYVJVnew · submitted 2026-06-08 · 💻 cs.CV


Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Jason Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao


Pith reviewed 2026-06-27 17:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords CineDance-1M · multi-shot video generation · text-to-audio-video · dataset curation · long-form cinematic generation · audio-video alignment · narrative parsing


The pith

CineDance-1M dataset built with film-theory parsing and dual captioning trains models for aligned multi-shot audio-video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the shortage of suitable training data that has held back open-source video models from producing long cinematic sequences with sound. It creates CineDance-1M, a collection of roughly one million clips whose average length is 92.8 seconds and that contain 24.2 continuous shots each. The data are assembled through sourcing and cleaning, narrative structure extraction drawn from film theory, and layered captioning that covers both picture and audio. An existing model is then retrained on the new set and evaluated with a custom benchmark that checks narrative quality across six human-aligned dimensions. If the approach succeeds, open-source systems could generate extended, story-coherent audio-video material without relying on proprietary collections.
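The headline statistics imply a few derived quantities. The sketch below is back-of-the-envelope arithmetic only: it treats "roughly one million" as exactly 1,000,000 clips (an assumption), and the per-shot figure is a ratio of the two reported averages, not a per-clip statistic reported by the paper.

```python
# Back-of-the-envelope arithmetic from the reported dataset statistics.
# ASSUMPTION: "roughly one million clips" is taken as exactly 1_000_000.
NUM_CLIPS = 1_000_000
MEAN_CLIP_SECONDS = 92.8    # reported average clip length
MEAN_SHOTS_PER_CLIP = 24.2  # reported average shots per clip


def implied_mean_shot_seconds(clip_seconds: float, shots: float) -> float:
    """Ratio of averages; not the same as the mean of per-clip ratios."""
    return clip_seconds / shots


def implied_total_hours(num_clips: int, clip_seconds: float) -> float:
    """Total footage implied by clip count times mean duration."""
    return num_clips * clip_seconds / 3600.0


shot_seconds = implied_mean_shot_seconds(MEAN_CLIP_SECONDS, MEAN_SHOTS_PER_CLIP)
total_hours = implied_total_hours(NUM_CLIPS, MEAN_CLIP_SECONDS)
total_shots = NUM_CLIPS * MEAN_SHOTS_PER_CLIP

print(f"implied mean shot length: {shot_seconds:.2f} s")
print(f"implied total footage:    {total_hours:,.0f} h")
print(f"implied total shots:      {total_shots:,.0f}")
```

Under these assumptions a typical shot lasts a little under four seconds and the collection amounts to tens of thousands of hours of footage, which gives a sense of the scale any comparison dataset would need to match.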

Core claim

CineDance-1M supplies structured, dual-modal annotations for multi-shot long-form text-to-audio-video generation; its three-stage pipeline of diverse sourcing with cleansing, film-theory-inspired narrative parsing, and hierarchical dual-modal captioning yields clips that let an adapted LTX-2.3 model reach high single-modality fidelity, tight audio-video synchronization, and stable subject and scene consistency across shots.

What carries the argument

The three-stage curation pipeline that assembles CineDance-1M by combining broad video sourcing, narrative parsing informed by film theory, and hierarchical captioning for both audio and video tracks.
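To make "hierarchical captioning for both audio and video tracks" concrete, here is a minimal sketch of what one annotated clip could look like. The schema is hypothetical: every class name, field name, and the prompt layout are assumptions made for illustration, and the paper's actual annotation format may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ShotAnnotation:
    """One shot inside a clip. All field names are hypothetical."""
    start_s: float
    end_s: float
    video_caption: str
    audio_caption: str

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


@dataclass
class ClipAnnotation:
    """Clip-level record: global captions plus per-shot captions."""
    clip_id: str
    global_video_caption: str
    global_audio_caption: str
    shots: List[ShotAnnotation] = field(default_factory=list)

    def validate(self) -> None:
        """Check that shots exist, have positive length, and are contiguous."""
        if not self.shots:
            raise ValueError("clip has no shots")
        for shot in self.shots:
            if shot.duration_s <= 0:
                raise ValueError("shot has non-positive duration")
        for prev, cur in zip(self.shots, self.shots[1:]):
            if abs(cur.start_s - prev.end_s) > 1e-6:
                raise ValueError("shots are not contiguous")

    def to_prompt(self, include_audio: bool = True) -> str:
        """Flatten the two-level hierarchy into one text prompt."""
        lines = [f"[video] {self.global_video_caption}"]
        if include_audio:
            lines.append(f"[audio] {self.global_audio_caption}")
        for i, shot in enumerate(self.shots, start=1):
            lines.append(f"[shot {i} video] {shot.video_caption}")
            if include_audio:
                lines.append(f"[shot {i} audio] {shot.audio_caption}")
        return "\n".join(lines)


# Made-up example content, for illustration only.
clip = ClipAnnotation(
    clip_id="example-0001",
    global_video_caption="A chef prepares a meal in a small kitchen.",
    global_audio_caption="Kitchen ambience with sizzling and soft music.",
    shots=[
        ShotAnnotation(0.0, 3.5, "Wide shot of the kitchen.", "Low hum of a fridge."),
        ShotAnnotation(3.5, 7.0, "Close-up of a pan on the stove.", "Oil sizzling."),
    ],
)
clip.validate()
print(clip.to_prompt())
```

Keeping the global and per-shot levels separate lets the same record be rendered at different granularities, for example video-only prompts via `include_audio=False`.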

If this is right

Models trained on the dataset can sustain subject and environment consistency across dozens of shots in a single generated clip.
The structured annotations allow joint optimization of audio and video so that sound events match on-screen actions in multi-shot sequences.
CineBench supplies a repeatable six-dimensional metric set that future work can use to compare narrative audio-video systems.
Open-source generators become viable for story-length cinematic output once they are trained on data of this structural complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The curation steps could be reused to enlarge existing video datasets with narrative-level labels rather than only short-clip descriptions.
If the film-theory parsing step proves general, similar pipelines might improve training data for other sequential generation tasks such as music or dialogue synthesis.
Longer average clip length in the dataset may encourage models to learn temporal planning that current short-clip training does not provide.

Load-bearing premise

The three-stage curation pipeline produces annotations and data quality superior to prior datasets for training multi-shot long-form joint audio-video models.

What would settle it

Training the same base model on CineDance-1M versus an existing dataset and finding no measurable gain in the six CineBench dimensions for alignment, narrative coherence, or subject consistency would undermine the claim that the new curation method is superior.
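The comparison described above can be sketched as a per-dimension difference between two training runs. Everything here is illustrative: the source does not name the six CineBench dimensions, so the code uses placeholder names; it assumes higher scores are better; the threshold and the scores are made up and are not results from the paper.

```python
from typing import Dict, List

# HYPOTHETICAL placeholders: the source does not name the six dimensions.
DIMENSIONS: List[str] = [f"dim_{i}" for i in range(1, 7)]


def per_dimension_gain(new: Dict[str, float],
                       baseline: Dict[str, float]) -> Dict[str, float]:
    """Score difference (new minus baseline) for every dimension."""
    missing = [d for d in DIMENSIONS if d not in new or d not in baseline]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    return {d: new[d] - baseline[d] for d in DIMENSIONS}


def claim_undermined(gains: Dict[str, float], min_gain: float = 0.01) -> bool:
    """True when no dimension improves by at least `min_gain`.

    ASSUMPTION: higher is better and `min_gain` is an arbitrary threshold.
    """
    return all(g < min_gain for g in gains.values())


# MADE-UP numbers for illustration only; not results from the paper.
baseline_scores = {d: 0.50 for d in DIMENSIONS}
new_scores = dict(baseline_scores)
new_scores["dim_3"] = 0.58

gains = per_dimension_gain(new_scores, baseline_scores)
print(gains)
print("claim undermined:", claim_undermined(gains))
```

A real test would also need several seeds per run and a significance check, since a single pair of scores cannot separate a true gain from run-to-run noise.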

Figures

Figures reproduced from arXiv: 2606.09639 by Dacheng Tao, Jason Li, Jiangning Zhang, Lizhuang Ma, Qianyu Zhou, Qingdong He, Teng Hu, Yuheng Chen, Yuji Wang, Zhucun Xue.

**Figure 1.** CineDance-1M features 1M unprecedented long-form (92.8 s) and multi-shot (24.2 shots) audio-video sequences (above), paired with hierarchical structured captions for both modalities. Compared with typical Text-To-Video (T2V) datasets, it encompasses diverse narrative structures (below), meeting the growing demand for cinematic, narrative-driven joint generation.

**Figure 2.** Diagnostic examples illustrating two core chal…

**Figure 3.** (a) Representative T2V and joint audio-video…

**Figure 4.** CineDance-1M curation pipeline consists of three main stages: …

**Figure 5.** Statistical overview of the CineDance-1M dataset across multiple dimensions.

**Figure 6.** Comprehensive statistical overview of CineBench, illustrating its diverse taxonomic flow, rigorous quality…

**Figure 7.** Self-attention map visualization under the same…

**Figure 8.** Overall training schedule. 1) The background color shows the data-driven curriculum. 2) The three curves summarize DARC: the visual scaffold strength η_v decreases, the temporal-switch probability q(η_t) increases, and the reference-dropping probability p_drop is activated in the late stage.

**Figure 9.** Qualitative comparison of CineDance with baseline models.

**Figure 10.** Human alignment of the CineBench automatic…

Original abstract

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems showremarkableabilitytogeneratecinematicnarratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CineDance-1M supplies a new open dataset for multi-shot long-form T2AV with film-theory annotations and a matching benchmark, but the abstract gives no ablations or dataset comparisons to back the claim that the three-stage pipeline is what drives better results.


The main contribution here is the CineDance-1M dataset itself: roughly a million clips averaging 93 seconds and 24 shots, annotated jointly for audio and video through sourcing, film-theory narrative parsing, and hierarchical captioning. They also release CineBench with a six-dimensional evaluation setup and show an adapted LTX-2.3 model that reportedly handles alignment and consistency on their test prompts.

Releasing a large, structured, open T2AV resource focused on longer narrative sequences fills a practical gap that shorter-clip datasets leave open. The film-theory angle in the parsing step is a concrete design choice that could help with shot-level structure.

The soft spot is exactly what the stress-test flags: the abstract attributes strong single-modality quality and alignment to the curation pipeline, yet reports no ablations that remove individual stages and no head-to-head numbers against prior T2AV collections. Without those controls it is impossible to tell whether the gains come from the new data, the adaptation recipe, or prompt choices. If the full paper contains those experiments the concern shrinks; if not, the central claim stays untested.
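The missing controls can be laid out as a small leave-one-out grid over the three pipeline stages. This is a sketch of the experimental design only, not something the paper reports. The stage identifiers paraphrase the abstract and are hypothetical, and "disabling" a stage would in practice mean swapping it for a simpler baseline (for instance flat captions instead of hierarchical ones), which is an assumption here.

```python
from typing import Dict, List

# Stage names paraphrase the abstract; the identifiers are hypothetical.
STAGES: List[str] = [
    "sourcing_cleansing",
    "narrative_parsing",
    "dual_modal_captioning",
]


def ablation_grid(stages: List[str]) -> List[Dict[str, bool]]:
    """Return the full pipeline followed by every leave-one-out variant."""
    grid = [{s: True for s in stages}]  # full pipeline as reference
    for removed in stages:
        grid.append({s: s != removed for s in stages})
    return grid


grid = ablation_grid(STAGES)
for config in grid:
    enabled = [name for name, on in config.items() if on]
    print(f"{len(enabled)} stages enabled: {enabled}")
```

Training the same base model once per configuration, with the adaptation recipe and prompts held fixed, is what would separate the contribution of the data pipeline from the other two factors named above.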

This is useful reading for groups training open generative video models who need longer, multi-shot training material. The dataset and benchmark are concrete enough to justify sending the paper to referees, even if the validation section will need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CineDance-1M, a 1M-scale Text-to-Audio-Video dataset for multi-shot long-form generation, averaging 92.8 seconds and 24.2 shots per video. It is created using a three-stage curation pipeline of diverse sourcing and cleansing, film-theory-inspired narrative parsing, and hierarchical dual-modal captioning. The paper also introduces CineBench, a benchmark with diverse prompts and a six-dimensional human-aligned metric system. An adaptation of LTX-2.3 called CineDance is presented, which is claimed to achieve exceptional single-modality quality, precise audio-video alignment, and robust consistency, thereby validating the dataset and pipeline.

Significance. If the claims regarding the dataset quality and model performance hold, this would represent a notable contribution to the field by addressing the scarcity of high-quality training data for complex cinematic audio-video generation. The structured annotations and the proposed benchmark could facilitate more rigorous evaluation and training of open-source models in this area.

major comments (2)

Abstract: The central claim that the three-stage curation pipeline produces data enabling 'exceptional' performance in the adapted CineDance model is not supported by any reported ablations, quantitative comparisons to prior T2AV datasets, or baseline results. Without such evidence, it is unclear whether the reported performance stems from the pipeline, the base model adaptation, or other factors.
Abstract: No specific quantitative results, ablation studies, or direct comparisons are provided to verify the superiority of CineDance-1M over existing datasets or the contribution of each pipeline stage.

minor comments (2)

Abstract: There is a typographical error: 'showremarkableabilitytogeneratecinematicnarratives' should include spaces as 'show remarkable ability to generate cinematic narratives'.
Abstract: The description of CineBench's 'six-dimensional, human-aligned metric system' lacks details on what the dimensions are or how they are computed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support in the abstract. We address each major comment below and commit to revisions that add the requested evidence without overstating current results.


Referee: [—] Abstract: The central claim that the three-stage curation pipeline produces data enabling 'exceptional' performance in the adapted CineDance model is not supported by any reported ablations, quantitative comparisons to prior T2AV datasets, or baseline results. Without such evidence, it is unclear whether the reported performance stems from the pipeline, the base model adaptation, or other factors.

Authors: We agree that the abstract's phrasing requires supporting evidence to be fully substantiated. The manuscript reports qualitative and quantitative evaluations of CineDance on CineBench in Section 4, including consistency and alignment metrics relative to the base LTX-2.3 model. To directly address the concern, we will add explicit ablation studies isolating pipeline stages and quantitative comparisons against prior T2AV datasets in the revised manuscript, clarifying the sources of performance gains. revision: yes
Referee: [—] Abstract: No specific quantitative results, ablation studies, or direct comparisons are provided to verify the superiority of CineDance-1M over existing datasets or the contribution of each pipeline stage.

Authors: We acknowledge the absence of these elements in the current version. The abstract summarizes findings from the experimental section, but we will revise it to reference specific quantitative results and will incorporate ablation studies on each curation stage plus direct comparisons to existing datasets in the main text to demonstrate superiority and stage contributions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset curation and benchmark contribution


The paper presents CineDance-1M as a new T2AV dataset created via a three-stage curation pipeline, introduces CineBench, and reports that an adapted LTX-2.3 model (CineDance) performs well on it. No equations, fitted parameters, or predictions appear in the provided text. The central claim that the pipeline yields superior data is an empirical assertion supported only by the model's reported performance, without ablations or external comparisons, but this is a standard evidence gap rather than a logical reduction to self-definition, self-citation, or renaming. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The work is self-contained as a data and benchmark release with no derivation chain that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on domain assumptions about what constitutes high-quality cinematic training data and the effectiveness of film-theory parsing for narrative structure.

axioms (1)

domain assumption: Film-theory-inspired narrative parsing produces annotations that improve training for multi-shot long-form generation.
Invoked as part of the three-stage curation pipeline in the abstract.

pith-pipeline@v0.9.1-grok · 5822 in / 1187 out tokens · 33450 ms · 2026-06-27T17:28:29.409173+00:00 · methodology


Reference graph

Works this paper leans on

101 extracted references · 24 linked inside Pith

[1]

arXiv preprint arXiv:1809.00496 (2018) 4

Afouras, T., Chung, J.S., Zisserman, A.: Lrs3-ted: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 (2018) 4

Pith/arXiv arXiv 2018
[2]

arXiv preprint arXiv:2512.07802 (2025) 5, 14

An, Z., Jia, M., Qiu, H., Zhou, Z., Huang, X., Liu, Z., Ren, W., Kahatapitiya, K., Liu, D., He, S., et al.: On- estory: Coherent multi-shot video generation with adap- tive memory. arXiv preprint arXiv:2512.07802 (2025) 5, 14

arXiv 2025
[3]

In: Proceedings of the IEEE/CVF international conference on computer vision, pp

Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 1728–1738 (2021) 4

2021
[4]

Bordwell, D., Thompson, K., Smith, J.: Film art: An introduction, vol. 7. McGraw-Hill New York (2008) 8, 9

2008
[5]

arXiv preprint arXiv:2504.13074 (2025) 4, 6

Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: Skyreels- v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025) 4, 6

Pith/arXiv arXiv 2025
[6]

arXiv preprint arXiv:2602.21818 (2026) 14

Chen, G., Lin, D., Yang, J., Zhang, Y., Fei, Z., Li, D., Chen, S., Ao, C., Pang, N., Wang, Y., et al.: Skyreels- v4: Multi-modal video-audio generation, inpainting and editing model. arXiv preprint arXiv:2602.21818 (2026) 14

arXiv 2026
[7]

In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: Vggsound: A large-scale audio-visual dataset. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 721–725. IEEE (2020) 4

2020
[8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Chen, T.S., Siarohin, A., Menapace, W., Deyneka, E., Chao, H.w., Jeon, B.E., Fang, Y., Lee, H.Y., Ren, J., Yang, M.H., et al.: Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13320–13331 (2024) 4

2024
[9]

In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

Cheng, H.K., Ishii, M., Hayakawa, A., Shibuya, T., Schwing, A., Mitsufuji, Y.: Mmaudio: Taming multi- modal joint training for high-quality video-to-audio syn- thesis. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 28901–28911 (2025) 5

2025
[10]

In: Asian conference on computer vision, pp

Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Asian conference on computer vision, pp. 251–263. Springer (2016) 7, 13

2016
[11]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Addi- tive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699 (2019) 13

2019
[12]

Assemblage (10), 111–131 (1989) 8

Eisenstein, S.M., Bois, Y.A., Glenny, M.: Montage and architecture. Assemblage (10), 111–131 (1989) 8

1989
[13]

BenchCouncil Transactions on Benchmarks, Stan- dards and Evaluations3(4), 100152 (2023) 5

Fan, F., Luo, C., Gao, W., Zhan, J.: Aigcbench: Compre- hensive evaluation of image-to-video content generated by ai. BenchCouncil Transactions on Benchmarks, Stan- dards and Evaluations3(4), 100152 (2023) 5

2023
[14]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15180–15190 (2023) 7, 13

2023
[15]

arXiv preprint arXiv:2505.04946 (2025) 5

Guo, X., Huo, J., Shi, Z., Song, Z., Zhang, J., Zhao, J.: T2vtextbench: A human evaluation benchmark for tex- tual control in video generation models. arXiv preprint arXiv:2505.04946 (2025) 5

arXiv 2025
[16]

arXiv preprint arXiv:2601.03233 (2026) 4, 5, 9, 13, 17, 19

HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifs- chitz, D., Moshe, D., Porat, E., et al.: Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026) 4, 5, 9, 13, 17, 19

Pith/arXiv arXiv 2026
[17]

arXiv preprint arXiv:2605.15199 (2026) 5

He, R., Wei, M., Yang, Z., Ordonez, V.: Entitybench: Towards entity-consistent long-range multi-shot video generation. arXiv preprint arXiv:2605.15199 (2026) 5

Pith/arXiv arXiv 2026
[18]

arXiv preprint arXiv:2509.22799 (2025) 5

He, X., Jiang, D., Nie, P., Liu, M., Jiang, Z., Su, M., Ma, W., Lin, J., Ye, C., Lu, Y., et al.: Videoscore2: Think before you score in generative video evaluation. arXiv preprint arXiv:2509.22799 (2025) 5

arXiv 2025
[19]

In: Pro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp

He, X., Jiang, D., Zhang, G., Ku, M., Soni, A., Siu, S., Chen, H., Chandra, A., Jiang, Z., Arulraj, A., et al.: Videoscore: Building automatic metrics to simulate fine- grained human feedback for video generation. In: Pro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 2105–2123 (2024) 5, 13 CineDance: Towards Nex...

2024
[20]

arXiv preprint arXiv:2511.21579 (2025) 5

Hu, T., Yu, Z., Zhang, G., Su, Z., Zhou, Z., Zhang, Y., Zhou, Y., Lu, Q., Yi, R.: Harmony: Harmonizing audio and video generation through cross-task synergy. arXiv preprint arXiv:2511.21579 (2025) 5

arXiv 2025
[21]

arXiv preprint arXiv:2505.04512 (2025) 1

Hu, T., Yu, Z., Zhou, Z., Liang, S., Zhou, Y., Lin, Q., Lu, Q.: Hunyuancustom: A multimodal-driven architec- ture for customized video generation. arXiv preprint arXiv:2505.04512 (2025) 1

arXiv 2025
[22]

arXiv preprint arXiv:2604.06339 (2026) 1

Hu, T., Zhang, J., Huang, H., Yi, R., Su, Z., Weng, J., Xue, Z., Ma, L., Yang, M.H., Tao, D.: Evolution of video generative foundations. arXiv preprint arXiv:2604.06339 (2026) 1

Pith/arXiv arXiv 2026
[23]

arXiv preprint arXiv:2510.18775 (2025) 1

Hu, T., Zhang, J., Su, Z., Yi, R.: Ultragen: High- resolution video generation with hierarchical attention. arXiv preprint arXiv:2510.18775 (2025) 1

arXiv 2025
[24]

arXiv preprint arXiv:2512.09299 (2025) 5

Hua, D., Wang, X., Zeng, B., Huang, X., Liang, H., Niu, J., Chen, X., Xu, Q., Zhang, W.: Vabench: A compre- hensive benchmark for audio-video generation. arXiv preprint arXiv:2512.09299 (2025) 5

Pith/arXiv arXiv 2025
[25]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807– 21818 (2024) 4, 5, 12

2024
[26]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2025) 4, 5

Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., et al.: Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025) 4, 5

2025
[27]

GitHub repository (2020) 7

Jaided, A.: Easyocr: Ready-to-use ocr with 80+ sup- ported languages. GitHub repository (2020) 7

2020
[28]

arXiv preprint arXiv:2512.14699 (2025) 5

Ji, S., Chen, X., Yang, S., Tao, X., Wan, P., Zhao, H.: Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699 (2025) 5

arXiv 2025
[29]

arXiv preprint arXiv:2510.18692 (2025) 5

Jia, W., Lu, Y., Huang, M., Wang, H., Huang, B., Chen, N., Liu, M., Jiang, J., Mao, Z.: Moga: Mixture-of-groups attention for end-to-end long video generation. arXiv preprint arXiv:2510.18692 (2025) 5

arXiv 2025
[30]

Advances in Neural Information Processing Systems37, 48955–48970 (2024) 2, 4, 6, 7, 10, 11

Ju, X., Gao, Y., Zhang, Z., Yuan, Z., Wang, X., Zeng, A., Xiong, Y., Xu, Q., Shan, Y.: Miradata: A large- scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems37, 48955–48970 (2024) 2, 4, 6, 7, 10, 11

2024
[31]

In: Proceedings of the Com- puter Vision and Pattern Recognition Conference, pp

Kara, O., Singh, K.K., Liu, F., Ceylan, D., Rehg, J.M., Hinz, T.: Shotadapter: Text-to-multi-shot video genera- tion with diffusion models. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference, pp. 28405–28415 (2025) 5

2025
[32]

In: Pro- ceedings of the IEEE/CVF international conference on computer vision, pp

Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: Musiq: Multi-scale image quality transformer. In: Pro- ceedings of the IEEE/CVF international conference on computer vision, pp. 5148–5157 (2021) 12

2021
[33]

arXiv preprint arXiv:2412.03603 (2024) 1, 4

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuan- video: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 1, 4

Pith/arXiv arXiv 2024
[34]

In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., et al.: Openhu- manvid: A large-scale high-quality dataset for enhancing human-centric video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 7752–7762 (2025) 4, 6, 10, 18

2025
[35]

In: Proceedings of the IEEE/CVF ConferenceonComputerVisionandPatternRecognition, pp

Li, Z., Zhu, Z.L., Han, L.H., Hou, Q., Guo, C.L., Cheng, M.M.: Amt: All-pairs multi-field transforms for efficient frame interpolation. In: Proceedings of the IEEE/CVF ConferenceonComputerVisionandPatternRecognition, pp. 9801–9810 (2023) 7, 12

2023
[36]

Advances in Neural Information Processing Systems37, 109790– 109816 (2024) 5

Liao, M., Lu, H., Zhang, X., Wan, F., Wang, T., Zhao, Y., Zuo, W., Ye, Q., Wang, J.: Evaluation of text-to-video generation models: A dynamics perspective. Advances in Neural Information Processing Systems37, 109790– 109816 (2024) 5

2024
[37]

arXiv preprint arXiv:2412.00131 (2024) 4

Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al.: Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131 (2024) 4

Pith/arXiv arXiv 2024
[38]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

Ling, X., Zhu, C., Wu, M., Li, H., Feng, X., Yang, C., Hao, A., Zhu, J., Wu, J., Chu, X.: Vmbench: A bench- mark for perception-aligned video motion generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13087–13098 (2025) 5

2025
[39]

arXiv preprint arXiv:2503.23377 (2025) 5

Liu, K., Li, W., Chen, L., Wu, S., Zheng, Y., Ji, J., Zhou, F., Luo, J., Liu, Z., Fei, H., et al.: Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377 (2025) 5

arXiv 2025
[40]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: Evalcrafter: Benchmarking and evaluating large video generation models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22139– 22149 (2024) 5

2024
[41]

Advances in Neural Information Processing Systems36, 62352–62387 (2023) 4, 5

Liu, Y., Li, L., Ren, S., Gao, R., Li, S., Chen, S., Sun, X., Hou, L.: Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems36, 62352–62387 (2023) 4, 5

2023
[42]

arXiv preprint arXiv:1711.05101 (2017) 16

Loshchilov, I., Hutter, F.: Decoupled weight decay regu- larization. arXiv preprint arXiv:1711.05101 (2017) 16

Pith/arXiv arXiv 2017
[43]

arXiv preprint arXiv:2510.01284 (2025) 5, 9

Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284 (2025) 5, 9

Pith/arXiv arXiv 2025
[44]

In: Proceedings of the AAAI Conference on Artificial Intelligence, vol

Luo, X., Li, Q., Liu, X., Qin, W., Yang, M., Wang, M., Wan, P., Zhang, D., Gai, K., Huang, S.L.: Filmweaver: Weaving consistent multi-shot videos with cache-guided autoregressive diffusion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, pp. 7689– 7697 (2026) 5

2026
[45]

arXiv preprint arXiv:2603.25746 (2026) 5

Luo, Y., Shi, X., Zhuang, J., Chen, Y., Liu, Q., Wang, X., Wan, P., Xue, T.: Shotstream: Streaming multi- shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746 (2026) 5

arXiv 2026
[46]

In: Proceedings of the 32nd ACM International Conference on Multimedia, pp

Mao, Y., Shen, X., Zhang, J., Qin, Z., Zhou, J., Xiang, M., Zhong, Y., Dai, Y.: Tavgbench: Benchmarking text to audible-video generation. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6607– 6616 (2024) 5

2024
[47]

arXiv preprint arXiv:2510.20822 (2025) 5, 14, 17, 19

Meng, Y., Ouyang, H., Yu, Y., Wang, Q., Wang, W., Cheng, K.L., Wang, H., Li, Y., Chen, C., Zeng, Y., et al.: Holocine: Holistic generation of cinematic multi-shot long video narratives. arXiv preprint arXiv:2510.20822 (2025) 5, 14, 17, 19

arXiv 2025
[48]

Communications8(1), 120–124 (1966) 8

Metz, C.: La grande syntagmatique du film narratif. Communications8(1), 120–124 (1966) 8

1966
[49]

In: Proceedings of the IEEE/CVF international conference on computer vision, pp

Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 2630–2640 (2019) 4, 10

2019
[50]

arXiv preprint arXiv:1706.08612 (2017) 4 22 Chen et al

Nagrani, A., Chung, J.S., Zisserman, A.: Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612 (2017) 4 22 Chen et al

arXiv 2017
[51]

arXiv preprint arXiv:2407.02371 (2024) 4, 10

Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 (2024) 4, 10

Pith/arXiv arXiv 2024
[52]

arXiv preprint arXiv:2304.07193 (2023) 13

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning ro- bust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 13

Pith/arXiv arXiv 2023
[53]

arXiv preprint arXiv:2603.24458 (2026) 5

Pan, K., Tian, Q., Zhang, J., Kong, W., Xiong, J., Long, Y., Zhang, S., Qiu, H., Wang, T., Lv, Z., et al.: Omniweaving: Towards unified video generation with free-form composition and reasoning. arXiv preprint arXiv:2603.24458 (2026) 5

arXiv 2026
[54]

In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp

Phung, Q., Mai, L., Heilbron, F.D.C., Liu, F., Huang, J.B., Ham, C.: Cineverse: Consistent keyframe synthesis for cinematic scene composition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2626–2636 (2026) 5

2026
[55]

In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

Qi, T., Yuan, J., Feng, W., Fang, S., Liu, J., Zhou, S., He, Q., Xie, H., Zhang, Y.: Maskˆ 2dit: Dual mask- based diffusion transformer for multi-scene long video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18837–18846 (2025) 5, 14, 17, 19

2025
[56] Reddy, C.K., Gopal, V., Cutler, R.: Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6493–6497. IEEE (2021)
[57] Seedance, T., Chen, D., Chen, L., Chen, X., Chen, Y., Chen, Z., Chen, Z., Cheng, F., Cheng, T., Cheng, Y., et al.: Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148 (2026)
[58] Shi, H., Li, Y., Deng, N., Xu, Z., Chen, X., Wang, L., Hu, B., Zhang, M.: Msvbench: Towards human-level evaluation of multi-shot video generation. arXiv preprint arXiv:2602.23969 (2026)
[59] Soucek, T., Lokoc, J.: Transnet v2: An effective deep network architecture for fast shot transition detection. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11218–11221 (2024)
[60] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)
[61] Sun, K., Huang, K., Liu, X., Wu, Y., Xu, Z., Li, Z., Liu, X.: T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 8406–8416 (2025)
[62] Tan, Z., Yang, X., Qin, L., Li, H.: Vidgen-1m: A large-scale dataset for text-to-video generation. arXiv preprint arXiv:2408.02629 (2024)
[63] Team, O., Yu, D., Chen, M., Chen, Q., Luo, Q., Wu, Q., Cheng, Q., Li, R., Liang, T., Zhang, W., et al.: Mova: Towards scalable and synchronized video-audio generation. arXiv preprint arXiv:2602.08794 (2026)
[64] Tjandra, A., Wu, Y.C., Guo, B., Hoffman, J., Ellis, B., Vyas, A., Shi, B., Chen, S., Le, M., Zacharov, N., et al.: Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139 (2025)
[65] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
[66] Wang, D., Zuo, W., Li, A., Chen, L.H., Liao, X., Zhou, D., Yin, Z., Dai, X., Jiang, D., Yu, G.: Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155 (2025)
[67] Wang, Q., Shi, X., Li, B., Bian, W., Liu, Q., Lu, H., Wang, X., Wan, P., Gai, K., Jia, X.: Multishotmaster: A controllable multi-shot video generation framework. arXiv preprint arXiv:2512.03041 (2025)
[68] Wang, Q., Shi, Y., Ou, J., Chen, R., Lin, K., Wang, J., Jiang, B., Yang, H., Zheng, M., Tao, X., et al.: Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 8428–8437 (2025)
[69] Wang, W., Yang, Y.: Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems 37, 65618–65642 (2024)
[70] Wang, W., Yang, Y.: Videoufo: A million-scale user-focused dataset for text-to-video generation. arXiv preprint arXiv:2503.01739 (2025)
[71] Wang, Y., He, Y., Li, Y., Li, K., Yu, J., Ma, X., Li, X., Chen, G., Chen, X., Wang, Y., et al.: Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942 (2023)
[72] Wen, Y., Wu, J., Jain, A., Goldstein, T., Panda, A.: Analysis of attention in video diffusion transformers. arXiv preprint arXiv:2504.10317 (2025)
[73] Wu, B., Zou, C., Li, C., Huang, D., Yang, F., Tan, H., Peng, J., Wu, J., Xiong, J., Jiang, J., et al.: Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870 (2025)
[74] Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., Lin, W.: Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 20144–20154 (2023)
[75] Wu, W., Liu, M., Zhu, Z., Xia, X., Feng, H., Wang, W., Lin, K.Q., Shen, C., Shou, M.Z.: Moviebench: A hierarchical movie level dataset for long video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 28984–28994 (2025)
[76] Wu, W., Zhu, Z., Shou, M.Z.: Automated movie generation via multi-agent cot planning. arXiv preprint arXiv:2503.07314 (2025)
[77] Wu, X., Gao, B., Qiao, Y., Wang, Y., Chen, X.: Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models. arXiv preprint arXiv:2508.11484 (2025)
[78] Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., Dubnov, S.: Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
[79] Xiao, J., Yang, C., Zhang, L., Cai, S., Zhao, Y., Guo, Y., Wetzstein, G., Agrawala, M., Yuille, A., Jiang, L.: Captain cinema: Towards short movie generation. In: The Fourteenth International Conference on Learning Representations (2025)
[80] Xie, Z., Tang, D., Tan, D., Klein, J., Bissyandé, T.F., Ezzini, S.: Dreamfactory: Pioneering multi-scene long video generation with a multi-agent framework. arXiv preprint arXiv:2408.11788 (2024)
