CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation
Pith reviewed 2026-06-27 17:28 UTC · model grok-4.3
The pith
CineDance-1M dataset built with film-theory parsing and dual captioning trains models for aligned multi-shot audio-video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CineDance-1M supplies structured, dual-modal annotations for multi-shot long-form text-to-audio-video generation; its three-stage pipeline of diverse sourcing with cleansing, film-theory-inspired narrative parsing, and hierarchical dual-modal captioning yields clips that let an adapted LTX-2.3 model reach high single-modality fidelity, tight audio-video synchronization, and stable subject and scene consistency across shots.
What carries the argument
The three-stage curation pipeline that assembles CineDance-1M by combining broad video sourcing, narrative parsing informed by film theory, and hierarchical captioning for both audio and video tracks.
If this is right
- Models trained on the dataset can sustain subject and environment consistency across dozens of shots in a single generated clip.
- The structured annotations allow joint optimization of audio and video so that sound events match on-screen actions in multi-shot sequences.
- CineBench supplies a repeatable six-dimensional metric set that future work can use to compare narrative audio-video systems.
- Open-source generators become viable for story-length cinematic output once they are trained on data of this structural complexity.
Where Pith is reading between the lines
- The curation steps could be reused to enlarge existing video datasets with narrative-level labels rather than only short-clip descriptions.
- If the film-theory parsing step proves general, similar pipelines might improve training data for other sequential generation tasks such as music or dialogue synthesis.
- Longer average clip length in the dataset may encourage models to learn temporal planning that current short-clip training does not provide.
Load-bearing premise
The three-stage curation pipeline produces annotations and data quality superior to prior datasets for training multi-shot long-form joint audio-video models.
What would settle it
Training the same base model on CineDance-1M versus an existing dataset and finding no measurable gain in the six CineBench dimensions for alignment, narrative coherence, or subject consistency would undermine the claim that the new curation method is superior.
Figures
read the original abstract
The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems showremarkableabilitytogeneratecinematicnarratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CineDance-1M, a 1M-scale Text-to-Audio-Video dataset for multi-shot long-form generation, averaging 92.8 seconds and 24.2 shots per video. It is created using a three-stage curation pipeline of diverse sourcing and cleansing, film-theory-inspired narrative parsing, and hierarchical dual-modal captioning. The paper also introduces CineBench, a benchmark with diverse prompts and a six-dimensional human-aligned metric system. An adaptation of LTX-2.3 called CineDance is presented, which is claimed to achieve exceptional single-modality quality, precise audio-video alignment, and robust consistency, thereby validating the dataset and pipeline.
Significance. If the claims regarding the dataset quality and model performance hold, this would represent a notable contribution to the field by addressing the scarcity of high-quality training data for complex cinematic audio-video generation. The structured annotations and the proposed benchmark could facilitate more rigorous evaluation and training of open-source models in this area.
major comments (2)
- Abstract: The central claim that the three-stage curation pipeline produces data enabling 'exceptional' performance in the adapted CineDance model is not supported by any reported ablations, quantitative comparisons to prior T2AV datasets, or baseline results. Without such evidence, it is unclear whether the reported performance stems from the pipeline, the base model adaptation, or other factors.
- Abstract: No specific quantitative results, ablation studies, or direct comparisons are provided to verify the superiority of CineDance-1M over existing datasets or the contribution of each pipeline stage.
minor comments (2)
- Abstract: There is a typographical error: 'showremarkableabilitytogeneratecinematicnarratives' should include spaces as 'show remarkable ability to generate cinematic narratives'.
- Abstract: The description of CineBench's 'six-dimensional, human-aligned metric system' lacks details on what the dimensions are or how they are computed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger empirical support in the abstract. We address each major comment below and commit to revisions that add the requested evidence without overstating current results.
read point-by-point responses
-
Referee: [—] Abstract: The central claim that the three-stage curation pipeline produces data enabling 'exceptional' performance in the adapted CineDance model is not supported by any reported ablations, quantitative comparisons to prior T2AV datasets, or baseline results. Without such evidence, it is unclear whether the reported performance stems from the pipeline, the base model adaptation, or other factors.
Authors: We agree that the abstract's phrasing requires supporting evidence to be fully substantiated. The manuscript reports qualitative and quantitative evaluations of CineDance on CineBench in Section 4, including consistency and alignment metrics relative to the base LTX-2.3 model. To directly address the concern, we will add explicit ablation studies isolating pipeline stages and quantitative comparisons against prior T2AV datasets in the revised manuscript, clarifying the sources of performance gains. revision: yes
-
Referee: [—] Abstract: No specific quantitative results, ablation studies, or direct comparisons are provided to verify the superiority of CineDance-1M over existing datasets or the contribution of each pipeline stage.
Authors: We acknowledge the absence of these elements in the current version. The abstract summarizes findings from the experimental section, but we will revise it to reference specific quantitative results and will incorporate ablation studies on each curation stage plus direct comparisons to existing datasets in the main text to demonstrate superiority and stage contributions. revision: yes
Circularity Check
No circularity: empirical dataset curation and benchmark contribution
full rationale
The paper presents CineDance-1M as a new T2AV dataset created via a three-stage curation pipeline, introduces CineBench, and reports that an adapted LTX-2.3 model (CineDance) performs well on it. No equations, fitted parameters, or predictions appear in the provided text. The central claim that the pipeline yields superior data is an empirical assertion supported only by the model's reported performance, without ablations or external comparisons, but this is a standard evidence gap rather than a logical reduction to self-definition, self-citation, or renaming. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The work is self-contained as a data and benchmark release with no derivation chain that collapses by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Film-theory-inspired narrative parsing produces annotations that improve training for multi-shot long-form generation.
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:1809.00496 (2018) 4
Afouras, T., Chung, J.S., Zisserman, A.: Lrs3-ted: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 (2018) 4
Pith/arXiv arXiv 2018
-
[2]
arXiv preprint arXiv:2512.07802 (2025) 5, 14
An, Z., Jia, M., Qiu, H., Zhou, Z., Huang, X., Liu, Z., Ren, W., Kahatapitiya, K., Liu, D., He, S., et al.: On- estory: Coherent multi-shot video generation with adap- tive memory. arXiv preprint arXiv:2512.07802 (2025) 5, 14
arXiv 2025
-
[3]
In: Proceedings of the IEEE/CVF international conference on computer vision, pp
Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 1728–1738 (2021) 4
2021
-
[4]
Bordwell, D., Thompson, K., Smith, J.: Film art: An introduction, vol. 7. McGraw-Hill New York (2008) 8, 9
2008
-
[5]
arXiv preprint arXiv:2504.13074 (2025) 4, 6
Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: Skyreels- v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025) 4, 6
Pith/arXiv arXiv 2025
-
[6]
arXiv preprint arXiv:2602.21818 (2026) 14
Chen, G., Lin, D., Yang, J., Zhang, Y., Fei, Z., Li, D., Chen, S., Ao, C., Pang, N., Wang, Y., et al.: Skyreels- v4: Multi-modal video-audio generation, inpainting and editing model. arXiv preprint arXiv:2602.21818 (2026) 14
arXiv 2026
-
[7]
In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp
Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: Vggsound: A large-scale audio-visual dataset. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 721–725. IEEE (2020) 4
2020
-
[8]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Chen, T.S., Siarohin, A., Menapace, W., Deyneka, E., Chao, H.w., Jeon, B.E., Fang, Y., Lee, H.Y., Ren, J., Yang, M.H., et al.: Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13320–13331 (2024) 4
2024
-
[9]
In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp
Cheng, H.K., Ishii, M., Hayakawa, A., Shibuya, T., Schwing, A., Mitsufuji, Y.: Mmaudio: Taming multi- modal joint training for high-quality video-to-audio syn- thesis. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 28901–28911 (2025) 5
2025
-
[10]
In: Asian conference on computer vision, pp
Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Asian conference on computer vision, pp. 251–263. Springer (2016) 7, 13
2016
-
[11]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp
Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Addi- tive angular margin loss for deep face recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699 (2019) 13
2019
-
[12]
Assemblage (10), 111–131 (1989) 8
Eisenstein, S.M., Bois, Y.A., Glenny, M.: Montage and architecture. Assemblage (10), 111–131 (1989) 8
1989
-
[13]
BenchCouncil Transactions on Benchmarks, Stan- dards and Evaluations3(4), 100152 (2023) 5
Fan, F., Luo, C., Gao, W., Zhan, J.: Aigcbench: Compre- hensive evaluation of image-to-video content generated by ai. BenchCouncil Transactions on Benchmarks, Stan- dards and Evaluations3(4), 100152 (2023) 5
2023
-
[14]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp
Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15180–15190 (2023) 7, 13
2023
-
[15]
arXiv preprint arXiv:2505.04946 (2025) 5
Guo, X., Huo, J., Shi, Z., Song, Z., Zhang, J., Zhao, J.: T2vtextbench: A human evaluation benchmark for tex- tual control in video generation models. arXiv preprint arXiv:2505.04946 (2025) 5
arXiv 2025
-
[16]
arXiv preprint arXiv:2601.03233 (2026) 4, 5, 9, 13, 17, 19
HaCohen, Y., Brazowski, B., Chiprut, N., Bitterman, Y., Kvochko, A., Berkowitz, A., Shalem, D., Lifs- chitz, D., Moshe, D., Porat, E., et al.: Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233 (2026) 4, 5, 9, 13, 17, 19
Pith/arXiv arXiv 2026
-
[17]
arXiv preprint arXiv:2605.15199 (2026) 5
He, R., Wei, M., Yang, Z., Ordonez, V.: Entitybench: Towards entity-consistent long-range multi-shot video generation. arXiv preprint arXiv:2605.15199 (2026) 5
Pith/arXiv arXiv 2026
-
[18]
arXiv preprint arXiv:2509.22799 (2025) 5
He, X., Jiang, D., Nie, P., Liu, M., Jiang, Z., Su, M., Ma, W., Lin, J., Ye, C., Lu, Y., et al.: Videoscore2: Think before you score in generative video evaluation. arXiv preprint arXiv:2509.22799 (2025) 5
arXiv 2025
-
[19]
In: Pro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp
He, X., Jiang, D., Zhang, G., Ku, M., Soni, A., Siu, S., Chen, H., Chandra, A., Jiang, Z., Arulraj, A., et al.: Videoscore: Building automatic metrics to simulate fine- grained human feedback for video generation. In: Pro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 2105–2123 (2024) 5, 13 CineDance: Towards Nex...
2024
-
[20]
arXiv preprint arXiv:2511.21579 (2025) 5
Hu, T., Yu, Z., Zhang, G., Su, Z., Zhou, Z., Zhang, Y., Zhou, Y., Lu, Q., Yi, R.: Harmony: Harmonizing audio and video generation through cross-task synergy. arXiv preprint arXiv:2511.21579 (2025) 5
arXiv 2025
-
[21]
arXiv preprint arXiv:2505.04512 (2025) 1
Hu, T., Yu, Z., Zhou, Z., Liang, S., Zhou, Y., Lin, Q., Lu, Q.: Hunyuancustom: A multimodal-driven architec- ture for customized video generation. arXiv preprint arXiv:2505.04512 (2025) 1
arXiv 2025
-
[22]
arXiv preprint arXiv:2604.06339 (2026) 1
Hu, T., Zhang, J., Huang, H., Yi, R., Su, Z., Weng, J., Xue, Z., Ma, L., Yang, M.H., Tao, D.: Evolution of video generative foundations. arXiv preprint arXiv:2604.06339 (2026) 1
Pith/arXiv arXiv 2026
-
[23]
arXiv preprint arXiv:2510.18775 (2025) 1
Hu, T., Zhang, J., Su, Z., Yi, R.: Ultragen: High- resolution video generation with hierarchical attention. arXiv preprint arXiv:2510.18775 (2025) 1
arXiv 2025
-
[24]
arXiv preprint arXiv:2512.09299 (2025) 5
Hua, D., Wang, X., Zeng, B., Huang, X., Liang, H., Niu, J., Chen, X., Xu, Q., Zhang, W.: Vabench: A compre- hensive benchmark for audio-video generation. arXiv preprint arXiv:2512.09299 (2025) 5
Pith/arXiv arXiv 2025
-
[25]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807– 21818 (2024) 4, 5, 12
2024
-
[26]
IEEE Transactions on Pattern Analysis and Machine Intelligence (2025) 4, 5
Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., et al.: Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025) 4, 5
2025
-
[27]
GitHub repository (2020) 7
Jaided, A.: Easyocr: Ready-to-use ocr with 80+ sup- ported languages. GitHub repository (2020) 7
2020
-
[28]
arXiv preprint arXiv:2512.14699 (2025) 5
Ji, S., Chen, X., Yang, S., Tao, X., Wan, P., Zhao, H.: Memflow: Flowing adaptive memory for consistent and efficient long video narratives. arXiv preprint arXiv:2512.14699 (2025) 5
arXiv 2025
-
[29]
arXiv preprint arXiv:2510.18692 (2025) 5
Jia, W., Lu, Y., Huang, M., Wang, H., Huang, B., Chen, N., Liu, M., Jiang, J., Mao, Z.: Moga: Mixture-of-groups attention for end-to-end long video generation. arXiv preprint arXiv:2510.18692 (2025) 5
arXiv 2025
-
[30]
Advances in Neural Information Processing Systems37, 48955–48970 (2024) 2, 4, 6, 7, 10, 11
Ju, X., Gao, Y., Zhang, Z., Yuan, Z., Wang, X., Zeng, A., Xiong, Y., Xu, Q., Shan, Y.: Miradata: A large- scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems37, 48955–48970 (2024) 2, 4, 6, 7, 10, 11
2024
-
[31]
In: Proceedings of the Com- puter Vision and Pattern Recognition Conference, pp
Kara, O., Singh, K.K., Liu, F., Ceylan, D., Rehg, J.M., Hinz, T.: Shotadapter: Text-to-multi-shot video genera- tion with diffusion models. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference, pp. 28405–28415 (2025) 5
2025
-
[32]
In: Pro- ceedings of the IEEE/CVF international conference on computer vision, pp
Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: Musiq: Multi-scale image quality transformer. In: Pro- ceedings of the IEEE/CVF international conference on computer vision, pp. 5148–5157 (2021) 12
2021
-
[33]
arXiv preprint arXiv:2412.03603 (2024) 1, 4
Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuan- video: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 1, 4
Pith/arXiv arXiv 2024
-
[34]
In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp
Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., et al.: Openhu- manvid: A large-scale high-quality dataset for enhancing human-centric video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 7752–7762 (2025) 4, 6, 10, 18
2025
-
[35]
In: Proceedings of the IEEE/CVF ConferenceonComputerVisionandPatternRecognition, pp
Li, Z., Zhu, Z.L., Han, L.H., Hou, Q., Guo, C.L., Cheng, M.M.: Amt: All-pairs multi-field transforms for efficient frame interpolation. In: Proceedings of the IEEE/CVF ConferenceonComputerVisionandPatternRecognition, pp. 9801–9810 (2023) 7, 12
2023
-
[36]
Advances in Neural Information Processing Systems37, 109790– 109816 (2024) 5
Liao, M., Lu, H., Zhang, X., Wan, F., Wang, T., Zhao, Y., Zuo, W., Ye, Q., Wang, J.: Evaluation of text-to-video generation models: A dynamics perspective. Advances in Neural Information Processing Systems37, 109790– 109816 (2024) 5
2024
-
[37]
arXiv preprint arXiv:2412.00131 (2024) 4
Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al.: Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131 (2024) 4
Pith/arXiv arXiv 2024
-
[38]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp
Ling, X., Zhu, C., Wu, M., Li, H., Feng, X., Yang, C., Hao, A., Zhu, J., Wu, J., Chu, X.: Vmbench: A bench- mark for perception-aligned video motion generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13087–13098 (2025) 5
2025
-
[39]
arXiv preprint arXiv:2503.23377 (2025) 5
Liu, K., Li, W., Chen, L., Wu, S., Zheng, Y., Ji, J., Zhou, F., Luo, J., Liu, Z., Fei, H., et al.: Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377 (2025) 5
arXiv 2025
-
[40]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp
Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: Evalcrafter: Benchmarking and evaluating large video generation models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22139– 22149 (2024) 5
2024
-
[41]
Advances in Neural Information Processing Systems36, 62352–62387 (2023) 4, 5
Liu, Y., Li, L., Ren, S., Gao, R., Li, S., Chen, S., Sun, X., Hou, L.: Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems36, 62352–62387 (2023) 4, 5
2023
-
[42]
arXiv preprint arXiv:1711.05101 (2017) 16
Loshchilov, I., Hutter, F.: Decoupled weight decay regu- larization. arXiv preprint arXiv:1711.05101 (2017) 16
Pith/arXiv arXiv 2017
-
[43]
arXiv preprint arXiv:2510.01284 (2025) 5, 9
Low, C., Wang, W., Katyal, C.: Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284 (2025) 5, 9
Pith/arXiv arXiv 2025
-
[44]
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol
Luo, X., Li, Q., Liu, X., Qin, W., Yang, M., Wang, M., Wan, P., Zhang, D., Gai, K., Huang, S.L.: Filmweaver: Weaving consistent multi-shot videos with cache-guided autoregressive diffusion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, pp. 7689– 7697 (2026) 5
2026
-
[45]
arXiv preprint arXiv:2603.25746 (2026) 5
Luo, Y., Shi, X., Zhuang, J., Chen, Y., Liu, Q., Wang, X., Wan, P., Xue, T.: Shotstream: Streaming multi- shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746 (2026) 5
arXiv 2026
-
[46]
In: Proceedings of the 32nd ACM International Conference on Multimedia, pp
Mao, Y., Shen, X., Zhang, J., Qin, Z., Zhou, J., Xiang, M., Zhong, Y., Dai, Y.: Tavgbench: Benchmarking text to audible-video generation. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6607– 6616 (2024) 5
2024
-
[47]
arXiv preprint arXiv:2510.20822 (2025) 5, 14, 17, 19
Meng, Y., Ouyang, H., Yu, Y., Wang, Q., Wang, W., Cheng, K.L., Wang, H., Li, Y., Chen, C., Zeng, Y., et al.: Holocine: Holistic generation of cinematic multi-shot long video narratives. arXiv preprint arXiv:2510.20822 (2025) 5, 14, 17, 19
arXiv 2025
-
[48]
Communications8(1), 120–124 (1966) 8
Metz, C.: La grande syntagmatique du film narratif. Communications8(1), 120–124 (1966) 8
1966
-
[49]
In: Proceedings of the IEEE/CVF international conference on computer vision, pp
Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 2630–2640 (2019) 4, 10
2019
-
[50]
arXiv preprint arXiv:1706.08612 (2017) 4 22 Chen et al
Nagrani, A., Chung, J.S., Zisserman, A.: Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612 (2017) 4 22 Chen et al
arXiv 2017
-
[51]
arXiv preprint arXiv:2407.02371 (2024) 4, 10
Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 (2024) 4, 10
Pith/arXiv arXiv 2024
-
[52]
arXiv preprint arXiv:2304.07193 (2023) 13
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning ro- bust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 13
Pith/arXiv arXiv 2023
-
[53]
arXiv preprint arXiv:2603.24458 (2026) 5
Pan, K., Tian, Q., Zhang, J., Kong, W., Xiong, J., Long, Y., Zhang, S., Qiu, H., Wang, T., Lv, Z., et al.: Omniweaving: Towards unified video generation with free-form composition and reasoning. arXiv preprint arXiv:2603.24458 (2026) 5
arXiv 2026
-
[54]
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp
Phung, Q., Mai, L., Heilbron, F.D.C., Liu, F., Huang, J.B., Ham, C.: Cineverse: Consistent keyframe synthesis for cinematic scene composition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2626–2636 (2026) 5
2026
-
[55]
In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp
Qi, T., Yuan, J., Feng, W., Fang, S., Liu, J., Zhou, S., He, Q., Xie, H., Zhang, Y.: Maskˆ 2dit: Dual mask- based diffusion transformer for multi-scene long video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18837–18846 (2025) 5, 14, 17, 19
2025
-
[56]
In: ICASSP 2021-2021 IEEE InternationalConferenceonAcoustics,SpeechandSignal Processing (ICASSP), pp
Reddy, C.K., Gopal, V., Cutler, R.: Dnsmos: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors. In: ICASSP 2021-2021 IEEE InternationalConferenceonAcoustics,SpeechandSignal Processing (ICASSP), pp. 6493–6497. IEEE (2021) 7
2021
-
[57]
arXiv preprint arXiv:2604.14148 (2026) 1, 14
Seedance, T., Chen, D., Chen, L., Chen, X., Chen, Y., Chen, Z., Chen, Z., Cheng, F., Cheng, T., Cheng, Y., et al.: Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148 (2026) 1, 14
Pith/arXiv arXiv 2026
-
[58]
arXiv preprint arXiv:2602.23969 (2026) 3, 4, 5
Shi, H., Li, Y., Deng, N., Xu, Z., Chen, X., Wang, L., Hu, B., Zhang, M.: Msvbench: Towards human-level evaluation of multi-shot video generation. arXiv preprint arXiv:2602.23969 (2026) 3, 4, 5
arXiv 2026
-
[59]
In: Proceedings of the 32nd ACM International Conference on Multimedia, pp
Soucek, T., Lokoc, J.: Transnet v2: An effective deep network architecture for fast shot transition detection. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11218–11221 (2024) 3, 8
2024
-
[60]
Neurocomputing568, 127063 (2024) 5, 15
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing568, 127063 (2024) 5, 15
2024
-
[61]
In: Proceed- ings of the Computer Vision and Pattern Recognition Conference, pp
Sun, K., Huang, K., Liu, X., Wu, Y., Xu, Z., Li, Z., Liu, X.: T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In: Proceed- ings of the Computer Vision and Pattern Recognition Conference, pp. 8406–8416 (2025) 4, 5
2025
-
[62]
arXiv preprint arXiv:2408.02629 (2024) 10
Tan, Z., Yang, X., Qin, L., Li, H.: Vidgen-1m: A large- scale dataset for text-to-video generation. arXiv preprint arXiv:2408.02629 (2024) 10
arXiv 2024
-
[63]
arXiv preprint arXiv:2602.08794 (2026) 9
Team, O., Yu, D., Chen, M., Chen, Q., Luo, Q., Wu, Q., Cheng, Q., Li, R., Liang, T., Zhang, W., et al.: Mova: Towards scalable and synchronized video-audio generation. arXiv preprint arXiv:2602.08794 (2026) 9
arXiv 2026
-
[64]
arXiv preprint arXiv:2502.05139 (2025) 12
Tjandra, A., Wu, Y.C., Guo, B., Hoffman, J., Ellis, B., Vyas, A., Shi, B., Chen, S., Le, M., Zacharov, N., et al.: Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. arXiv preprint arXiv:2502.05139 (2025) 12
Pith/arXiv arXiv 2025
-
[65]
arXiv preprint arXiv:2503.20314 (2025) 1, 4
Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025) 1, 4
Pith/arXiv arXiv 2025
-
[66]
arXiv preprint arXiv:2509.06155 (2025) 5
Wang, D., Zuo, W., Li, A., Chen, L.H., Liao, X., Zhou, D.,Yin,Z.,Dai,X.,Jiang,D., Yu,G.:Universe-1:Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155 (2025) 5
arXiv 2025
-
[67]
arXiv preprint arXiv:2512.03041 (2025) 5, 14, 17, 19
Wang, Q., Shi, X., Li, B., Bian, W., Liu, Q., Lu, H., Wang, X., Wan, P., Gai, K., Jia, X.: Multishotmaster: A controllable multi-shot video generation framework. arXiv preprint arXiv:2512.03041 (2025) 5, 14, 17, 19
arXiv 2025
-
[68]
In: Proceedings of the Computer Vision and Pattern Recog- nition Conference, pp
Wang, Q., Shi, Y., Ou, J., Chen, R., Lin, K., Wang, J., Jiang, B., Yang, H., Zheng, M., Tao, X., et al.: Koala- 36m: A large-scale video dataset improving consistency between fine-grained conditions and video content. In: Proceedings of the Computer Vision and Pattern Recog- nition Conference, pp. 8428–8437 (2025) 4, 6, 10
2025
-
[69]
Advances in Neural Information Processing Systems37, 65618–65642 (2024) 5
Wang, W., Yang, Y.: Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. Advances in Neural Information Processing Systems37, 65618–65642 (2024) 5
2024
-
[70]
arXiv preprint arXiv:2503.01739 (2025) 4
Wang, W., Yang, Y.: Videoufo: A million-scale user- focused dataset for text-to-video generation. arXiv preprint arXiv:2503.01739 (2025) 4
arXiv 2025
-
[71]
arXiv preprint arXiv:2307.06942 (2023) 4, 13
Wang, Y., He, Y., Li, Y., Li, K., Yu, J., Ma, X., Li, X., Chen, G., Chen, X., Wang, Y., et al.: Internvid: A large- scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942 (2023) 4, 13
Pith/arXiv arXiv 2023
-
[72]
arXiv preprint arXiv:2504.10317 (2025) 14
Wen, Y., Wu, J., Jain, A., Goldstein, T., Panda, A.: Analysis of attention in video diffusion transformers. arXiv preprint arXiv:2504.10317 (2025) 14
arXiv 2025
-
[73]
arXiv preprint arXiv:2511.18870 (2025) 4
Wu, B., Zou, C., Li, C., Huang, D., Yang, F., Tan, H., Peng, J., Wu, J., Xiong, J., Jiang, J., et al.: Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870 (2025) 4
Pith/arXiv arXiv 2025
-
[74]
In: Proceedings of the IEEE/CVF international conference on computer vision, pp
Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., Lin, W.: Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 20144– 20154 (2023) 7
2023
-
[75]
In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp
Wu, W., Liu, M., Zhu, Z., Xia, X., Feng, H., Wang, W., Lin, K.Q., Shen, C., Shou, M.Z.: Moviebench: A hierarchical movie level dataset for long video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 28984–28994 (2025) 7
2025
-
[76]
arXiv preprint arXiv:2503.07314 (2025) 5, 14, 17, 19
Wu, W., Zhu, Z., Shou, M.Z.: Automated movie gen- eration via multi-agent cot planning. arXiv preprint arXiv:2503.07314 (2025) 5, 14, 17, 19
arXiv 2025
-
[77]
arXiv preprint arXiv:2508.11484 (2025) 2, 5, 7, 10, 14, 17, 19
Wu, X., Gao, B., Qiao, Y., Wang, Y., Chen, X.: Cine- trans: Learning to generate videos with cinematic tran- sitions via masked diffusion models. arXiv preprint arXiv:2508.11484 (2025) 2, 5, 7, 10, 14, 17, 19
arXiv 2025
-
[78]
In: ICASSP 2023-2023 IEEE Interna- tional Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), pp
Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., Dubnov, S.: Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In: ICASSP 2023-2023 IEEE Interna- tional Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), pp. 1–5. IEEE (2023) 7
2023
-
[79]
In: The Fourteenth International Conference on Learning Representations (2025) 5
Xiao, J., Yang, C., Zhang, L., Cai, S., Zhao, Y., Guo, Y., Wetzstein, G., Agrawala, M., Yuille, A., Jiang, L.: Captain cinema: Towards short movie generation. In: The Fourteenth International Conference on Learning Representations (2025) 5
2025
-
[80]
Xie, Z., Tang, D., Tan, D., Klein, J., Bissyand, T.F., Ezzini, S.: Dreamfactory: Pioneering multi-scene long video generation with a multi-agent framework. arXiv preprint arXiv:2408.11788 (2024) 5 CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation 23
arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.