pith. sign in

arxiv: 2605.31590 · v1 · pith:AXFRYGEYnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Pith reviewed 2026-06-28 23:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-event video generationdiffusion transformertraining-free steeringtext-to-videodenoising trajectoryevent partitioningprompt fusionprogressive conditioning
0
0 comments X

The pith

Diffusion transformers contain turning points in denoising where text shifts from global layout to fine details, enabling a training-free steering method for multi-event videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper probes video diffusion transformers to locate intrinsic turning points in the denoising trajectory at which conditioning text predictably moves from shaping overall layout to controlling fine details. It then builds a progressive steering approach called TunerDiT that applies two handles at those points: event-partitioned masking to enforce boundaries while permitting transition bands, and cross-event prompt fusion to refine neighboring semantics late in the process. The method requires no extra training yet produces videos with stronger text alignment than prior training-free baselines, with the alignment gain growing as the number of distinct events increases. A reader would care because the approach exploits the model's own internal dynamics to handle long-horizon, multi-event generation without retraining costs.

Core claim

By identifying turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained details, TunerDiT applies event-partitioned masking and cross-event prompt fusion to achieve state-of-the-art results across eight metrics on multi-event text-to-video generation, offering a tunable consistency-separation trade-off and scaling improvement in text alignment as event count rises.

What carries the argument

Event-partitioned masking and cross-event prompt fusion, applied progressively at the probed turning points in the DiT denoising trajectory.

If this is right

  • State-of-the-art performance across eight metrics compared with other training-free multi-event methods.
  • A tunable trade-off between video consistency and event separation.
  • Text-alignment gains that increase with the number of events in the prompt.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The turning-point observation could be tested for consistency across different diffusion transformer architectures beyond the specific models examined.
  • The progressive steering pattern might support generation of videos with substantially longer event sequences than those in the current benchmark.
  • The self-curated multi-event prompt suite could be expanded to include more varied event types or temporal relations to stress-test the method further.

Load-bearing premise

The denoising trajectory contains consistent turning points at which text conditioning predictably shifts from controlling global layout to fine details in a steerable way.

What would settle it

Running the two steering handles at the identified turning points on the Meve benchmark and measuring whether text-alignment scores fail to exceed those of other training-free baselines, especially as event count increases.

Figures

Figures reproduced from arXiv: 2605.31590 by Daniel Cremers, Guangyao Zhai, Guowen Huang, Lei Zhang, Qing Cheng, Ruotong Liao, Thomas Seidl, Volker Tresp, Xun Xiao.

Figure 1
Figure 1. Figure 1: Different denoising steps utilize text inputs differently. TunerDiT finds this insight by probing prompt conditioning of video diffusion models. When switching the input prompt at various fractions of the denoising steps (e.g., 0%, 10%, 20%, 40%, 70%, 100%), it is observed that early steps dominate the global layout while late steps refine fine-grained appearance and motion, revealing intrinsic turning poi… view at source ↗
Figure 2
Figure 2. Figure 2: DiT video diffusion models failed with multi-event prompting. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: MEve and turning-point analysis. (a,c) Dataset structure: category and view distribution statistics. (b) Position of MEve among existing video generation benchmarks. (d) Probing results revealing intrinsic turning points for event control in Video DiTs. events, i.e. 𝜏 = ⌊𝑥𝑁⌋. The intuition behind this is that the step is where the DiTs hesitate to determine which is the dominant event layout. For steps pre… view at source ↗
Figure 4
Figure 4. Figure 4: TunerDiT progressively steers multi-event generation over diffusion steps. Cross-Event Prompt Fusion (PF) first shares a common prompt to build a coherent layout, then gradually separates event prompts. Event-Partitioned Mask (EM) subsequently enforces event isolation via diagonal cross-attention blocks and introduces cross-event transition bands around boundaries, enabling smooth and semantically consiste… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison to various open-source base models (Open-Sora 1.2, Open-Sora 2.0 and Wan 2.2), direct concatenation [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: , 7 illustrate how each steering component in TunerDiT contributes to multi-event generation quality. Overall, we observe a clear tension between text align￾ [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Metric trends of TunerDiT under different module combinations. ment (TA, TIS) and visual consistency (BC, IC, CSCV): improving one group often degrades the other. But TunerDiT manages to achieve a balance. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Performance trends of TunerDiT as the number of events [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 8
Figure 8. Figure 8: Effect of steering position on multi-event generation. The [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Interface of annotating multi-event generation quality. [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visualization of generated frames as the Event Parti [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Investigation on dataset categories. Notion: GE (gen [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Visualization of failure modes in TunerDiT caused by [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
read the original abstract

Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory where conditioning text affects generation from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method that requires no additional training for multi-event generation. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking that enforces event boundaries while allowing cross-event transition bands; (2) Cross-Event Prompt Fusion that injects neighboring event semantics for late-stage refinement. We contribute a self-curated prompt suite for benchmarking multi-event generation, i.e., Meve. TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation, compared with other training-free methods. The improvement in text alignment increases with the event count, indicating a scaling possibility with increasing event count.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that probing DiT denoising trajectories reveals intrinsic turning points where text conditioning shifts from controlling global layout to fine-grained details. Building on this, TunerDiT is a training-free progressive steering method with two handles—Event-Partitioned Masking (enforcing boundaries with transition bands) and Cross-Event Prompt Fusion (injecting neighboring semantics for refinement)—applied at those points. A self-curated Meve benchmark is introduced, and TunerDiT is reported to achieve SOTA results on 8 metrics versus other training-free methods, with a tunable consistency/separation trade-off and scaling improvement in text alignment as event count grows.

Significance. If the results hold, the work provides a simple, training-free approach to the challenging multi-event T2V setting and contributes the Meve benchmark. The training-free nature, explicit tunable trade-off, and reported scaling with event count are concrete strengths. The probing-based insight into DiT trajectories, if substantiated, could inform future steering methods.

major comments (1)
  1. [Probing experiment and §4 (method)] The central justification for the method (abstract and method description) is that the two steering handles must be applied at the probed turning points. No ablation is reported that applies the handles at alternative schedules or timesteps and compares resulting metrics on Meve; without this, it is unclear whether the claimed SOTA performance and scaling depend on the specific turning-point timing or would arise from the handles alone.
minor comments (1)
  1. [Abstract] The abstract asserts SOTA across 8 metrics and scaling behavior but supplies no numerical values, metric definitions, or baseline details; these should be summarized with references to the experimental tables.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the single major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Probing experiment and §4 (method)] The central justification for the method (abstract and method description) is that the two steering handles must be applied at the probed turning points. No ablation is reported that applies the handles at alternative schedules or timesteps and compares resulting metrics on Meve; without this, it is unclear whether the claimed SOTA performance and scaling depend on the specific turning-point timing or would arise from the handles alone.

    Authors: We agree that the manuscript lacks an ablation comparing the steering handles applied at the probed turning points versus alternative schedules or timesteps, with quantitative results on Meve. This omission leaves open whether the reported gains and scaling behavior are specifically due to the timing identified by probing. In the revised manuscript we will add this ablation study, reporting the eight metrics for multiple alternative schedules on the Meve benchmark to clarify the contribution of the turning-point schedule. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with external benchmark

full rationale

The paper describes a training-free steering method whose design is motivated by observational probing of DiT trajectories, with performance measured on the self-curated Meve benchmark. No equations, fitted parameters, or self-citations are shown that reduce the claimed SOTA metrics or scaling behavior to quantities defined by the method itself. The derivation chain consists of an empirical observation followed by a heuristic implementation and external evaluation, satisfying the criteria for a self-contained non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation of turning points in the denoising trajectory; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5750 in / 1170 out tokens · 18177 ms · 2026-06-28T23:10:26.893746+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Ait- tala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Kar- ras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers.arXiv preprint arXiv:2211.01324, 2022. 2, 3

  2. [2]

    Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video gen- eration

    Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, and Xiangyu Yue. Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video gen- eration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7763–7772, 2025. 2, 3, 6

  3. [3]

    Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video gener- ation, 2025

    Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, and Xiangyu Yue. Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video gener- ation, 2025. 1

  4. [4]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320,

  5. [5]

    Training-free layout control with cross-attention guidance

    Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. InProceedings of the IEEE/CVF winter conference on applications of com- puter vision, pages 5343–5353, 2024. 2, 3

  6. [6]

    Gentron: Diffusion transformers for image and video generation, 2024

    Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, and Juan-Manuel Perez-Rua. Gentron: Diffusion transformers for image and video generation, 2024. 1

  7. [7]

    Sora as an agi world model? a complete survey on text-to-video generation.arXiv preprint arXiv:2403.05131, 2024

    Joseph Cho, Fachrina Dewi Puspitasari, Sheng Zheng, Jingyao Zheng, Lik-Hang Lee, Tae-Ho Kim, Choong Seon Hong, and Chaoning Zhang. Sora as an agi world model? a complete survey on text-to-video generation.arXiv preprint arXiv:2403.05131, 2024. 2

  8. [8]

    Gemini 2.5: Pushing the frontier with advanced reason- ing, multimodality, long context, and next generation agentic capabilities, 2025

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, and et al. Gemini 2.5: Pushing the frontier with advanced reason- ing, multimodality, long context, and next generation agentic capabilities, 2025. 7, 1

  9. [9]

    Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025. 1

  10. [10]

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moha...

  11. [11]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. 1, 2

  12. [12]

    Vbench: Compre- hensive benchmark suite for video generative models, 2023

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Compre- hensive benchmark suite for video generative models, 2023. 7, 1

  13. [13]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 3

  14. [14]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models, 2024

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models, 2024. 3, 1

  15. [15]

    Ultralytics yolo11, 2024

    Glenn Jocher and Jing Qiu. Ultralytics yolo11, 2024. 7

  16. [16]

    How Far is Video Generation from World Model: A Physical Law Perspective

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024. 2

  17. [17]

    Shotadapter: Text-to-multi- shot video generation with diffusion models

    Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M Rehg, and Tobias Hinz. Shotadapter: Text-to-multi- shot video generation with diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28405–28415, 2025. 2, 3, 6, 1

  18. [18]

    Fifo-diffusion: Generating infinite videos from text without training.Advances in Neural Information Processing Sys- tems, 37:89834–89868, 2024

    Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training.Advances in Neural Information Processing Sys- tems, 37:89834–89868, 2024. 2

  19. [19]

    Gonzalez, Ion Stoica, Song Han, and Y ao Lu

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.arXiv preprint arXiv:2502.20694, 2025. 2 10

  20. [20]

    Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model.arXiv preprint arXiv:2411.17459, 2024

    Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, and Li Yuan. Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model.arXiv preprint arXiv:2411.17459, 2024. 2, 6

  21. [21]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Min- bin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chi- nese understanding.arXiv preprint arXiv:2405.08748, 2024. 2

  22. [22]

    Videoinsta: Zero-shot long video understanding via informative spatial- temporal reasoning with llms

    Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, and Volker Tresp. Videoinsta: Zero-shot long video understanding via informative spatial- temporal reasoning with llms. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 6577– 6602, 2024. 2

  23. [23]

    Gentkg: Generative forecasting on temporal knowl- edge graph with large language models

    Ruotong Liao, Xu Jia, Yangzhe Li, Yunpu Ma, and Volker Tresp. Gentkg: Generative forecasting on temporal knowl- edge graph with large language models. InFindings of the as- sociation for computational linguistics: NAACL 2024, pages 4303–4317, 2024. 2

  24. [24]

    When and where do events switch in multi-event video generation?, 2025

    Ruotong Liao, Guowen Huang, Qing Cheng, Thomas Seidl, Daniel Cremers, and Volker Tresp. When and where do events switch in multi-event video generation?, 2025. 2, 9

  25. [25]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Li- uhan Chen, et al. Open-sora plan: Open-source large video generation model.arXiv preprint arXiv:2412.00131, 2024. 2, 6

  26. [26]

    Worldweaver: Generating long- horizon video worlds via rich perception.arXiv preprint arXiv:2508.15720, 2025

    Zhiheng Liu, Xueqing Deng, Shoufa Chen, Angtian Wang, Qiushan Guo, Mingfei Han, Zeyue Xue, Mengzhao Chen, Ping Luo, and Linjie Yang. Worldweaver: Generating long- horizon video worlds via rich perception.arXiv preprint arXiv:2508.15720, 2025. 2

  27. [27]

    The shape variational autoencoder: A deep generative model of part- segmented 3d objects

    Charlie Nash and Christopher KI Williams. The shape variational autoencoder: A deep generative model of part- segmented 3d objects. InComputer Graphics Forum, pages 1–12. Wiley Online Library, 2017. 2

  28. [28]

    Mevg: Multi-event video generation with text-to-video models

    Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, and Sangpil Kim. Mevg: Multi-event video generation with text-to-video models. In European Conference on Computer Vision, pages 401–418. Springer, 2024. 2, 3, 6

  29. [29]

    Dinov2: Learning robust visual features without supervision, 2024

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Je- gou, Julien Mairal, P...

  30. [30]

    Scalable diffusion models with transformers, 2023

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. 1, 2

  31. [31]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih- Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720,

  32. [32]

    Maskˆ 2dit: Dual mask-based diffusion transformer for multi-scene long video generation

    Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, and Yongdong Zhang. Maskˆ 2dit: Dual mask-based diffusion transformer for multi-scene long video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18837–18846, 2025. 3

  33. [33]

    Mask2dit: Dual mask-based diffusion transformer for multi-scene long video generation, 2025

    Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, and Yongdong Zhang. Mask2dit: Dual mask-based diffusion transformer for multi-scene long video generation, 2025. 2, 3

  34. [34]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling, 2024

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xin- tao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling, 2024. 2, 6

  35. [35]

    Learning transferable visual models from natural language supervision, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 7

  36. [36]

    Sam 2: Segment anything in images and videos, 2024

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Doll ´ar, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. 7

  37. [37]

    Sentence-bert: Sentence embeddings using siamese bert-networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural lan- guage processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019. 7

  38. [38]

    arXiv preprint arXiv:2411.02265 (2024) 31

    Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, et al. Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent.arXiv preprint arXiv:2411.02265, 2024. 2

  39. [39]

    Mochi 1.https://github.com/ genmoai/models, 2024

    Genmo Team. Mochi 1.https://github.com/ genmoai/models, 2024. 2

  40. [40]

    The tensor brain: A unified theory of perception, memory, and semantic decoding.Neu- ral Computation, 35(2):156–227, 2023

    Volker Tresp, Sahand Sharifzadeh, Hang Li, Dario Konopatzki, and Yunpu Ma. The tensor brain: A unified theory of perception, memory, and semantic decoding.Neu- ral Computation, 35(2):156–227, 2023. 1

  41. [41]

    Wan: Open and advanced large-scale video gener- ative models

    Wan-AI. Wan: Open and advanced large-scale video gener- ative models. 2025. 6

  42. [42]

    Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023

    Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising.arXiv preprint arXiv:2305.18264, 2023. 2

  43. [43]

    Gen-l-video: Multi-text to long video generation via temporal co-denoising, 2023

    Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising, 2023. 2

  44. [44]

    Internvid: A large-scale video-text dataset for multimodal understanding and generation, 2024

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation, 2024. 7 11

  45. [45]

    Mind the time: Temporally-controlled multi-event video generation

    Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Sko- rokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, and Sergey Tulyakov. Mind the time: Temporally-controlled multi-event video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23989– 24000, 2025. 2, 6

  46. [46]

    A survey on video diffusion models, 2024

    Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models, 2024. 1

  47. [47]

    Dynamic prompt learning: Addressing cross- attention leakage for text-based image editing.Advances in Neural Information Processing Systems, 36:26291–26303,

    Fei Yang, Shiqi Yang, Muhammad Atif Butt, Joost van de Weijer, et al. Dynamic prompt learning: Addressing cross- attention leakage for text-based image editing.Advances in Neural Information Processing Systems, 36:26291–26303,

  48. [48]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 1, 2

  49. [49]

    Instadrive: Instance-aware driving world models for realistic and consistent video generation

    Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu, and Yanyong Zhang. Instadrive: Instance-aware driving world models for realistic and consistent video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25410–25420, 2025. 2

  50. [50]

    Evaluation agent: Efficient and promptable evalua- tion framework for visual generative models.arXiv preprint arXiv:2412.09645, 2024

    Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, and Ziwei Liu. Evaluation agent: Efficient and promptable evalua- tion framework for visual generative models.arXiv preprint arXiv:2412.09645, 2024. 3, 1

  51. [51]

    Magiccomp: Training-free dual-phase refinement for compositional video generation.arXiv preprint arXiv:2503.14428, 2025

    Hongyu Zhang, Yufan Deng, Shenghai Yuan, Peng Jin, Zesen Cheng, Yian Zhao, Chang Liu, and Jie Chen. Magiccomp: Training-free dual-phase refinement for compositional video generation.arXiv preprint arXiv:2503.14428, 2025. 3

  52. [52]

    Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness,

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness,

  53. [53]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei- Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025. 3, 1 12 TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Consistent Multi-Event Video ...

  54. [54]

    Figure 10

    Overall Preference:Please rate your overall preference for the video. Figure 10. Interface of annotating multi-event generation quality

  55. [55]

    Motion Naturalness:How natural and realistic is the motion in the video?

  56. [56]

    Transition Smoothness:How smoothly does the video content transition across different frames?

  57. [57]

    TunerDiT achieves superior Text-Video Alignment scores compared to the base- line models, including open-source base models and zero- shot methods MEVG, DiTCtrl and FreeNoise

    Text-Video Alignment:To what extent does the video content match the given text descriptions? Evaluation ResultsThe quantitative results, derived from evaluations by 14 participants with mimum degree in bachlors, are presented in Table 1b. TunerDiT achieves superior Text-Video Alignment scores compared to the base- line models, including open-source base ...