
arxiv: 2605.14988 · v1 · pith:BPM5DLXPnew · submitted 2026-05-14 · 💻 cs.CV

Compositional Video Generation via Inference-Time Guidance

Pith reviewed 2026-06-30 21:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords compositional generation · text-to-video diffusion · inference-time guidance · cross-attention maps · prompt faithfulness · denoising trajectory · frozen generator

The pith

A classifier trained on cross-attention maps steers frozen text-to-video diffusion models toward accurate compositions at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that text-to-video diffusion models often fail on prompts involving relations between objects, attributes, and motions, but these failures can be reduced without retraining the generator or adding user controls. Instead, cross-attention maps already carry information about how prompt concepts are placed in space and time. A lightweight classifier is trained on those maps, and its gradients are applied during the first denoising steps to nudge the latent code toward the intended composition. Experiments on compositional benchmarks show higher prompt faithfulness while the base model's visual quality remains unchanged. The approach transfers across related composition types because it relies on a frozen vision-language backbone rather than category-specific features.

Core claim

CVG is an inference-time method that trains a lightweight compositional classifier on the cross-attention features extracted from a frozen text-to-video diffusion model and then uses the classifier's gradients to steer the early denoising trajectory; this yields improved faithfulness on prompts that require fine-grained relations, attributes, actions, and motion directions without any architecture change, fine-tuning, or external layout inputs.

What carries the argument

CVG, the inference-time guidance procedure that extracts cross-attention maps, trains a compositional classifier on them, and injects the classifier gradients into the latent trajectory during early denoising steps.
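
As a rough illustration of the procedure described above, the following toy sketch applies classifier gradients to a latent during the first few denoising steps only. Everything in it is an assumption made for illustration: `features` is a hypothetical stand-in for cross-attention extraction (here the identity map), the classifier is a plain softmax-linear model rather than a VLM-backed head, and `toy_denoise_step`, `guided_steps`, and `step_size` are invented names. It is not the paper's implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def features(latent):
    # Hypothetical stand-in for cross-attention extraction: the identity map.
    return list(latent)

def classifier_logits(feats, weights):
    # Softmax-linear classifier: one weight row per composition label.
    return [sum(w * f for w, f in zip(row, feats)) for row in weights]

def guidance_grad(latent, weights, target):
    # Gradient of cross-entropy(softmax(W z), target) with respect to z,
    # which is W^T (p - onehot(target)) when the features are the identity.
    probs = softmax(classifier_logits(features(latent), weights))
    err = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    return [sum(weights[k][d] * err[k] for k in range(len(weights)))
            for d in range(len(latent))]

def toy_denoise_step(latent):
    # Placeholder for the frozen generator's update: shrink toward the origin.
    return [0.9 * v for v in latent]

def sample(latent, weights, target, num_steps=20, guided_steps=5, step_size=0.5):
    z = list(latent)
    for t in range(num_steps):
        if t < guided_steps:
            # Early steps only: nudge the latent down the classifier loss.
            g = guidance_grad(z, weights, target)
            z = [v - step_size * gv for v, gv in zip(z, g)]
        z = toy_denoise_step(z)
    return z

def target_prob(latent, weights, target):
    return softmax(classifier_logits(features(latent), weights))[target]

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    start = [1.0, -1.0]  # initially favours label 0
    for k in (0, 5):
        final = sample(start, W, target=1, guided_steps=k)
        print(k, round(target_prob(final, W, 1), 3))
```

Setting `guided_steps=0` recovers the unguided toy sampler, so the two runs differ only in the early-step nudges. In a real generator the gradient would instead come from backpropagating through the attention layers.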

If this is right

  • Compositional accuracy rises on benchmarks that test relations, attributes, actions, and motion directions.
  • Visual quality metrics of the underlying generator stay essentially unchanged.
  • No model retraining, architecture edits, or user-provided boxes or layouts are required.
  • The same classifier transfers across semantically related composition labels via its VLM backbone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention-based steering could be tested on text-to-image models to check whether the benefit is video-specific.
  • If the classifier is kept fixed, the method might allow rapid adaptation to new composition vocabularies by swapping only the label set.
  • Early-step guidance might be combined with later-step quality-preserving samplers to further separate faithfulness from fidelity.

Load-bearing premise

Cross-attention maps already encode how prompt concepts are grounded across space and time so that gradients from a classifier trained on those features can steer the latent trajectory toward the desired composition during early denoising steps.
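
One way to make this premise concrete is the sketch below, written in notation assumed here rather than taken from the paper. $z_t$ is the latent at denoising step $t$, $A(z_t, c)$ the subject-token cross-attention maps for prompt $c$, $p_\phi$ the classifier, $y^\ast$ the target composition label, $\eta$ a step size, and $T_g$ the number of guided early steps.

```latex
\[
\mathcal{L}(z_t) = -\log p_\phi\bigl(y^\ast \mid A(z_t, c)\bigr),
\qquad
z_t \leftarrow z_t - \eta \, \nabla_{z_t} \mathcal{L}(z_t)
\quad \text{for the first } T_g \text{ denoising steps.}
\]
\[
\nabla_{z_t} \mathcal{L}(z_t)
= \left( \frac{\partial A(z_t, c)}{\partial z_t} \right)^{\!\top}
  \nabla_{A} \mathcal{L}
\]
```

The chain-rule form shows where the premise bears weight. If the attention maps carry little compositional signal, or if $\partial A / \partial z_t$ is poorly conditioned, the resulting latent update is weak or noisy however accurate the classifier is.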

What would settle it

Running the method on a held-out compositional text-to-video benchmark and finding no statistically significant rise in human or automated prompt-faithfulness scores relative to the unmodified base model would falsify the central claim.
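
A check of this kind could be run as a paired test on per-prompt faithfulness scores. The sketch below uses a one-sided sign-flip permutation test, which is a choice made here for illustration and not a procedure the paper describes; the function name and its parameters are hypothetical.

```python
import random

def paired_sign_flip_test(base_scores, guided_scores, num_resamples=10000, seed=0):
    """One-sided paired sign-flip permutation test for mean(guided - base) > 0.

    Returns the observed mean difference and an estimated p-value.
    """
    if len(base_scores) != len(guided_scores):
        raise ValueError("score lists must be paired per prompt")
    diffs = [g - b for b, g in zip(base_scores, guided_scores)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(num_resamples):
        # Under the null, each per-prompt difference is equally likely
        # to carry either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_large + 1) / (num_resamples + 1)
    return observed, p_value
```

Both score lists must come from the same prompts in the same order, ideally with matched seeds. A large p-value on a held-out benchmark would be the "no significant rise" outcome described above.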

Figures

Figures reproduced from arXiv: 2605.14988 by Amit Edenzon, Ariel Shaulov, Eitan Shaar, Gal Chechik, Lior Wolf.

Figure 1: Text-to-video results before and after applying CVG on Wan2.2-14B (58) and CogVideoX-5B (68).
Figure 2: Overview of CVG. At inference time, CVG extracts subject-token cross-attention maps from a frozen text-to-video diffusion model and feeds them to a lightweight composition classifier. The classifier predicts the current compositional relation, and its loss with respect to the target compositional relation is backpropagated to update the latent. This provides composition-aware guidance without fine-tuning t…
Figure 3: Training compositional classifier. Given a real video and its prompt, we invert the video into the latent space of the frozen text-to-video model and extract subject-token cross-attention maps. These maps are processed by an aggregation module, a frozen VLM, and a trainable classification head to predict the target compositional relation label using cross-entropy supervision.
Figure 4: Qualitative results. Text-to-video results before and after applying CVG on Wan2.2-14B (58) and CogVideoX-5B (22). CVG strictly adheres to compositional instructions.
  4.3 User Study. We further conduct a human preference study to evaluate whether the quantitative gains of CVG translate into perceptual improvements in generated videos. We randomly select 100 prompts from T2V-CompBench and generate videos usi…
Figure 5: User study results. Human preference comparison between CVG and TTOM. Participants compare generated videos along text alignment, motion, quality, and compositionality, and select whether CVG, TTOM, or both are better.
  The same pattern holds in the long-video setting with Rolling Forcing. CVG is preferred for Compositionality in 40.7% of comparisons, compared to 17.2% for TTOM. It is also preferred for Mot…
Figure 6: Additional Qualitative Results comparison between CVG and Wan2.2-14B (right), and between CVG and…
Abstract

Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CVG, an inference-time guidance technique for improving compositional faithfulness (relations, attributes, actions, motion directions) in frozen text-to-video diffusion models. It trains a lightweight classifier on cross-attention features extracted from the generator and back-propagates its gradients during early denoising steps to steer the latent trajectory, without architecture changes, fine-tuning, or user-provided layouts.

Significance. If the central mechanism holds, the result would be significant: it demonstrates that internal cross-attention signals can be leveraged for compositional steering at inference time, offering a practical alternative to retraining or auxiliary control signals while preserving generator quality. The transferability of the classifier across related composition labels is a notable design choice.

major comments (2)
  1. [§3] §3 (method description): The claim that cross-attention maps already encode grounding 'across space and time' for dynamic elements (actions, motion directions) is load-bearing for the guidance signal. Standard video diffusion attention is often spatially dominant with temporal dynamics handled in separate layers; without explicit evidence that the selected maps linearly separate correct vs. incorrect compositions for relational/motion cases, the classifier gradients may be weak or noisy.
  2. [Experiments] Experiments section: The reported improvements in prompt faithfulness on compositional benchmarks must be supported by quantitative metrics (e.g., composition accuracy scores, human evaluations) and ablations isolating the contribution of the attention-based classifier versus generic guidance; the abstract alone does not establish that the steering produces the claimed gains without quality degradation.
minor comments (2)
  1. [§3.1] Clarify the exact attention layers and feature extraction procedure used for the classifier input, including any temporal aggregation.
  2. [§3.2] The transfer claim for the classifier across semantically related labels would benefit from a brief statement of the label set and training protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. Below we address each major comment point by point, indicating planned revisions where appropriate.

  1. Referee: [§3] §3 (method description): The claim that cross-attention maps already encode grounding 'across space and time' for dynamic elements (actions, motion directions) is load-bearing for the guidance signal. Standard video diffusion attention is often spatially dominant with temporal dynamics handled in separate layers; without explicit evidence that the selected maps linearly separate correct vs. incorrect compositions for relational/motion cases, the classifier gradients may be weak or noisy.

    Authors: We agree that stronger direct evidence for linear separability of correct versus incorrect compositions in the selected cross-attention features would strengthen the justification for the guidance signal. The current manuscript supports the claim via the classifier's training accuracy on held-out relational and motion compositions together with qualitative attention visualizations; however, these do not constitute an explicit linear-probe analysis. We will add a short subsection with linear probing results and feature-separability metrics on the attention maps for the dynamic cases. revision: partial

  2. Referee: [Experiments] Experiments section: The reported improvements in prompt faithfulness on compositional benchmarks must be supported by quantitative metrics (e.g., composition accuracy scores, human evaluations) and ablations isolating the contribution of the attention-based classifier versus generic guidance; the abstract alone does not establish that the steering produces the claimed gains without quality degradation.

    Authors: Section 4 of the manuscript already reports quantitative composition accuracy on the benchmarks, human preference studies, and ablations that compare the attention-based classifier against generic guidance baselines while tracking quality metrics (FID, CLIP similarity). These results show gains in faithfulness with no measurable quality drop. We will add a brief summary table in the main text that directly juxtaposes the classifier-guided results against the generic-guidance controls to make the isolation clearer. revision: partial
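
The linear-probe analysis promised in the first response could look roughly like the sketch below. It fits a logistic-regression probe on feature vectors and reports held-out accuracy. The features are synthetic stand-ins generated by a hypothetical `synthetic_features` helper, not attention maps from any model, and the training settings are arbitrary illustrative choices.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def train_linear_probe(features, labels, epochs=200, lr=0.1):
    # Full-batch gradient descent on the mean logistic loss.
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, features, labels):
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)

def synthetic_features(n, dim, shift, seed):
    # Synthetic stand-in for pooled attention features (an assumption):
    # label 1 is shifted by +shift and label 0 by -shift along every axis.
    rng = random.Random(seed)
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        feats.append([rng.gauss(shift if y else -shift, 1.0) for _ in range(dim)])
        labels.append(y)
    return feats, labels

if __name__ == "__main__":
    train_x, train_y = synthetic_features(200, 4, 1.0, seed=0)
    test_x, test_y = synthetic_features(200, 4, 1.0, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print("held-out probe accuracy:", probe_accuracy(w, b, test_x, test_y))
```

With real features, held-out accuracy well above chance on relational and motion labels would support the referee's separability condition, and near-chance accuracy would suggest the guidance gradients carry little signal for those cases.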

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper's central method trains an external lightweight classifier on frozen cross-attention features extracted from the generator and then applies its gradients for inference-time steering. This classifier training step is independent of the target compositional faithfulness metric and does not reduce to a fitted parameter or self-citation chain that defines the result by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation. The approach is therefore a standard external-guidance technique whose validity rests on empirical validation rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on the unstated assumption that attention maps contain sufficient compositional signal and that the classifier generalizes across labels.



Reference graph

Works this paper leans on

76 extracted references · 31 canonical work pages · 12 internal anchors

  1. Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., Goldstein, T.: Universal guidance for diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 843–852. IEEE/CVF, Vancouver, Canada (2023)

  2. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths for controlled image generation. In: Proceedings of the 40th International Conference on Machine Learning. pp. 1737–1752. PMLR, Honolulu, Hawaii, USA (2023)

  3. Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning (2024), https://arxiv.org/abs/2305.13301

  4. Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graph. 42(4) (Jul 2023). https://doi.org/10.1145/3592116

  5. Clark, K., Vicol, P., Swersky, K., Fleet, D.J.: Directly fine-tuning diffusion models on differentiable rewards (2024), https://arxiv.org/abs/2309.17400

  6. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis (2021), https://arxiv.org/abs/2105.05233

  7. Du, Y., Li, S., Mordatch, I.: Compositional visual generation and inference with energy based models (2020), https://arxiv.org/abs/2004.06030

  8. Fan, Z., Wang, Z., Zhang, W.: Taocache: Structure-maintained video generation acceleration (2025), https://arxiv.org/abs/2508.08978

  9. Fang, X., Ma, L., Chen, Z., Zhou, M., Qi, G.J.: Inflvg: Reinforce inference-time consistent long video generation with grpo (2025)

  10. Feng, W., He, X., Fu, T.J., Jampani, V., Akula, A.R., Narayana, P., Basu, S., Wang, X.E., Wang, W.Y.: Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032 (2023)

  11. Geng, D., Herrmann, C., Hur, J., Cole, F., Zhang, S., Pfaff, T., Lopez-Guevara, T., Aytar, Y., Rubinstein, M., Sun, C., Wang, O., Owens, A., Sun, D.: Motion prompting: Controlling video generation with motion trajectories. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1–12. IEEE/CVF, Nashville, Tennessee, USA (2025)

  12. Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning (2023)

  13. Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: Proceedings of the European Conference on Computer Vision. pp. 393–411. Springer, Cham (2024)

  14. He, H., Liang, J., Wang, X., Wan, P., Zhang, D., Gai, K., Pan, L.: Scaling image and video generation via test-time evolutionary search (2025)

  15. He, W., Liu, M., Yu, Y., Wang, Z., Wu, C.: Dyst-xl: Dynamic layout planning and content control for compositional text-to-video generation (2025), https://arxiv.org/abs/2504.15032

  16. He, Y., Salakhutdinov, R., Kolter, J.Z.: Localized text-to-image generation for free via cross attention control (2023), https://arxiv.org/abs/2306.14636

  17. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control (2022), https://arxiv.org/abs/2208.01626

  18. Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models (2022)

  19. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models (2020), https://arxiv.org/abs/2006.11239

  20. Ho, J., Salimans, T.: Classifier-free diffusion guidance (2022), https://arxiv.org/abs/2207.12598

  21. Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in Neural Information Processing Systems 35, 8633–8646 (2022)

  22. Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

  23. Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion (2025)

  24. Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818. IEEE, Seattle, WA, USA (2024)

  25. Huang, Z., Yu, N., Chen, G., Qiu, H., Debevec, P., Liu, Z.: Vchain: Chain-of-visual-thought for reasoning in video generation (2025), https://arxiv.org/abs/2510.05094

  26. Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., Wang, Y., Chen, X., Chen, Y.C., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503 (2024)

  27. Jin, Y., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y., Mu, Y., Lin, Z.: Pyramidal flow matching for efficient video generative modeling (2024)

  28. Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation (2023)

  29. Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22511–22521. IEEE/CVF, Vancouver, Canada (2023)

  30. Lin, B., Ge, Y., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y., Yuan, S., Chen, L., et al.: Open-sora plan: Open-source large video generation model (2024)

  31. Liu, F., Wang, H., Cai, Y., Zhang, K., Zhan, X., Duan, Y.: Video-t1: Test-time scaling for video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18671–18681. IEEE/CVF, Honolulu, Hawaii, USA (2025)

  32. Liu, F., Zhang, S., Wang, X., Wei, Y., Qiu, H., Zhao, Y., Zhang, Y., Ye, Q., Wan, F.: Timestep embedding tells: It’s time to cache for video diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7353–7363. IEEE/CVF, Nashville, Tennessee, USA (2025)

  33. Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time (2025)

  34. Liu, N., Li, S., Du, Y., Torralba, A., Tenenbaum, J.B.: Compositional visual generation with composable diffusion models. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 423–439. Springer Nature Switzerland, Cham (2022)

  35. Liu, S., Wang, B., Ma, Y., Yang, T., Cao, X., Chen, Q., Li, H., Dong, D., Jiang, P.: Training-free subject-enhanced attention guidance for compositional text-to-image generation. arXiv preprint arXiv:2405.06948 (2024)

  36. Lu, Y., Liang, Y., Zhu, L., Yang, Y.: Freelong: Training-free long video generation with spectralblend temporal attention. In: Advances in Neural Information Processing Systems. vol. 37, pp. 131434–131455. Curran Associates, Inc., Red Hook, NY, USA (2024). https://doi.org/10.52202/079017-4177, https://proceedings.neurips.cc/paper_files/paper/2024/fil...

  37. Ma, N., Tong, S., Jia, H., Hu, H., Su, Y.C., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., Xie, S.: Inference-time scaling for diffusion models beyond scaling denoising steps (2025), https://arxiv.org/abs/2501.09732

  38. Oh, G., Jeong, J., Kim, S., Byeon, W., Kim, J., Kim, S., Kim, S.: Mevg: Multi-event video generation with text-to-video models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 401–418. Springer Nature Switzerland, Cham (2025)

  39. Oshima, Y., Suzuki, M., Matsuo, Y., Furuta, H.: Inference-time text-to-video alignment with diffusion latent beam search (2025)

  40. Papalampidi, P., Wiles, O., Ktena, I., Shtedritski, A., Bugliarello, E., Kajic, I., Albuquerque, I., Nematzadeh, A.: Dynamic classifier-free diffusion guidance via online feedback (2025), https://arxiv.org/abs/2509.16131

  41. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion (2022), https://arxiv.org/abs/2209.14988

  42. Prabhudesai, M., Mendonca, R., Qin, Z., Fragkiadaki, K., Pathak, D.: Video diffusion alignment via reward gradients (2024), https://arxiv.org/abs/2407.08737

  43. Qu, L., Wang, Z., Zheng, N., Wang, W., Nie, L., Chua, T.S.: TTOM: Test-time optimization and memorization for compositional video generation (2025)

  44. Qwen Team: Qwen2.5-vl. https://huggingface.co/docs/transformers/model_doc/qwen2_5_vl (2025), Hugging Face Transformers documentation

  45. Ren, W., Yang, H., Zhang, G., Wei, C., Du, X., Huang, W., Chen, W.: Consisti2v: Enhancing visual consistency for image-to-video generation (2024), https://arxiv.org/abs/2402.04324

  46. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2022), https://arxiv.org/abs/2112.10752

  47. Shang, X., Di, D., Xiao, J., Cao, Y., Yang, X., Chua, T.S.: Annotating objects and relations in user-generated videos. In: Proceedings of the 2019 on International Conference on Multimedia Retrieval. pp. 279–287. ICMR ’19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3323873.3325056, https://doi.org/10.1145/33238...

  48. Shang, X., Di, D., Xiao, J., Cao, Y., Yang, X., Chua, T.S.: Annotating objects and relations in user-generated videos. In: Proceedings of the International Conference on Multimedia Retrieval. pp. 279–287. ACM, Ottawa, Ontario, Canada (2019)

  49. Shang, X., Ren, T., Guo, J., Zhang, H., Chua, T.S.: Video visual relation detection. In: Proceedings of the 25th ACM International Conference on Multimedia. pp. 1300–1308. MM ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3123266.3123380

  50. Shang, X., Ren, T., Guo, J., Zhang, H., Chua, T.S.: Video visual relation detection. In: Proceedings of the 25th ACM International Conference on Multimedia. pp. 1300–1308. ACM, Mountain View, CA, USA (2017)

  51. Shaulov, A., Hazan, I., Wolf, L., Chefer, H.: FlowMo: Variance-based flow guidance for coherent motion in video generation (2025)

  52. Shaulov, A., Shaar, E., Edenzon, A., Wolf, L.: TokenTrim: Inference-time token pruning for autoregressive long video generation (2026)

  53. Shi, Y., Xue, C., Liew, J.H., Pan, J., Yan, H., Zhang, W., Tan, V.Y.F., Bai, S.: Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8839–8849. IEEE/CVF, Seattle, Washington, USA (2024)

  54. Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data (2022)

  55. Sun, K., Huang, K., Liu, X., Wu, Y., Xu, Z., Li, Z., Liu, X.: T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8406–8416. IEEE/CVF, Nashville, Tennessee, USA (2025)

  56. Tian, Y., Yang, L., Yang, H., Gao, Y., Deng, Y., Chen, J., Wang, X., Yu, Z., Tao, X., Wan, P., Zhang, D., Cui, B.: Videotetris: Towards compositional text-to-video generation. In: Advances in Neural Information Processing Systems. vol. 37, pp. 29489–29513. Curran Associates, Inc., Red Hook, NY, USA (2024). https://doi.org/10.52202/079017-0928, https...

  57. Wallace, B., Gokul, A., Ermon, S., Naik, N.: End-to-end diffusion latent optimization improves classifier guidance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7280–7290. IEEE/CVF, Paris, France (2023)

  58. Wan Team, Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models (2025)

  59. Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  60. Wang, J., Zhang, Y., Zou, J., Zeng, Y., Wei, G., Yuan, L., Li, H.: Boximator: Generating rich and controllable motions for video synthesis (2024), https://arxiv.org/abs/2402.01566

  61. Wang, W., Chen, Y., Liu, Y., Yuan, Q., Yang, S., Zhang, Y.: Mvoc: a training-free multiple video object composition method with diffusion models (2024), https://arxiv.org/abs/2406.15829

  62. Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J.: Videocomposer: Compositional video synthesis with motion controllability. In: Advances in Neural Information Processing Systems. vol. 36, pp. 7594–7611. Curran Associates, Inc., Red Hook, NY, USA (2023), https://proceedings.neurips.cc/paper_files/paper/2023/file...

  63. Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision 133(5), 3059–3078 (2025)

  64. Wang, Y., Xiong, T., Zhou, D., Lin, Z., Zhao, Y., Kang, B., Feng, J., Liu, X.: Loong: Generating minute-level long videos with autoregressive language models (2024)

  65. Wu, T., Si, C., Jiang, Y., Huang, Z., Liu, Z.: Freeinit: Bridging initialization gap in video diffusion models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 378–394. Springer Nature Switzerland, Cham (2025)

  66. Wu, Z., Siarohin, A., Menapace, W., Skorokhodov, I., Fang, Y., Chordia, V., Gilitschenski, I., Tulyakov, S.: Mind the time: Temporally-controlled multi-event video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23989–24000. IEEE/CVF, Nashville, Tennessee, USA (2025)

  67. Yang, X., Wang, X.: Compositional video generation as flow equalization (2024), https://arxiv.org/abs/2407.06182

  68. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer (2024)

  69. Yesiltepe, H., Yanardag, P.: Dynamic view synthesis as an inverse problem. arXiv preprint arXiv:2506.08004 (2025)

  70. Yiflach, S.E., Atzmon, Y., Chechik, G.: Data-driven loss functions for inference-time optimization in text-to-image generation (2025)

  71. Yin, S., Wu, C., Liang, J., Shi, J., Li, H., Ming, G., Duan, N.: Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory (2023), https://arxiv.org/abs/2308.08089

  72. Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22963–22974. IEEE/CVF, Nashville, Tennessee, USA (2025)

  73. Yu, J., Wang, Y., Zhao, C., Ghanem, B., Zhang, J.: Freedom: Training-free energy-guided conditional diffusion model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23174–23184. IEEE/CVF, Paris, France (2023)

  74. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847. IEEE/CVF, Paris, France (2023)

  75. Zheng, G., Zhou, X., Li, X., Qi, Z., Shan, Y., Li, X.: Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22490–22499. IEEE/CVF, Vancouver, Canada (2023)

  76. Zhuo, L., Zhao, L., Paul, S., Liao, Y., Zhang, R., Xin, Y., Gao, P., Elhoseiny, M., Li, H.: From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15329–15339. IEEE/CVF, Honolulu, Hawaii, USA (2025)