Compositional Video Generation via Inference-Time Guidance

Amit Edenzon; Ariel Shaulov; Eitan Shaar; Gal Chechik; Lior Wolf

arxiv: 2605.14988 · v1 · pith:BPM5DLXPnew · submitted 2026-05-14 · 💻 cs.CV

Pith reviewed 2026-06-30 21:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords compositional generation · text-to-video diffusion · inference-time guidance · cross-attention maps · prompt faithfulness · denoising trajectory · frozen generator

The pith

A classifier trained on cross-attention maps steers frozen text-to-video diffusion models toward accurate compositions at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that text-to-video diffusion models often fail on prompts involving relations between objects, attributes, and motions, but these failures can be reduced without retraining the generator or adding user controls. Instead, cross-attention maps already carry information about how prompt concepts are placed in space and time. A lightweight classifier is trained on those maps, and its gradients are applied during the first denoising steps to nudge the latent code toward the intended composition. Experiments on compositional benchmarks show higher prompt faithfulness while the base model's visual quality remains unchanged. The approach transfers across related composition types because it relies on a frozen vision-language backbone rather than category-specific features.

Core claim

CVG is an inference-time method that trains a lightweight compositional classifier on the cross-attention features extracted from a frozen text-to-video diffusion model and then uses the classifier's gradients to steer the early denoising trajectory; this yields improved faithfulness on prompts that require fine-grained relations, attributes, actions, and motion directions without any architecture change, fine-tuning, or external layout inputs.

What carries the argument

CVG, the inference-time guidance procedure that extracts cross-attention maps, trains a compositional classifier on them, and injects the classifier gradients into the latent trajectory during early denoising steps.
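The procedure above can be sketched as a toy guidance loop. This is a minimal illustration under loud assumptions, not the authors' implementation: the "attention features" are taken to be the latent itself, the frozen generator is replaced by a hypothetical decay step (`toy_denoise_step`), and the classifier is a two-label linear softmax head with made-up weights. In the real method the gradient would flow through the generator's cross-attention layers.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss_and_grad(weights, features, target):
    """Cross-entropy loss for `target` and its gradient w.r.t. the features."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # d(loss)/d(logit_k) = p_k - 1[k == target]; chain rule through the linear head.
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    grad = [sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(features))]
    return loss, grad

def toy_denoise_step(latent):
    # Hypothetical stand-in for one denoising step of the frozen generator.
    return [0.95 * v for v in latent]

def guided_sampling(latent, weights, target, num_steps=10, guided_steps=4, step_size=0.5):
    """Apply classifier-gradient updates only during the first `guided_steps` steps."""
    losses = []
    for t in range(num_steps):
        if t < guided_steps:
            # Toy assumption: the attention features are the latent itself.
            loss, grad = ce_loss_and_grad(weights, latent, target)
            losses.append(loss)
            latent = [v - step_size * g for v, g in zip(latent, grad)]
        latent = toy_denoise_step(latent)
    return latent, losses

# Made-up head: label 0 = "left of", label 1 = "right of".
WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]
final_latent, guided_losses = guided_sampling([1.0, 0.0], WEIGHTS, target=1)
```

Restricting the update to the first few steps mirrors the paper's stated choice of steering only the early denoising trajectory; the values of `guided_steps` and `step_size` here are arbitrary.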

If this is right

Compositional accuracy rises on benchmarks that test relations, attributes, actions, and motion directions.
Visual quality metrics of the underlying generator stay essentially unchanged.
No model retraining, architecture edits, or user-provided boxes or layouts are required.
The same classifier transfers across semantically related composition labels via its VLM backbone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same attention-based steering could be tested on text-to-image models to check whether the benefit is video-specific.
If the classifier is kept fixed, the method might allow rapid adaptation to new composition vocabularies by swapping only the label set.
Early-step guidance might be combined with later-step quality-preserving samplers to further separate faithfulness from fidelity.

Load-bearing premise

Cross-attention maps already encode how prompt concepts are grounded across space and time so that gradients from a classifier trained on those features can steer the latent trajectory toward the desired composition during early denoising steps.
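One way to write the premise down as an update rule is given here. The notation is an assumption introduced for illustration and does not come from the paper: $z_t$ is the latent at denoising step $t$, $c$ the prompt, $A(z_t, c)$ the subject-token cross-attention maps read from the frozen generator, $f_\phi$ the lightweight classifier, $y^\star$ the target composition label, $\eta$ a step size, and $\mathcal{T}_{\text{early}}$ the set of early steps where guidance is active.

```latex
z_t \;\leftarrow\; z_t \;-\; \eta \, \nabla_{z_t}\,
  \mathcal{L}_{\mathrm{CE}}\!\big( f_\phi\big(A(z_t, c)\big),\; y^\star \big),
\qquad t \in \mathcal{T}_{\text{early}}
```

The premise holds only if $\nabla_{z_t}$ of this loss carries usable signal, which requires $A(z_t, c)$ to vary with the composition actually being formed in the latent.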

What would settle it

Running the method on a held-out compositional text-to-video benchmark and finding no statistically significant rise in human or automated prompt-faithfulness scores relative to the unmodified base model would falsify the central claim.
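A paired sign-flip permutation test is one plausible way to run the significance check described above. The sketch below is illustrative only: the per-prompt scores are invented, the function name is hypothetical, and the paper does not say which test, if any, it uses.

```python
import random

def paired_permutation_pvalue(baseline, guided, num_resamples=10000, seed=0):
    """One-sided p-value for mean(guided - baseline) > 0 via random sign flips."""
    if len(baseline) != len(guided):
        raise ValueError("scores must be paired per prompt")
    diffs = [g - b for b, g in zip(baseline, guided)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(num_resamples):
        # Under the null, each per-prompt difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_large + 1) / (num_resamples + 1)

# Made-up per-prompt faithfulness scores, for illustration only.
baseline_scores = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50, 0.44, 0.58, 0.40, 0.53]
guided_scores = [0.51, 0.60, 0.45, 0.63, 0.55, 0.57, 0.49, 0.66, 0.47, 0.59]
p_value = paired_permutation_pvalue(baseline_scores, guided_scores)
```

Pairing by prompt matters because compositional prompts differ widely in difficulty; an unpaired comparison would mix that variance into the test.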

Figures

Figures reproduced from arXiv: 2605.14988 by Amit Edenzon, Ariel Shaulov, Eitan Shaar, Gal Chechik, Lior Wolf.

**Figure 2.** Overview of CVG. At inference time, CVG extracts subject-token cross-attention maps from a frozen text-to-video diffusion model and feeds them to a lightweight composition classifier. The classifier predicts the current compositional relation, and its loss with respect to the target compositional relation is backpropagated to update the latent. This provides composition-aware guidance without fine-tuning t…

**Figure 3.** Training compositional classifier. Given a real video and its prompt, we invert the video into the latent space of the frozen text-to-video model and extract subject-token cross-attention maps. These maps are processed by an aggregation module, a frozen VLM, and a trainable classification head to predict the target compositional relation label using cross-entropy supervision. … classifier on top of these att…

**Figure 4.** Qualitative results. Text-to-video results before and after applying CVG on Wan2.2-14B (58) and CogVideoX-5B (22). CVG strictly adheres to compositional instructions. 4.3 User Study. We further conduct a human preference study to evaluate whether the quantitative gains of CVG translate into perceptual improvements in generated videos. We randomly select 100 prompts from T2V-CompBench and generate videos usi…

**Figure 5.** User study results. Human preference comparison between CVG and TTOM. Participants compare generated videos along text alignment, motion, quality, and compositionality, and select whether CVG, TTOM, or both are better. The same pattern holds in the long-video setting with Rolling Forcing. CVG is preferred for Compositionality in 40.7% of comparisons, compared to 17.2% for TTOM. It is also preferred for Mot…

**Figure 6.** Additional Qualitative Results comparison between CVG and Wan2.2-14B (right), and between CVG and …
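The training setup in the Figure 3 caption (frozen feature extractor, trainable head, cross-entropy supervision) can be sketched on synthetic data. Everything here is a stand-in: `frozen_features` replaces the aggregation module and frozen VLM with a fixed hand-written function, the attention maps are randomly generated, and the two labels are invented. Only the linear head is updated, which is the point being illustrated.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frozen_features(attn_map):
    """Stand-in for aggregation + frozen backbone: share of attention mass per half."""
    half = len(attn_map[0]) // 2
    left = sum(sum(row[:half]) for row in attn_map)
    right = sum(sum(row[half:]) for row in attn_map)
    return [left / (left + right), right / (left + right)]

def train_head(features, labels, num_classes, epochs=100, lr=0.5):
    """Fit a linear softmax head with per-sample gradient descent on cross-entropy."""
    weights = [[0.0] * len(features[0]) for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
            for k, row in enumerate(weights):
                err = probs[k] - (1.0 if k == y else 0.0)
                for j in range(len(row)):
                    row[j] -= lr * err * x[j]
    return weights

def predict(weights, x):
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return logits.index(max(logits))

def synthetic_map(label, rng, size=4):
    # Label 0 puts most attention mass on the left half, label 1 on the right half.
    def cell(col):
        on_subject_side = (col < size // 2) == (label == 0)
        return (1.0 if on_subject_side else 0.1) + 0.2 * rng.random()
    return [[cell(col) for col in range(size)] for _ in range(size)]

rng = random.Random(0)
train_labels = [i % 2 for i in range(20)]
train_feats = [frozen_features(synthetic_map(y, rng)) for y in train_labels]
head = train_head(train_feats, train_labels, num_classes=2)

test_labels = [i % 2 for i in range(10)]
test_feats = [frozen_features(synthetic_map(y, rng)) for y in test_labels]
accuracy = sum(predict(head, x) == y for x, y in zip(test_feats, test_labels)) / len(test_labels)
```

Keeping the feature extractor fixed means the head has few parameters to fit, which is consistent with the paper's description of the classifier as lightweight.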

Original abstract

Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

CVG introduces inference-time steering of video diffusion via a classifier on cross-attention features, but the abstract reports no numbers for the claimed gains and leaves the temporal-encoding assumption untested.

CVG is an inference-time method that trains a lightweight classifier on cross-attention maps from a frozen text-to-video model and back-propagates its gradients in early denoising steps to push the output toward better prompt composition.

What stands out as new is the specific setup: a transferable classifier built on a VLM backbone that learns from attention features rather than from layouts or category-specific labels, then used for guidance without any generator fine-tuning or extra user controls. The paper frames this as a practical fix for known weaknesses in handling relations, attributes, actions, and motion.

The approach is presented cleanly. Avoiding retraining and external controls is a real advantage for downstream use, and the claim that the classifier transfers across semantically related compositions is a reasonable design move.

The main soft spot is the missing evidence. The abstract states that experiments on compositional benchmarks show improved faithfulness while preserving quality, yet supplies no numbers, ablations, or details on which attention layers are used or how the classifier is trained. The stress-test concern lands here: standard cross-attention in video models is largely spatial and token-aligned, while temporal dynamics often sit in separate layers. If those maps do not carry distinguishable signals for dynamic relations or motion directions, the guidance signal could be weak or noisy. The paper would need to show that the chosen features actually separate correct from incorrect compositions for the steering to deliver the claimed results.
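A simple diagnostic for the separability concern is to check whether a feature read from the attention maps tracks a motion label at all. The sketch below uses synthetic per-frame maps and a hand-built feature (horizontal centroid displacement), both of which are assumptions made for illustration; a thresholded feature like this is a weaker check than a proper probe on real attention maps, and says nothing about what the paper's classifier actually learns.

```python
def horizontal_centroid(frame):
    """Attention-weighted mean column index of one 2-D attention map."""
    total = sum(sum(row) for row in frame)
    weighted = sum(col * v for row in frame for col, v in enumerate(row))
    return weighted / total

def centroid_displacement(frames):
    """Signed horizontal shift of the attention centroid from first to last frame."""
    return horizontal_centroid(frames[-1]) - horizontal_centroid(frames[0])

def sign_probe_accuracy(clips, labels):
    """Fraction of clips whose displacement sign matches the label (+1 right, -1 left)."""
    correct = 0
    for frames, label in zip(clips, labels):
        predicted = 1 if centroid_displacement(frames) > 0 else -1
        correct += int(predicted == label)
    return correct / len(clips)

def blob_frame(center_col, width=6, height=3):
    # Synthetic map: attention mass concentrated in one column plus a small floor.
    return [[1.0 if col == center_col else 0.05 for col in range(width)]
            for _ in range(height)]

# Two synthetic clips: the subject token's attention drifts right, then left.
moving_right = [blob_frame(c) for c in (1, 2, 3, 4)]
moving_left = [blob_frame(c) for c in (4, 3, 2, 1)]
accuracy = sign_probe_accuracy([moving_right, moving_left], [1, -1])
```

If real subject-token maps showed no such drift across frames for motion prompts, that would support the worry that temporal information sits in other layers.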

This paper is for researchers working on inference-time fixes for diffusion models or on compositional generation. A reader already following guidance techniques would get value from the method description even before seeing results.

It deserves peer review because the core idea is distinct and addresses a practical limitation, though the current write-up is too thin on data to judge whether the steering works as described.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CVG, an inference-time guidance technique for improving compositional faithfulness (relations, attributes, actions, motion directions) in frozen text-to-video diffusion models. It trains a lightweight classifier on cross-attention features extracted from the generator and back-propagates its gradients during early denoising steps to steer the latent trajectory, without architecture changes, fine-tuning, or user-provided layouts.

Significance. If the central mechanism holds, the result would be significant: it demonstrates that internal cross-attention signals can be leveraged for compositional steering at inference time, offering a practical alternative to retraining or auxiliary control signals while preserving generator quality. The transferability of the classifier across related composition labels is a notable design choice.

major comments (2)

[§3] §3 (method description): The claim that cross-attention maps already encode grounding 'across space and time' for dynamic elements (actions, motion directions) is load-bearing for the guidance signal. Standard video diffusion attention is often spatially dominant with temporal dynamics handled in separate layers; without explicit evidence that the selected maps linearly separate correct vs. incorrect compositions for relational/motion cases, the classifier gradients may be weak or noisy.
[Experiments] Experiments section: The reported improvements in prompt faithfulness on compositional benchmarks must be supported by quantitative metrics (e.g., composition accuracy scores, human evaluations) and ablations isolating the contribution of the attention-based classifier versus generic guidance; the abstract alone does not establish that the steering produces the claimed gains without quality degradation.

minor comments (2)

[§3.1] Clarify the exact attention layers and feature extraction procedure used for the classifier input, including any temporal aggregation.
[§3.2] The transfer claim for the classifier across semantically related labels would benefit from a brief statement of the label set and training protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. Below we address each major comment point by point, indicating planned revisions where appropriate.

Referee: [§3] §3 (method description): The claim that cross-attention maps already encode grounding 'across space and time' for dynamic elements (actions, motion directions) is load-bearing for the guidance signal. Standard video diffusion attention is often spatially dominant with temporal dynamics handled in separate layers; without explicit evidence that the selected maps linearly separate correct vs. incorrect compositions for relational/motion cases, the classifier gradients may be weak or noisy.

Authors: We agree that stronger direct evidence for linear separability of correct versus incorrect compositions in the selected cross-attention features would strengthen the justification for the guidance signal. The current manuscript supports the claim via the classifier's training accuracy on held-out relational and motion compositions together with qualitative attention visualizations; however, these do not constitute an explicit linear-probe analysis. We will add a short subsection with linear probing results and feature-separability metrics on the attention maps for the dynamic cases.

revision: partial

Referee: [Experiments] Experiments section: The reported improvements in prompt faithfulness on compositional benchmarks must be supported by quantitative metrics (e.g., composition accuracy scores, human evaluations) and ablations isolating the contribution of the attention-based classifier versus generic guidance; the abstract alone does not establish that the steering produces the claimed gains without quality degradation.

Authors: Section 4 of the manuscript already reports quantitative composition accuracy on the benchmarks, human preference studies, and ablations that compare the attention-based classifier against generic guidance baselines while tracking quality metrics (FID, CLIP similarity). These results show gains in faithfulness with no measurable quality drop. We will add a brief summary table in the main text that directly juxtaposes the classifier-guided results against the generic-guidance controls to make the isolation clearer.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper's central method trains an external lightweight classifier on frozen cross-attention features extracted from the generator and then applies its gradients for inference-time steering. This classifier training step is independent of the target compositional faithfulness metric and does not reduce to a fitted parameter or self-citation chain that defines the result by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation. The approach is therefore a standard external-guidance technique whose validity rests on empirical validation rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on the unstated assumptions that attention maps contain sufficient compositional signal and that the classifier generalizes across labels.

pith-pipeline@v0.9.1-grok · 5724 in / 1036 out tokens · 21857 ms · 2026-06-30T21:30:57.163016+00:00 · methodology

Reference graph

Works this paper leans on

76 extracted references · 31 canonical work pages · 12 internal anchors

[1]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops

Bansal, A., Chu, H.M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., Goldstein, T.: Universal guidance for diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 843–852. IEEE/CVF, Vancouver, Canada (2023)

2023
[2]

In: Proceedings of the 40th International Conference on Machine Learning

Bar-Tal, O., Yariv, L., Lipman, Y ., Dekel, T.: Multidiffusion: Fusing diffusion paths for controlled image generation. In: Proceedings of the 40th International Conference on Machine Learning. pp. 1737–1752. PMLR, Honolulu, Hawaii, USA (2023)

2023
[3]

Black, K., Janner, M., Du, Y ., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning (2024),https://arxiv.org/abs/2305.13301

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

ACM Trans

Chefer, H., Alaluf, Y ., Vinker, Y ., Wolf, L., Cohen-Or, D.: Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graph.42(4) (Jul 2023). https://doi.org/10.1145/3592116, https://doi.org/10.1145/3592116

work page doi:10.1145/3592116 2023
[5]

Clark, K., Vicol, P., Swersky, K., Fleet, D.J.: Directly fine-tuning diffusion models on differentiable rewards (2024),https://arxiv.org/abs/2309.17400

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis (2021), https://arxiv.org/abs/ 2105.05233

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Du, Y ., Li, S., Mordatch, I.: Compositional visual generation and inference with energy based models (2020), https://arxiv.org/abs/2004.06030

work page arXiv 2020
[8]

Fan, Z., Wang, Z., Zhang, W.: Taocache: Structure-maintained video generation acceleration (2025), https: //arxiv.org/abs/2508.08978

work page arXiv 2025
[9]

Fang, X., Ma, L., Chen, Z., Zhou, M., Qi, G.J.: Inflvg: Reinforce inference-time consistent long video generation with grpo (2025)

2025
[10]

E., and Wang, W

Feng, W., He, X., Fu, T.J., Jampani, V ., Akula, A.R., Narayana, P., Basu, S., Wang, X.E., Wang, W.Y .: Training- free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032 (2023)

work page arXiv 2023
[11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Geng, D., Herrmann, C., Hur, J., Cole, F., Zhang, S., Pfaff, T., Lopez-Guevara, T., Aytar, Y ., Rubinstein, M., Sun, C., Wang, O., Owens, A., Sun, D.: Motion prompting: Controlling video generation with motion trajectories. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1–12. IEEE/CVF, Nashville, Tennessee, USA (2025)

2025
[12]

Guo, Y ., Yang, C., Rao, A., Liang, Z., Wang, Y ., Qiao, Y ., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning (2023)

2023
[13]

In: Proceedings of the European Conference on Computer Vision

Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Li, F.F., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. In: Proceedings of the European Conference on Computer Vision. pp. 393–411. Springer, Cham (2024)

2024
[14]

He, H., Liang, J., Wang, X., Wan, P., Zhang, D., Gai, K., Pan, L.: Scaling image and video generation via test-time evolutionary search (2025)

2025
[15]

He, W., Liu, M., Yu, Y ., Wang, Z., Wu, C.: Dyst-xl: Dynamic layout planning and content control for compositional text-to-video generation (2025),https://arxiv.org/abs/2504.15032

work page arXiv 2025
[16]

He, Y ., Salakhutdinov, R., Kolter, J.Z.: Localized text-to-image generation for free via cross attention control (2023),https://arxiv.org/abs/2306.14636

work page arXiv 2023
[17]

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y ., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control (2022),https://arxiv.org/abs/2208.01626

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models (2022)

2022
[19]

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models (2020), https://arxiv.org/abs/2006. 11239

2020
[20]

Ho, J., Salimans, T.: Classifier-free diffusion guidance (2022),https://arxiv.org/abs/2207.12598

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Advances in neural information processing systems35, 8633–8646 (2022)

Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems35, 8633–8646 (2022)

2022
[22]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022) 11 APREPRINT-

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion (2025)

2025
[24]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, Z., He, Y ., Yu, J., Zhang, F., Si, C., Jiang, Y ., Zhang, Y ., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y ., Chen, X., Wang, L., Lin, D., Qiao, Y ., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818. IEEE, Seattle, W A, USA (2024)

2024
[25]

Huang, Z., Yu, N., Chen, G., Qiu, H., Debevec, P., Liu, Z.: Vchain: Chain-of-visual-thought for reasoning in video generation (2025),https://arxiv.org/abs/2510.05094

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Vbench++: Comprehensive and ver- satile benchmark suite for video generative models.arXiv preprint arXiv:2411.13503, 2024

Huang, Z., Zhang, F., Xu, X., He, Y ., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y ., Wang, Y ., Chen, X., Chen, Y .C., Wang, L., Lin, D., Qiao, Y ., Liu, Z.: VBench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503 (2024)

work page arXiv 2024
[27]

Jin, Y ., Sun, Z., Li, N., Xu, K., Jiang, H., Zhuang, N., Huang, Q., Song, Y ., Mu, Y ., Lin, Z.: Pyramidal flow matching for efficient video generative modeling (2024)

2024
[28]

Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V ., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation (2023)

2023
[29]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, Y ., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y .J.: Gligen: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22511–22521. IEEE/CVF, Vancouver, Canada (2023)

2023
[30]

Lin, B., Ge, Y ., Cheng, X., Li, Z., Zhu, B., Wang, S., He, X., Ye, Y ., Yuan, S., Chen, L., et al.: Open-sora plan: Open-source large video generation model (2024)

2024
[31]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Liu, F., Wang, H., Cai, Y ., Zhang, K., Zhan, X., Duan, Y .: Video-t1: Test-time scaling for video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18671–18681. IEEE/CVF, Honolulu, Hawaii, USA (2025)

2025
[32]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liu, F., Zhang, S., Wang, X., Wei, Y ., Qiu, H., Zhao, Y ., Zhang, Y ., Ye, Q., Wan, F.: Timestep embedding tells: It’s time to cache for video diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7353–7363. IEEE/CVF, Nashville, Tennessee, USA (2025)

2025
[33]

Liu, K., Hu, W., Xu, J., Shan, Y ., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time (2025)

2025
[34]

In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T

Liu, N., Li, S., Du, Y ., Torralba, A., Tenenbaum, J.B.: Compositional visual generation with composable diffusion models. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 423–439. Springer Nature Switzerland, Cham (2022)

2022
[35]

arXiv preprint arXiv:2405.06948 (2024)

Liu, S., Wang, B., Ma, Y ., Yang, T., Cao, X., Chen, Q., Li, H., Dong, D., Jiang, P.: Training-free subject-enhanced attention guidance for compositional text-to-image generation. arXiv preprint arXiv:2405.06948 (2024)

work page arXiv 2024
[36]

In: Advances in Neural Information Processing Systems

Lu, Y ., Liang, Y ., Zhu, L., Yang, Y .: Freelong: Training-free long video generation with spectralblend temporal attention. In: Advances in Neural Information Processing Systems. vol. 37, pp. 131434–131455. Curran Associates, Inc., Red Hook, NY , USA (2024). https://doi.org/10.52202/079017-4177, https://proceedings.neurips. cc/paper_files/paper/2024/fil...

work page doi:10.52202/079017-4177 2024
[37]

Ma, N., Tong, S., Jia, H., Hu, H., Su, Y .C., Zhang, M., Yang, X., Li, Y ., Jaakkola, T., Jia, X., Xie, S.: Inference-time scaling for diffusion models beyond scaling denoising steps (2025),https://arxiv.org/abs/2501.09732

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

Oh, G., Jeong, J., Kim, S., Byeon, W., Kim, J., Kim, S., Kim, S.: Mevg: Multi-event video generation with text-to- video models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 401–418. Springer Nature Switzerland, Cham (2025)

2024
[39]

Oshima, Y ., Suzuki, M., Matsuo, Y ., Furuta, H.: Inference-time text-to-video alignment with diffusion latent beam search (2025)

2025
[40]

Papalampidi, P., Wiles, O., Ktena, I., Shtedritski, A., Bugliarello, E., Kajic, I., Albuquerque, I., Nematzadeh, A.: Dynamic classifier-free diffusion guidance via online feedback (2025), https://arxiv.org/abs/2509.16131

work page arXiv 2025
[41]

Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion (2022), https: //arxiv.org/abs/2209.14988

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Prabhudesai, M., Mendonca, R., Qin, Z., Fragkiadaki, K., Pathak, D.: Video diffusion alignment via reward gradients (2024),https://arxiv.org/abs/2407.08737

work page arXiv 2024
[43]

Qu, L., Wang, Z., Zheng, N., Wang, W., Nie, L., Chua, T.S.: TTOM: Test-time optimization and memorization for compositional video generation (2025) 12 APREPRINT-

2025
[44]

https://huggingface.co/docs/transformers/model_doc/qwen2_5_vl (2025), hugging Face Transformers documentation

Qwen Team: Qwen2.5-vl. https://huggingface.co/docs/transformers/model_doc/qwen2_5_vl (2025), hugging Face Transformers documentation

2025
[45]

Ren, W., Yang, H., Zhang, G., Wei, C., Du, X., Huang, W., Chen, W.: Consisti2v: Enhancing visual consistency for image-to-video generation (2024),https://arxiv.org/abs/2402.04324

work page arXiv 2024
[46]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2022),https://arxiv.org/abs/2112.10752

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

In: Proceedings of the 2019 on International Conference on Multimedia Retrieval

Shang, X., Di, D., Xiao, J., Cao, Y ., Yang, X., Chua, T.S.: Annotating objects and relations in user-generated videos. In: Proceedings of the 2019 on International Conference on Multimedia Retrieval. p. 279–287. ICMR ’19, Association for Computing Machinery, New York, NY , USA (2019). https://doi.org/10.1145/3323873.3325056, https://doi.org/10.1145/33238...

work page doi:10.1145/3323873.3325056 2019
[48]

In: Proceedings of the International Conference on Multimedia Retrieval

Shang, X., Di, D., Xiao, J., Cao, Y ., Yang, X., Chua, T.S.: Annotating objects and relations in user-generated videos. In: Proceedings of the International Conference on Multimedia Retrieval. pp. 279–287. ACM, Ottawa, Ontario, Canada (2019)

2019
[49]

In: Proceedings of the 25th ACM International Conference on Multimedia

Shang, X., Ren, T., Guo, J., Zhang, H., Chua, T.S.: Video visual relation detection. In: Proceedings of the 25th ACM International Conference on Multimedia. p. 1300–1308. MM ’17, Association for Computing Machinery, New York, NY , USA (2017). https://doi.org/10.1145/3123266.3123380,https://doi.org/10.1145/3123266. 3123380

work page doi:10.1145/3123266.3123380 2017
[50]

In: Proceedings of the 25th ACM International Conference on Multimedia

Shang, X., Ren, T., Guo, J., Zhang, H., Chua, T.S.: Video visual relation detection. In: Proceedings of the 25th ACM International Conference on Multimedia. pp. 1300–1308. ACM, Mountain View, CA, USA (2017)

2017
[51]

Shaulov, A., Hazan, I., Wolf, L., Chefer, H.: FlowMo: Variance-based flow guidance for coherent motion in video generation (2025)

2025
[52]

Shaulov, A., Shaar, E., Edenzon, A., Wolf, L.: TokenTrim: Inference-time token pruning for autoregressive long video generation (2026)

2026
[53]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Shi, Y ., Xue, C., Liew, J.H., Pan, J., Yan, H., Zhang, W., Tan, V .Y .F., Bai, S.: Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8839–8849. IEEE/CVF, Seattle, Washington, USA (2024)

2024
[54]

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data (2022)

2022
[55]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Sun, K., Huang, K., Liu, X., Wu, Y ., Xu, Z., Li, Z., Liu, X.: T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8406–8416. IEEE/CVF, Nashville, Tennessee, USA (2025)

2025
[56]

In: Advances in Neural In- formation Processing Systems

Tian, Y ., Yang, L., Yang, H., Gao, Y ., Deng, Y ., Chen, J., Wang, X., Yu, Z., Tao, X., Wan, P., Zhang, D., Cui, B.: Videotetris: Towards compositional text-to-video generation. In: Advances in Neural In- formation Processing Systems. vol. 37, pp. 29489–29513. Curran Associates, Inc., Red Hook, NY , USA (2024). https://doi.org/10.52202/079017-0928, https...

work page doi:10.52202/079017-0928 2024
[57]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wallace, B., Gokul, A., Ermon, S., Naik, N.: End-to-end diffusion latent optimization improves classifier guidance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7280–7290. IEEE/CVF, Paris, France (2023)

2023
[58]

Wan Team, Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models (2025)

2025
[59]

Wan: Open and Advanced Large-Scale Video Generative Models

Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.203143(4), 6 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[60]

Wang, J., Zhang, Y ., Zou, J., Zeng, Y ., Wei, G., Yuan, L., Li, H.: Boximator: Generating rich and controllable motions for video synthesis (2024),https://arxiv.org/abs/2402.01566

work page arXiv 2024
[61]

Wang, W., Chen, Y ., Liu, Y ., Yuan, Q., Yang, S., Zhang, Y .: Mvoc: a training-free multiple video object composition method with diffusion models (2024),https://arxiv.org/abs/2406.15829

[62]

Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J.: Videocomposer: Compositional video synthesis with motion controllability. In: Advances in Neural Information Processing Systems. vol. 36, pp. 7594–7611. Curran Associates, Inc., Red Hook, NY, USA (2023), https://proceedings.neurips.cc/paper_files/paper/2023/file...

[63]

Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision 133(5), 3059–3078 (2025)

[64]

Wang, Y., Xiong, T., Zhou, D., Lin, Z., Zhao, Y., Kang, B., Feng, J., Liu, X.: Loong: Generating minute-level long videos with autoregressive language models (2024)

[65]

Wu, T., Si, C., Jiang, Y., Huang, Z., Liu, Z.: Freeinit: Bridging initialization gap in video diffusion models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 378–394. Springer Nature Switzerland, Cham (2025)

[66]

Wu, Z., Siarohin, A., Menapace, W., Skorokhodov, I., Fang, Y., Chordia, V., Gilitschenski, I., Tulyakov, S.: Mind the time: Temporally-controlled multi-event video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23989–24000. IEEE/CVF, Nashville, Tennessee, USA (2025)

[67]

Yang, X., Wang, X.: Compositional video generation as flow equalization (2024), https://arxiv.org/abs/2407.06182

[68]

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer (2024)

[69]

Yesiltepe, H., Yanardag, P.: Dynamic view synthesis as an inverse problem. arXiv preprint arXiv:2506.08004 (2025)

[70]

Yiflach, S.E., Atzmon, Y., Chechik, G.: Data-driven loss functions for inference-time optimization in text-to-image generation (2025)

[71]

Yin, S., Wu, C., Liang, J., Shi, J., Li, H., Ming, G., Duan, N.: Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory (2023), https://arxiv.org/abs/2308.08089

[72]

Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22963–22974. IEEE/CVF, Nashville, Tennessee, USA (2025)

[73]

Yu, J., Wang, Y., Zhao, C., Ghanem, B., Zhang, J.: Freedom: Training-free energy-guided conditional diffusion model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23174–23184. IEEE/CVF, Paris, France (2023)

[74]

Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847. IEEE/CVF, Paris, France (2023)

[75]

Zheng, G., Zhou, X., Li, X., Qi, Z., Shan, Y., Li, X.: Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22490–22499. IEEE/CVF, Vancouver, Canada (2023)

[76]

Zhuo, L., Zhao, L., Paul, S., Liao, Y., Zhang, R., Xin, Y., Gao, P., Elhoseiny, M., Li, H.: From reflection to perfection: Scaling inference-time optimization for text-to-image diffusion models via reflection tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15329–15339. IEEE/CVF, Honolulu, Hawaii, USA (2025)

