arxiv: 2605.22996 · v1 · pith:OHCIT4CCnew · submitted 2026-05-21 · 💻 cs.CV

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

Pith reviewed 2026-05-25 05:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords controllable video generation · binary mask conditioning · motion layers · MMDiT · LoRA adaptation · MaskAdapter · interactive dynamics · diffusion transformer

The pith

CoMoGen generates videos with precise subject motion and interactions from binary mask sequences by adapting motion-specific layers in a diffusion transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoMoGen to produce videos showing realistic subject movements and interactions with people, objects, and scenes, using only an input image and a sequence of binary masks as control. It solves the problem of locating motion-handling parts inside uniform MMDiT transformer blocks by defining Motion Layers in attention space, then applies lightweight LoRA fine-tuning only to those layers. A MaskAdapter converts the mask sequence into a residual signal that is added through a cosine schedule. This selective adaptation keeps the base model unchanged while focusing computation on motion. If the approach holds, it would allow more accurate mask-driven video synthesis with lower training cost than full-model methods.

Core claim

CoMoGen enables precise subject motion and plausible interactions with surrounding humans, objects, and scenes by encoding binary mask sequences into a latent residual signal via a lightweight MaskAdapter, injecting the signal into the MMDiT model through a cosine-weighted schedule, identifying Motion Layers in the attention space of MMDiT, and fine-tuning only those layers with LoRA without any architecture change.
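
The cosine-weighted injection can be pictured with a small sketch. The summary does not state the schedule's formula or its direction, so the half-cosine weight below (1 at the first denoising step, 0 at the last) is an assumption, as are the names `cosine_weight` and `inject_residual` and the purely additive form of the injection.

```python
import math

def cosine_weight(step: int, num_steps: int) -> float:
    """Assumed half-cosine weight: 1.0 at step 0, 0.0 at the last step."""
    if num_steps < 2:
        return 1.0
    progress = step / (num_steps - 1)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def inject_residual(latent, residual, step, num_steps):
    """Add the weighted mask residual to a flat list of latent values."""
    w = cosine_weight(step, num_steps)
    return [z + w * dz for z, dz in zip(latent, residual)]

latent = [0.2, -0.1, 0.4]     # toy noised latent
delta_z = [1.0, 0.0, -1.0]    # hypothetical MaskAdapter output
for step in (0, 5, 10):
    print(step, round(cosine_weight(step, 11), 3),
          inject_residual(latent, delta_z, step, 11))
```

Under this assumed schedule the mask signal dominates early steps and fades out by the end; a schedule running in the opposite direction would only require flipping `progress`.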

What carries the argument

Motion Layers identified in the attention space of MMDiT, which are adapted via LoRA after mask signals are injected by the MaskAdapter.

If this is right

  • Videos can be generated with subject trajectories and interactions dictated directly by binary mask sequences.
  • Plausible contacts between the controlled subject and other scene elements occur without explicit 3D modeling.
  • Training cost drops because only a small subset of transformer layers receives LoRA updates.
  • The same base MMDiT model can be reused for different motion tasks by swapping the adapted Motion Layers.
  • Performance exceeds earlier mask-conditioned video methods on standard motion and realism metrics.
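
The training-cost point above can be made concrete with a parameter count. A LoRA pair on a weight of shape d_out x d_in adds rank x (d_in + d_out) trainable parameters, which is standard for LoRA. Everything else in the sketch is hypothetical: the layer count, model width, rank, number of adapted projections per layer, and the layer indices are placeholders, not values from the paper.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def adapted_params(layer_ids, d_model, rank, mats_per_layer=4):
    """Total LoRA parameters when `mats_per_layer` square projections per layer are adapted."""
    return len(layer_ids) * mats_per_layer * lora_params(d_model, d_model, rank)

# All numbers below are hypothetical, chosen only to make the ratio visible.
num_layers, d_model, rank = 40, 3072, 16
motion_layers = [3, 7, 12, 18, 25, 31]   # placeholder indices, not the paper's selection
all_layers = list(range(num_layers))

selective = adapted_params(motion_layers, d_model, rank)
full = adapted_params(all_layers, d_model, rank)
print(selective, full, selective / full)
```

With identical per-layer shapes, the trainable-parameter ratio reduces to the fraction of layers selected, so the saving depends directly on how few Motion Layers the selection procedure returns.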

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The layer-identification technique could be tested on other transformer video models to see whether similar motion subspaces exist.
  • Mask sequences might be combined with text prompts to add semantic constraints while retaining spatial control.
  • The cosine injection schedule could be replaced by learned schedules to check whether further motion accuracy is possible.
  • Real-time applications such as interactive video editing become feasible if inference remains close to the base model speed.

Load-bearing premise

The procedure for locating Motion Layers in MMDiT attention space correctly isolates the components responsible for motion so that LoRA on only those layers delivers motion control without unwanted effects on other parts of generation.
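
A minimal sketch of what such a selection procedure might look like, assuming a per-layer score equal to the fraction of attention mass inside the subject mask. The excerpts only say that Motion Layers show higher subject-mask alignment (Figure 3) and that the text-to-video and video-to-text directions are averaged (text captured with Figure 5); the score definition, the top-k rule, the per-video-token representation of both directions, and all function names here are assumptions.

```python
def alignment(attn_map, mask):
    """Fraction of attention mass that falls inside the subject mask (assumed score)."""
    total = sum(attn_map)
    inside = sum(a for a, m in zip(attn_map, mask) if m)
    return inside / total if total > 0 else 0.0

def layer_scores(t2v, v2t, mask):
    """Average the two attention directions per layer."""
    return [0.5 * (alignment(a, mask) + alignment(b, mask)) for a, b in zip(t2v, v2t)]

def pick_motion_layers(scores, k):
    """Return the indices of the k highest-scoring layers, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

mask = [0, 1, 1, 0]  # toy flattened subject mask over four video tokens
# One toy attention map per layer and direction (three layers).
t2v = [[0.25, 0.25, 0.25, 0.25], [0.1, 0.4, 0.4, 0.1], [0.4, 0.1, 0.1, 0.4]]
v2t = [[0.25, 0.25, 0.25, 0.25], [0.0, 0.5, 0.5, 0.0], [0.3, 0.2, 0.2, 0.3]]
scores = layer_scores(t2v, v2t, mask)
print(scores, pick_motion_layers(scores, k=1))
```

The premise would be tested by checking whether layers chosen this way behave differently from randomly chosen ones, which is the comparison described in the next section.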

What would settle it

An experiment showing that LoRA adaptation on the identified Motion Layers produces no measurable gain in motion fidelity or introduces visible artifacts in appearance, lighting, or non-motion elements compared with adapting random layers.

Figures

Figures reproduced from arXiv: 2605.22996 by Adil Meric, Benjamin Busam, Christian Theobalt, Lin Geng Foo, Mert Kiray, Rishabh Dabral.

Figure 1
Figure 1: Overview. Given a single input image and a binary mask sequence (left), CoMoGen generates videos where the masked subject follows the steering mask sequence while the model generates plausible interactions with the surroundings. On the right, we show applications: a) Text to human animation from textual motion and rendered masks, b) Object manipulation conditioned on a mask sequence, c) Motion to video an…
Figure 2
Figure 2: Framework overview. Given a spatiotemporal binary mask sequence, we first downsample it to latent resolution and project it into a latent residual ∆Z using a lightweight MaskAdapter. During image-to-video generation, we inject this residual by modulating the current noised video latent at the transformer input, and propagate the conditioning through only the identified Motion Layers (DiT Blocks marked in r…
Figure 3
Figure 3: Attention score over layers. Motion Layers (marked in red) show consistently higher subject-mask alignment than Non-Motion Layers (marked in blue).
[Panel labels from an adjacent figure: Frame RGB, Frame Mask, Motion Layers, Non-Motion Layers. Prompt: "A man grabs a backpack from the ground"]
Figure 5
Figure 5: Layer skipping. Skipping Non-Motion Layers mainly introduces artifacts, whereas skipping Motion Layers disrupts motion dynamics and temporal coherence.
Body text excerpt: … is the timestep and ℓ denotes the layer number. Following the findings of [10], we consider both the text-to-video and video-to-text attention directions, and compute their average to capture bidirectional dependency between visual and textual modalities. W…
Figure 6
Figure 6: Results on human–object interaction. We present generation results for (a) Force Propagation and (b) Control Signal Editing. For (b), we use the same input image with different control signals.
[Panel labels: Ours, GT]
Figure 7
Figure 7: Results on human-object interaction. We present generation results and the corresponding ground truths for different objects (box, luggage bag, and ball).
Body text excerpt: 5.1 Dataset. We use two different datasets CLEVRER [57] and BEHAVE [6], and train two different models on each dataset. CLEVRER [57] provides synthetic videos with multiple objects undergoing collision events. CLEVRER provides 10k training and 5k validati…
Figure 8
Figure 8: Qualitative comparison on human-object interactions and object col…
Figure 10
Figure 10: Out-of-distribution edits. We edit input images with a text-guided editor [29] and show that generated interactions remain coherent on fictional characters.
Body text excerpt: 5.5 Ablation Study. We ablate the effects of cosine-weighted latent injection, and restricting LoRA updates to Non-Motion Layers
Figure 11
Figure 11: Attention Score visualization across the layers with error bars.
Body text excerpt: … (IoU) between the pseudo ground-truth mask and predicted mask which represents region similarity. F is the contour accuracy, computed from contour precision/recall, emphasizing shape alignment. HOTA (Higher Order Tracking Accuracy) is a standard multi object tracking (MOT) metric that aids in measuring identity consistency, we report HOTA …
Figure 12
Figure 12: Attention Score visualization of CogVideoX [55].
Table captured with the figure (layer skipping):
Method                   J ↑    F ↑    J&F ↑   HOTA ↑
Skip Motion Layers       39.4   32.0   35.7    36.5
Skip Non-Motion Layers   53.5   48.3   50.9    55.0
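
The J and J&F columns in the layer-skipping numbers reported with Figure 12 can be reproduced with a short sketch. J is intersection over union between a predicted and a reference mask, and the reported J&F values match the plain mean of J and F. The flat 0/1 list representation, the function names, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the contour measure F is not implemented here.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0  # assumed convention for two empty masks

def j_and_f(j, f):
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0

pred = [0, 1, 1, 1, 0, 0]   # toy predicted mask
gt = [0, 0, 1, 1, 1, 0]     # toy reference mask
print(region_similarity(pred, gt))

# Check the J&F column against the J and F columns of the layer-skipping rows.
print(round(j_and_f(39.4, 32.0), 1), round(j_and_f(53.5, 48.3), 1))
```

Both rows are consistent with the mean: (39.4 + 32.0) / 2 = 35.7 and (53.5 + 48.3) / 2 = 50.9.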
Original abstract

We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule. Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify which layers are responsible for the motion generation. Therefore, we propose a novel way to determine "Motion Layers" operating in the attention space of MMDiT. We fine-tune the model by using Low-Rank Adaptation (LoRA) to the Motion Layers, without requiring any architecture change in the MMDiT. This selective adaptation enables our method to focus on motion-critical components, yielding reduced computational cost. Despite its simplicity, CoMoGen enables precise subject motion and plausible interactions with surrounding humans, objects, and scenes. Comprehensive experiments on different datasets show that CoMoGen consistently outperforms prior controllable video generation methods and achieves state-of-the-art performance in motion fidelity and perceptual realism. Project page: mericadil.github.io/CoMoGen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents CoMoGen, a framework for generating controllable videos from an input image and binary mask sequence. It introduces a MaskAdapter that encodes the mask sequence as a latent residual injected into MMDiT via a cosine-weighted schedule, identifies 'Motion Layers' in the MMDiT attention space, applies LoRA only to those layers for fine-tuning, and claims this yields precise subject motion, plausible interactions, and SOTA results on motion fidelity and perceptual realism across datasets without architecture changes.

Significance. If the motion-layer selection mechanism is shown to isolate motion-critical components without side effects, the approach would provide a practical, low-cost adaptation strategy for uniform transformer-based video diffusion models that lack UNet-style hierarchy, potentially improving efficiency and controllability in mask-guided generation.

major comments (2)
  1. [Abstract / Method] Abstract and method description: the procedure for determining 'Motion Layers' in MMDiT attention space is stated to be novel and necessary because MMDiT lacks UNet hierarchy, yet no criterion, metric (e.g., attention statistics, gradient importance), algorithm, or layer-wise ablation is supplied. This selection is load-bearing for the central claim that selective LoRA produces motion control without affecting appearance or scene consistency.
  2. [Abstract] Abstract: the claim of 'comprehensive experiments' and 'state-of-the-art performance in motion fidelity and perceptual realism' is unsupported by any quantitative metrics, dataset names/sizes, baseline comparisons, or validation protocol for the motion layers. Without these, the data-to-claim link cannot be evaluated.
minor comments (1)
  1. [Abstract] The abstract asserts 'precise subject motion and plausible interactions' but provides no definition or measurement protocol for these properties.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify key areas where additional detail would strengthen the manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: the procedure for determining 'Motion Layers' in MMDiT attention space is stated to be novel and necessary because MMDiT lacks UNet hierarchy, yet no criterion, metric (e.g., attention statistics, gradient importance), algorithm, or layer-wise ablation is supplied. This selection is load-bearing for the central claim that selective LoRA produces motion control without affecting appearance or scene consistency.

    Authors: We agree that the selection procedure for Motion Layers requires explicit documentation. The manuscript currently asserts novelty without supplying the underlying criterion, metrics, or ablations. In the revised version we will insert a dedicated subsection in the Method section that describes the attention-statistic-based selection algorithm, the precise metrics employed, and the layer-wise ablation results demonstrating that these layers control motion while preserving appearance and scene consistency. revision: yes

  2. Referee: [Abstract] Abstract: the claim of 'comprehensive experiments' and 'state-of-the-art performance in motion fidelity and perceptual realism' is unsupported by any quantitative metrics, dataset names/sizes, baseline comparisons, or validation protocol for the motion layers. Without these, the data-to-claim link cannot be evaluated.

    Authors: The abstract summarizes the experimental outcomes at a high level. To make the SOTA claims directly traceable to data, we will revise the abstract to name the datasets and their sizes, report the principal quantitative metrics (motion fidelity and perceptual realism scores), and reference the baseline comparisons. The full experimental protocol, including the motion-layer validation, already appears in Section 5; the abstract revision will ensure the high-level claims are explicitly supported by those results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical engineering contribution

Full rationale

The paper's central mechanism is the proposal of a novel (but undetailed in the provided text) procedure for identifying Motion Layers in MMDiT attention space followed by selective LoRA fine-tuning. No equations, fitted parameters, or self-citations are shown that reduce the claimed motion control or performance gains to a definition or input by construction. The abstract frames the work as an empirical engineering advance with experimental validation on datasets, and the selection of layers is presented as a methodological choice rather than a self-referential derivation. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach builds on existing MMDiT and LoRA components without stating new fitted constants or unproven assumptions.

pith-pipeline@v0.9.0 · 5774 in / 1160 out tokens · 24063 ms · 2026-05-25T05:43:25.235817+00:00 · methodology

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 12 internal anchors

  1. [1]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Akkerman, R., Feng, H., Black, M.J., Tzionas, D., Abrevaya, V.F.: Interdyn: Controllable interactive dynamics with video diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 12467–12479 (2025)

  2. [2]

    arXiv preprint arXiv:2503.14492 (2025)

    Alhaija, H.A., Alvarez, J., Bala, M., Cai, T., Cao, T., Cha, L., Chen, J., Chen, M., Ferroni, F., Fidler, S., et al.: Cosmos-transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint arXiv:2503.14492 (2025)

  3. [3]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)

    Avrahami, O., Patashnik, O., Fried, O., Nemchinov, E., Aberman, K., Lischinski, D., Cohen-Or, D.: Stable flow: Vital layers for training-free image editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR). pp. 7877–7888 (June 2025)

  4. [4]

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. Proc. CVPR (2025)

  5. [5]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  6. [6]

    In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Bhatnagar, B.L., Xie, X., Petrov, I., Sminchisescu, C., Theobalt, C., Pons-Moll, G.: Behave: Dataset and method for tracking human object interactions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (jun 2022)

  7. [7]

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable video diffusion: Scaling latent video diffusion models to large datasets (2023), https://arxiv.org/abs/2311.15127

  8. [8]

    In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  9. [9]

    In: CVPR (2025), licensed under Modified Apache 2.0 with special crediting requirement

    Burgert, R., Xu, Y., Xian, W., Pilarski, O., Clausen, P., He, M., Ma, L., Deng, Y., Li, L., Mousavi, M., Ryoo, M., Debevec, P., Yu, N.: Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. In: CVPR (2025), licensed under Modified Apache 2.0 with special crediting requirement

  10. [10]

    arXiv:2412.18597 (2024)

    Cai, M., Cun, X., Li, X., Liu, W., Zhang, Z., Zhang, Y., Shan, Y., Yue, X.: Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video generation. arXiv:2412.18597 (2024)

  11. [11]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22560–22570 (October 2023)

  12. [12]

    arXiv preprint arXiv:2504.03072 (2025)

    Chang, P., Tang, J., Gross, M., Azevedo, V.C.: How i warped your noise: a temporally-correlated noise prior for diffusion models. arXiv preprint arXiv:2504.03072 (2025)

  13. [13]

    Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG) 42, 1–10 (2023), https://api.semanticscholar.org/CorpusID:256416326

  14. [14]

    In: Proceedings of the 41st International Conference on Machine Learning

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. In: Proceedings of the 41st International Conference on Machine Learning. ICML '24, JMLR.org (2024)

  15. [15]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Geng, D., Herrmann, C., Hur, J., Cole, F., Zhang, S., Pfaff, T., Lopez-Guevara, T., Aytar, Y., Rubinstein, M., Sun, C., et al.: Motion prompting: Controlling video generation with motion trajectories. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1–12 (2025)

  16. [16]

    arXiv preprint arXiv:2412.02700 (2024)

    Geng, D., Herrmann, C., Hur, J., Cole, F., Zhang, S., Pfaff, T., Lopez-Guevara, T., Doersch, C., Aytar, Y., Rubinstein, M., Sun, C., Wang, O., Owens, A., Sun, D.: Motion prompting: Controlling video generation with motion trajectories. arXiv preprint arXiv:2412.02700 (2024)

  17. [17]

    In: ICCV (2023)

    Goel, S., Pavlakos, G., Rajasegaran, J., Kanazawa, A., Malik, J.: Humans in 4D: Reconstructing and tracking humans with transformers. In: ICCV (2023)

  18. [18]

    Google: A new era of intelligence with gemini 3. https://blog.google/products-and-platforms/products/gemini/gemini-3/ (November 2025), accessed: 2026-01-21

  19. [19]

    arXiv preprint arXiv:2501.03847 (2025)

    Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., Dong, Z., Liu, Q., Lin, C., Liu, Z., Wang, W., Liu, Y.: Diffusion as shader: 3d-aware video diffusion for versatile video generation control. arXiv preprint arXiv:2501.03847 (2025)

  20. [20]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  21. [21]

    In: The Eleventh International Conference on Learning Representations (2023)

    Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross-attention control. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=_CDixzkzeyb

  22. [22]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)

  23. [23]

    In: Proceedings of the 34th International Conference on Neural Information Processing Systems

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS '20, Curran Associates Inc., Red Hook, NY, USA (2020)

  24. [24]

    Advances in neural information processing systems 35, 8633–8646 (2022)

    Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems 35, 8633–8646 (2022)

  25. [25]

    In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=nZeVKeeFYf9

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=nZeVKeeFYf9

  26. [26]

    arXiv preprint arXiv:2503.18950 (2025)

    Kim, T., Joo, H.: Target-aware video diffusion models. arXiv preprint arXiv:2503.18950 (2025)

  27. [27]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  28. [28]

    In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=arHJlYiY2J

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L., Wetzstein, G.: Collaborative video diffusion: Consistent multi-video generation with camera control. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=arHJlYiY2J

  29. [29]

    Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space (2025), https://arxiv.org/abs/2...

  30. [30]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Li, Q., Xing, Z., Wang, R., Zhang, H., Dai, Q., Wu, Z.: Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12112–12123 (October 2025)

  31. [31]

    Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., Ramanan, D.: Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291 (2024)

  32. [32]

    In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=aY3L65HgHJ

    Ling, P., Bu, J., Zhang, P., Dong, X., Zang, Y., Wu, T., Chen, H., Wang, J., Jin, Y.: Motionclone: Training-free motion cloning for controllable video generation. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=aY3L65HgHJ

  33. [33]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)

  34. [34]

    In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=Bkg6RiCqY7

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=Bkg6RiCqY7

  35. [35]

    Luiten, J., Hoffhues, A.: Trackeval. https://github.com/JonathonLuiten/TrackEval (2020)

  36. [36]

    International Journal of Computer Vision pp

    Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: Hota: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision pp. 1–31 (2020)

  37. [37]

    arXiv preprint arXiv:2412.05275 (2024)

    Meral, T.H.S., Yesiltepe, H., Dunlop, C., Yanardag, P.: Motionflow: Attention-driven motion transfer in video diffusion models. arXiv preprint arXiv:2412.05275 (2024)

  38. [38]

    In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=lvcWA24dxB

    Montanaro, A., Aira, L.S., Aiello, E., Valsesia, D., Magli, E.: Motioncraft: Physics-based zero-shot video generation. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=lvcWA24dxB

  39. [39]

    In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=uQjySppU9x

    Namekata, K., Bahmani, S., Wu, Z., Kant, Y., Gilitschenski, I., Lindell, D.B.: Sg-i2v: Self-guided trajectory control in image-to-video generation. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=uQjySppU9x

  40. [40]

    In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4172–4182 (2023).https://doi.org/10.1109/ICCV51070.2023.00387

  41. [41]

    In: Computer Vision and Pattern Recognition (2016)

    Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine- Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Computer Vision and Pattern Recognition (2016)

  42. [42]

    arXiv preprint arXiv:2406.16863 (2024)

    Qiu, H., Chen, Z., Wang, Z., He, Y., Xia, M., Liu, Z.: Freetraj: Tuning-free trajectory control in video diffusion models. arXiv preprint arXiv:2406.16863 (2024)

  43. [43]

    Qiu, H., Chen, Z., Wang, Z., He, Y., Xia, M., Liu, Z.: Freetraj: Tuning-free trajectory control via noise guided video diffusion (2025), https://openreview.net/forum?id=CU7QfWJ6nC

  44. [44]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024), https://arxiv.org/abs/2408.00714

  45. [45]

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)

  46. [46]

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Gontijo-Lopes, R., Ayan, B.K., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (...

  47. [47]

    SIGGRAPH 2024 (2024)

    Shi, X., Huang, Z., Wang, F.Y., Bian, W., Li, D., Zhang, Y., Zhang, M., Cheung, K.C., See, S., Qin, H., et al.: Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. SIGGRAPH 2024 (2024)

  48. [48]

    Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=St1giarCHLP

  49. [49]

    Sudhakar, S., Liu, R., Hoorick, B.V., Vondrick, C., Zemel, R.: Controlling the world by sleight of hand (2024), https://arxiv.org/abs/2408.07147

  50. [50]

    In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SJ1kSyO2jwu

    Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-or, D., Bermano, A.H.: Human motion diffusion model. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=SJ1kSyO2jwu

  51. [51]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

  52. [52]

    Advances in Neural Information Processing Systems 36, 7594–7611 (2023)

    Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J.: Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems 36, 7594–7611 (2023)

  53. [53]

    In: ACM SIGGRAPH 2024 Conference Papers

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  54. [54]

    Wu, W., Li, Z., Gu, Y., Zhao, R., He, Y., Zhang, D.J., Shou, M.Z., Li, Y., Gao, T., Zhang, D.: Draganything: Motion control for anything using entity representation (2024), https://arxiv.org/abs/2403.07420

  55. [55]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  56. [56]

    arXiv preprint arXiv:2311.17009 (2023)

    Yatim, D., Fridman, R., Bar-Tal, O., Kasten, Y., Dekel, T.: Space-time diffusion features for zero-shot text-driven motion transfer. arXiv preprint arXiv:2311.17009 (2023)

  57. [57]

    In: ICLR (2020)

    Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: CLEVRER: collision events for video representation and reasoning. In: ICLR (2020)

  58. [58]

    Yin, S., Wu, C., Liang, J., Shi, J., Li, H., Ming, G., Duan, N.: Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089 (2023)

  59. [59]

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models (2023)

  60. [60]

    arXiv (2025)

    Zhang, Y., Butt, A.A., Varol, G., Laptev, I.: Interpose: Learning to generate human-object interactions from large-scale web videos. arXiv (2025)

  61. [62]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Zhang, Z., Liao, J., Li, M., Dai, Z., Qiu, B., Zhu, S., Qin, L., Wang, W.: Tora: Trajectory-oriented diffusion transformer for video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2063–2073 (2025)

  62. [63]

    Open-Sora: Democratizing Efficient Video Production for All

    Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., You, Y.: Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 (2024)

Body text excerpt: A Implementation Details. We train models on two different datasets: one on CLEVRER [57] and one on BEHAVE [6]. This mirrors the experimental goal of isolati...