TrajLoc: Trajectory-Attention Localization for Multi-Object Motion Control

Avi Ben-Cohen; Inbar Huberman-Spiegelglas; Michael Rotman; Omer Sela; Sagie Benaim

arxiv: 2607.00861 · v1 · pith:HJL6XPCInew · submitted 2026-07-01 · 💻 cs.CV · cs.GR

Omer Sela, Inbar Huberman-Spiegelglas, Michael Rotman, Sagie Benaim, Avi Ben-Cohen

Pith reviewed 2026-07-02 14:19 UTC · model grok-4.3

keywords multi-object trajectory control · image-to-video generation · attention localization · Gaussian heatmaps · object identity preservation · motion control · video synthesis

The pith

Substituting cross-attention weights with per-object Gaussian heatmaps isolates trajectories for multi-object video control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that multi-object motion control in image-to-video generation works better when each object's trajectory is enforced independently inside the attention layers rather than through entangled shared signals. A sympathetic reader would care because existing methods lose object identities and fail to follow distinct paths accurately once scenes become crowded or paths cross. The approach achieves isolation by replacing the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same token interface also carries trajectory and depth information while first-frame appearance encodes identity. Tests on six datasets with up to twenty objects and two different backbones show consistent gains in visual quality and path accuracy.

Core claim

TrajLoc enforces strict per-object spatial constraints directly within the attention layers by substituting the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same per-object token interface carries trajectory and depth through a learned embedding and preserves identity by encoding first-frame appearance in place of an object token.

What carries the argument

Substitution of cross-attention weights with per-object Gaussian heatmaps centered on target locations at every frame.
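
As a rough illustration of the heatmaps this mechanism relies on, the sketch below builds one Gaussian map per frame for a single object. It is not the authors' implementation: the function names, the isotropic `sigma`, the latent grid size, and the choice to normalise each map so it sums to 1 are all assumptions made for the example.

```python
import math

def gaussian_heatmap(cx, cy, height, width, sigma):
    """Return a height x width grid of Gaussian weights centred on (cx, cy).

    Assumption: isotropic Gaussian, normalised so the grid sums to 1.
    """
    grid = [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

def heatmaps_for_trajectory(trajectory, height, width, sigma):
    """One heatmap per frame for a single object's (x, y) target positions."""
    return [gaussian_heatmap(cx, cy, height, width, sigma) for cx, cy in trajectory]

# Hypothetical three-frame trajectory on a tiny 4 x 6 latent grid.
maps = heatmaps_for_trajectory([(1.0, 1.0), (2.0, 1.0), (3.0, 2.0)],
                               height=4, width=6, sigma=1.0)
```

Each object would get its own list of maps, so the spatial constraint for one object never depends on where the others are.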

If this is right

Achieves average gains of +4.3 dB PSNR in visual fidelity across datasets.
Reduces trajectory end point error by 51 percent relative to strongest baselines.
Scales to scenes containing up to 20 simultaneously controlled objects.
Applies to two architecturally distinct video generation backbones.
Maintains improvements on out-of-distribution real-world scenes.
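
For readers who want to see what the two headline metrics measure, here is a minimal sketch using the standard PSNR definition and a mean Euclidean distance as the endpoint error. The paper's exact evaluation protocol is not given in the abstract, so the function names, the flat pixel representation, and the per-point averaging are assumptions.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equally sized flat pixel sequences."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)

def endpoint_error(target, achieved):
    """Mean Euclidean distance between target and achieved (x, y) positions."""
    return sum(math.dist(t, a) for t, a in zip(target, achieved)) / len(target)

def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline

def mse_ratio_for_psnr_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR gain at a fixed max_value."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, a +4.3 dB gain on a single frame pair corresponds to a mean squared error of roughly 37% of the baseline's; because the reported figure is an average across datasets, this conversion is only indicative.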

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The per-object token interface could support additional conditioning signals such as velocity or interaction rules without redesigning the attention structure.
The method may extend to tasks that require spatial localization in other generative domains such as image editing or 3D asset animation.
Overlapping heatmaps at intersection points could be monitored to detect and resolve potential identity swaps automatically.
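
The last extension could be prototyped along the following lines. This sketch scores the overlap of two objects' heatmaps per frame with the Bhattacharyya coefficient, which for two isotropic Gaussians sharing the same sigma has the closed form exp(-d² / (8σ²)). The shared sigma, the 0.5 threshold, and the function names are assumptions for illustration; the paper does not describe such a monitor.

```python
import math

def heatmap_overlap(center_a, center_b, sigma):
    """Bhattacharyya coefficient of two isotropic Gaussians with equal sigma.

    Returns 1.0 for coincident centres and decays towards 0 as they separate.
    """
    d2 = (center_a[0] - center_b[0]) ** 2 + (center_a[1] - center_b[1]) ** 2
    return math.exp(-d2 / (8.0 * sigma ** 2))

def flag_overlapping_frames(traj_a, traj_b, sigma, threshold=0.5):
    """Indices of frames where the two objects' heatmaps overlap strongly."""
    return [
        t for t, (a, b) in enumerate(zip(traj_a, traj_b))
        if heatmap_overlap(a, b, sigma) >= threshold
    ]

# Hypothetical trajectories that meet at the middle frame.
traj_a = [(0, 0), (5, 5), (10, 10)]
traj_b = [(10, 0), (5, 5), (0, 10)]
flagged = flag_overlapping_frames(traj_a, traj_b, sigma=2.0)
```

Flagged frames are where an identity swap is most plausible, so they are natural places to inspect attention maps or apply extra checks.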

Load-bearing premise

The per-object Gaussian heatmaps isolate instances and enforce spatial constraints independently without introducing artifacts or breaking coherent video synthesis when paths intersect or occlude.

What would settle it

Apply the method to a video scene where two object trajectories cross or one occludes the other and check whether object identities merge or video coherence visibly breaks.
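
To assemble such a test set, one could first filter scenes whose target paths actually cross. The sketch below treats each trajectory as a polyline and uses a standard orientation test for proper segment intersection. It checks only spatial crossing, not whether the two objects reach the crossing point at the same time, and it ignores collinear or merely touching segments; both simplifications, like the function names, are assumptions.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_cross(a1, a2, b1, b2):
    """True when segments a1-a2 and b1-b2 properly intersect."""
    return (_orient(a1, a2, b1) * _orient(a1, a2, b2) < 0
            and _orient(b1, b2, a1) * _orient(b1, b2, a2) < 0)

def trajectories_cross(traj_a, traj_b):
    """True when any segment of polyline traj_a crosses any segment of traj_b."""
    for i in range(len(traj_a) - 1):
        for j in range(len(traj_b) - 1):
            if segments_cross(traj_a[i], traj_a[i + 1], traj_b[j], traj_b[j + 1]):
                return True
    return False
```

Scenes for which `trajectories_cross` returns `True` for at least one object pair would form the crossing subset on which identity merging and coherence can be checked separately.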

Figures

Figures reproduced from arXiv: 2607.00861 by Avi Ben-Cohen, Inbar Huberman-Spiegelglas, Michael Rotman, Omer Sela, Sagie Benaim.

**Figure 1.** TrajLoc. Given a first frame and a set of target trajectories (left column, with colored polylines), the goal is to generate a video that moves each object along its prescribed path while preserving its visual identity. Top: multiple pedestrians on a synthetic urban scene. Bottom: sheep in a natural outdoor scene. The remaining columns show three uniformly spaced generated frames with the ground-truth posi…

**Figure 2.** An overview of TrajLoc. A structured text prompt “Scene where o_0 moves [traj_0], o_1 moves [traj_1], ...” is constructed, where each o_i is the object’s given category name. The trajectory tokens [traj_i] are replaced with learned embeddings from the pretrained (frozen) Enc_traj, which independently encodes each target trajectory (x_i(t), y_i(t), d_i(t)). The learned appearance encoder Enc_app encodes each obj…

**Figure 3.** Trajectory autoencoder pretraining. The trajectory encoder maps each object trajectory τ_i(t) = (x_i(t), y_i(t), d_i(t)) and a temporal position channel to a token embedding ⟨traj_i⟩ in the text encoder space. The embedding passes through the frozen text encoder before a decoder reconstructs the original trajectory, ensuring the representation remains informative after text encoder processing.

**Figure 4.** Each row shows three generated frames spanning the video (frames 9, 29, 49) where colored …

**Figure 4.** Qualitative comparison. Top-left: CogVideoX-5B results on DAVIS fish (5 objects, real-world). Top-right: WaN 2.1-14B results on MOT17 (4 pedestrians, real-world). Bottom-left: CogVideoX-5B results on MoVi-Extended (6 objects, synthetic). Bottom-right: WaN 2.1-14B results on MOTSynth (10 pedestrians, synthetic). Each row shows three generated frames from a different method, with ground-truth object position…
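
The structured prompt described in the Figure 2 caption can be sketched as a simple string template. The exact punctuation and placeholder syntax the authors use are not recoverable from the truncated caption, so the `[traj{i}]` format and the function name below are assumptions; in the paper these placeholders are later swapped for learned trajectory embeddings, which this sketch does not attempt.

```python
def build_prompt(categories):
    """Build 'Scene where <cat0> moves [traj0], <cat1> moves [traj1], ...'.

    `categories` holds one category name per controlled object; the index of
    each name ties it to its trajectory placeholder.
    """
    clauses = [f"{name} moves [traj{i}]" for i, name in enumerate(categories)]
    return "Scene where " + ", ".join(clauses)

# Hypothetical two-object scene.
prompt = build_prompt(["sheep", "dog"])
```

Keeping one clause per object is what gives every object its own token position, which the attention substitution then targets.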

Original abstract

Controlling the motion of multiple objects in image-to-video (I2V) generation requires preserving object identities while enforcing adherence to distinct target trajectories. This becomes particularly challenging as the number of objects increases and their paths intersect or occlude one another. Existing approaches entangle multiple trajectories within a shared, dense conditioning signal, making object-level correspondence difficult to preserve in crowded scenes. We depart from this paradigm and enforce a strict, per-object spatial constraint that isolates instances independently. Our method, TrajLoc, achieves this directly within the attention layers by substituting the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same per-object token interface carries trajectory and depth through a learned embedding and preserves identity by encoding first-frame appearance in place of an object token. Evaluations across six datasets, featuring up to 20 simultaneously controlled objects and out-of-distribution real-world scenes, demonstrate that our method consistently improves both visual fidelity and trajectory adherence. Applied to two architecturally distinct backbones (CogVideoX-5B and WaN 2.1-14B), our approach achieves average gains of +4.3 dB PSNR and a 51% reduction in trajectory end-point error compared to the strongest baselines. Project page: https://sela-omer.github.io/traj-loc/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TrajLoc swaps per-object Gaussian heatmaps into cross-attention to localize trajectories independently, delivering reported gains on two large backbones but leaving intersection cases unexamined.

The main thing to know is that TrajLoc replaces each object's cross-attention weights with an independent Gaussian heatmap centered on its target location at every frame. This is a direct architectural change inside existing attention layers rather than another shared conditioning signal.

It does a couple of things cleanly. The same per-object token carries both trajectory and depth through a learned embedding while the first-frame appearance stands in for identity. They apply the change to two architecturally different models, CogVideoX-5B and WaN 2.1-14B, and test on six datasets that include up to 20 simultaneous objects plus out-of-distribution real scenes. The headline numbers are a 4.3 dB PSNR lift and a 51% drop in endpoint error versus the strongest baselines.

The soft spot is exactly the one the stress-test flags. When paths cross or objects occlude, the joint nature of attention means the independent masks could either leak information or override the model's learned interaction cues. The paper gives only aggregate scores; nothing isolates performance on the intersecting subset that the introduction itself calls hardest. Without attention visualizations, per-scenario breakdowns, or error bars, it is difficult to tell whether the gains hold where the problem is most acute.

This is for people working on scalable object control in video diffusion models. A reader who wants a concrete attention-level trick and results on real-scale backbones will get something usable from it. The mechanism is specific enough and the evaluation broad enough that it deserves a serious referee, though the review should press for targeted experiments on crossings and occlusions.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TrajLoc for multi-object motion control in image-to-video generation. It enforces per-object spatial constraints by substituting the cross-attention weights of each object token with an independent Gaussian heatmap centered on the target location at every frame, while carrying trajectory and depth via a learned embedding and preserving identity through first-frame appearance encoding. The authors claim that this yields consistent gains in visual fidelity and trajectory adherence, with average improvements of +4.3 dB PSNR and a 51% reduction in endpoint error versus the strongest baselines when applied to CogVideoX-5B and WaN 2.1-14B across six datasets featuring up to 20 objects and out-of-distribution real-world scenes.

Significance. If the reported gains prove robust, the approach would provide a lightweight, backbone-agnostic mechanism for object isolation inside existing attention layers of large video diffusion models. The evaluation across two architecturally distinct backbones and on out-of-distribution scenes is a strength that supports broader applicability.

major comments (2)

[Abstract] Abstract: The central claim concerns performance in crowded scenes where paths intersect or occlude, yet the reported aggregate metrics (+4.3 dB PSNR, 51% EPE reduction) provide no breakdown or separate results on the intersecting/occluding subset. This leaves the load-bearing assumption that independent Gaussian substitutions preserve coherence without identity leakage or artifacts untested by the presented evidence.
[Method] Method description: The substitution of cross-attention weights with per-object Gaussians is presented as operating directly inside the model's attention layers, but no equations, pseudocode, or implementation details specify whether the replacement occurs before or after softmax, per attention head, or with cross-object normalization. This underspecification directly affects whether the joint attention computation can still model interactions when trajectories cross.
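
The ambiguity raised in the second major comment can be made concrete with a toy example. The sketch below contrasts two possible readings of "substituting the cross-attention weights": overwriting the object token's weight after softmax and rescaling the remaining tokens, versus overwriting its logit before softmax. Neither is claimed to be what the paper does. The layout (rows are spatial query positions, columns are text tokens), the peak-normalised heat values in [0, 1], and all names are assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def substitute_post_softmax(logits, obj_col, heat):
    """Softmax first, then set the object column to heat[q] and rescale the
    other tokens so each row still sums to 1."""
    out = []
    for row, h in zip(logits, heat):
        w = softmax(row)
        rest = 1.0 - w[obj_col]
        # If the object token already held all the mass there is nothing to rescale.
        scale = (1.0 - h) / rest if rest > 0 else 0.0
        out.append([h if k == obj_col else w[k] * scale for k in range(len(w))])
    return out

def substitute_pre_softmax(logits, obj_col, heat, eps=1e-8):
    """Replace the object logit with log(heat[q]), then softmax each row."""
    out = []
    for row, h in zip(logits, heat):
        new_row = list(row)
        new_row[obj_col] = math.log(h + eps)
        out.append(softmax(new_row))
    return out

# Two spatial positions, three text tokens; token 1 is the object token.
logits = [[0.0, 1.0, 2.0], [0.5, 0.5, 0.5]]
heat = [0.9, 0.1]
post = substitute_post_softmax(logits, obj_col=1, heat=heat)
pre = substitute_pre_softmax(logits, obj_col=1, heat=heat)
```

In the post-softmax variant the object token's weight equals the heatmap value exactly, while in the pre-softmax variant the other tokens' logits still compete with it, so the two readings can give quite different attention maps at the same location. That difference is why the referee asks for the timing to be specified.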

minor comments (2)

The quantitative claims would be strengthened by reporting error bars, standard deviations, or per-dataset breakdowns rather than averages alone.
Dataset statistics (number of sequences, resolution, trajectory generation procedure, and annotation protocol) are not described, which hinders assessment of the evaluation scope.
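
The first minor comment amounts to reporting a spread alongside each mean. A minimal sketch of a per-dataset summary follows; the dataset names and scores are placeholders invented for the example and are not results from the paper.

```python
from statistics import mean, stdev

def summarize(scores_by_dataset):
    """Return {dataset: (mean, sample standard deviation, count)}.

    A single-sample dataset gets a standard deviation of 0.0, since the
    sample standard deviation is undefined for one value.
    """
    return {
        name: (mean(values), stdev(values) if len(values) > 1 else 0.0, len(values))
        for name, values in scores_by_dataset.items()
    }

# Placeholder per-sequence scores, NOT values reported in the paper.
placeholder_scores = {"dataset_a": [1.0, 2.0, 3.0], "dataset_b": [4.0, 4.0]}
summary = summarize(placeholder_scores)
```

Reporting the count next to the spread also addresses part of the second minor comment, since it exposes how many sequences each average rests on.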

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation breakdown and method specification. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

Referee: [Abstract] Abstract: The central claim concerns performance in crowded scenes where paths intersect or occlude, yet the reported aggregate metrics (+4.3 dB PSNR, 51% EPE reduction) provide no breakdown or separate results on the intersecting/occluding subset. This leaves the load-bearing assumption that independent Gaussian substitutions preserve coherence without identity leakage or artifacts untested by the presented evidence.

Authors: We agree that the central claim focuses on crowded scenes with intersections and occlusions, and that aggregate metrics alone leave this aspect under-tested. The six datasets do contain such cases (up to 20 objects), but no subset analysis is currently reported. In the revised manuscript we will add a dedicated breakdown of PSNR and endpoint error on the intersecting/occluding subset to directly evaluate coherence preservation.

revision: yes

Referee: [Method] Method description: The substitution of cross-attention weights with per-object Gaussians is presented as operating directly inside the model's attention layers, but no equations, pseudocode, or implementation details specify whether the replacement occurs before or after softmax, per attention head, or with cross-object normalization. This underspecification directly affects whether the joint attention computation can still model interactions when trajectories cross.

Authors: The current manuscript does not provide these implementation specifics, which is an oversight in the method description. We will revise the method section to include explicit equations and pseudocode clarifying the substitution timing (relative to softmax), per-head application, and normalization procedure. This will make the interaction modeling behavior reproducible and address the concern about cross-trajectory coherence.

revision: yes

Circularity Check

0 steps flagged

No circularity: architectural substitution applied to external backbones

The paper presents TrajLoc as a direct per-object substitution of cross-attention weights by Gaussian heatmaps inside unmodified diffusion backbones (CogVideoX 5B, WaN 2.1 14B). No equations, fitted parameters, or predictions are shown that reduce the reported PSNR/EPE gains to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central mechanism and empirical results on six datasets stand as an independent architectural change evaluated against external baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; ledger entries are therefore minimal and provisional.

free parameters (1)

learned embedding for trajectory and depth
Mentioned as the carrier for trajectory and depth information through the per-object token interface.

axioms (1)

domain assumption: Gaussian heatmap substitution isolates object instances and enforces spatial constraints without side effects on coherence
Central to the method description in the abstract.


[1] [1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. InSIGGRAPH, 2023

2023

[3] [3]

Motion-zero: Zero-shot moving object control framework for diffusion-based video generation

Changgu Chen, Junwei Shu, Gaoqi He, Changbo Wang, and Yang Li. Motion-zero: Zero-shot moving object control framework for diffusion-based video generation. InAAAI, 2025

2025

[4] [4]

Wan-move: Motion-controllable video generation via latent trajectory guidance

Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, and Yujiu Yang. Wan-move: Motion-controllable video generation via latent trajectory guidance. InNeurIPS, 2025

2025

[5] [5]

Motchallenge: A benchmark for single-camera multiple target tracking.IJCV, 2021

Patrick Dendorfer, Aljoša Ošep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. Motchallenge: A benchmark for single-camera multiple target tracking.IJCV, 2021

2021

[6] [6]

Tenenbaum, Dale Schuurmans, and Pieter Abbeel

Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InNeurIPS, 2023

2023

[7] [7]

Evangelidis and Emmanouil Z

Georgios D. Evangelidis and Emmanouil Z. Psarakis. Parametric image alignment using enhanced correlation coefficient maximization.IEEE Trans. Pattern Anal. Mach. Intell., 30(10): 1858–1865, 2008. doi: 10.1109/TPAMI.2008.113

work page doi:10.1109/tpami.2008.113 2008

[8] [8]

Motsynth: How can synthetic data help pedestrian detection and tracking? InICCV, 2021

Matteo Fabbri, Guillem Brasó, Gianluca Maugeri, Orcun Cetintas, Riccardo Gasparini, Aljoša Ošep, Simone Calderara, Laura Leal-Taixé, and Rita Cucchiara. Motsynth: How can synthetic data help pedestrian detection and tracking? InICCV, 2021

2021

[9] [9]

Two-frame motion estimation based on polynomial expansion

Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. InImage Analysis (SCIA 2003), volume 2749 ofLecture Notes in Computer Science, pages 363–370. Springer, 2003. doi: 10.1007/3-540-45103-X_50

work page doi:10.1007/3-540-45103-x_50 2003

[10] [10]

Motion prompting: Controlling video generation with motion trajectories

Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Ta- tiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, et al. Motion prompting: Controlling video generation with motion trajectories. InCVPR, 2025

2025

[11] Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In CVPR, pages 22404–22415, 2025.

[12] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. In ICLR, 2023.

[13] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

[14] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022.

[15] Yash Jain, Anshul Nasery, Vibhav Vineet, and Harkirat Behl. Peekaboo: Interactive video generation via masked-diffusion. In CVPR, 2024.

[16] Longbin Ji, Lei Zhong, Pengfei Wei, and Changjian Li. Posetraj: Pose-aware trajectory control in video diffusion. In CVPR, 2025.

[17] Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance. In ICCV, 2025.

[18] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023.

[19] Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. In Conference on Robot Learning (CoRL), 2024.

[20] Zuhao Liu, Aleksandar Yanev, Ahmad Mahmood, Ivan Nikolov, Saman Motamed, Wei-Shi Zheng, Xi Wang, Lei Sun, Luc Van Gool, and Danda Pani Paudel. Intragen: Trajectory-controlled video generation for object interactions. arXiv preprint arXiv:2411.16804, 2024.

[21] Wan-Duo Kurt Ma, John P Lewis, and W Bastiaan Kleijn. Trailblazer: Trajectory control for diffusion-based video generation. In SIGGRAPH Asia, 2024.

[22] Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, and David B Lindell. Sg-i2v: Self-guided trajectory control in image-to-video generation. In ICLR, 2025.

[23] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

[24] Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, and Ziwei Liu. Freetraj: Tuning-free trajectory control in video diffusion models, 2024. URL https://arxiv.org/abs/2406.16863

[25] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. URL https://arxiv.org/abs/1812.01717

[27] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[28] Zhang Wan, Sheng Tang, Jiawei Wei, Ruize Zhang, and Juan Cao. Dragentity: Trajectory guided video generation using entity and positional relationships. In ACM MM, 2024.

[29] Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, and Chongyang Ma. Ati: Any trajectory instruction for controllable video generation. arXiv preprint arXiv:2505.22944, 2025.

[30] Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, and Limin Wang. Levitor: 3d trajectory oriented image-to-video synthesis. In CVPR, 2025.

[31] Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, and Hang Li. Boximator: Generating rich and controllable motions for video synthesis. In ICML, 2024.

[32] Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, and Kun Gai. Cinemaster: A 3d-aware and controllable framework for cinematic text-to-video generation. In SIGGRAPH, 2025.

[33] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. In ECCV, 2024.

[34] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In SIGGRAPH, 2024.

[35] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In ECCV, 2024.

[36] Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, and Feng Liu. Motioncanvas: Cinematic shot design with controllable image-to-video generation. In SIGGRAPH, 2025.

[37] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. In NeurIPS, 2024.

[38] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. In ICLR, 2025.

[39] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.

[40] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[41] Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. In CVPR, 2025.

[42] Zhiyuan Zhang, Can Wang, Dongdong Chen, and Jing Liao. Flextraj: Image-to-video generation with flexible point trajectory control. arXiv preprint arXiv:2510.08527, 2025.

[43] Zhongwei Zhang, Fuchen Long, Zhaofan Qiu, Yingwei Pan, Wu Liu, Ting Yao, and Tao Mei. Motionpro: A precise motion controller for image-to-video generation. In CVPR, 2025.

[44] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.

Appendix Contents

A Additional Qualitative Comparisons

B Failure Cases

B.1 GTA-V Training-Distribution Leakage on Out-of-Distribu...