Recognition: unknown
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3
The pith
CT-1 transfers spatial reasoning from vision-language inputs to predict camera trajectories for controlled video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CT-1, built upon vision-language modules and a Diffusion Transformer, employs a Wavelet-based Regularization Loss in the frequency domain to learn complex camera trajectory distributions from the CT-200K dataset. The estimated trajectories are then integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions, bridging spatial reasoning and video synthesis to yield faithful, high-quality camera-controllable videos.
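The abstract does not spell out the form of the Wavelet-based Regularization Loss. As a rough, non-authoritative sketch of what a frequency-domain penalty on camera trajectories could look like, the snippet below applies a single-level Haar transform to predicted and reference camera translations and penalizes the mismatch of their low- and high-frequency coefficients. The function names, the Haar choice, and the weighting are illustrative assumptions, not details from the paper.

```python
import numpy as np

def haar_dwt_1d(x):
    """Single-level Haar wavelet transform along the time axis.

    x: (T, D) trajectory with T even. Returns (approx, detail), each (T//2, D).
    """
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency coefficients (coarse path)
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency coefficients (fine motion)
    return approx, detail

def wavelet_regularization_loss(pred_traj, gt_traj, w_low=1.0, w_high=2.0):
    """Hypothetical frequency-domain trajectory loss (not the paper's definition).

    Penalizes mismatch of low- and high-frequency wavelet coefficients between
    predicted and ground-truth camera translations, each of shape (T, 3).
    """
    pa, pd = haar_dwt_1d(pred_traj)
    ga, gd = haar_dwt_1d(gt_traj)
    low = np.mean((pa - ga) ** 2)    # error on the coarse camera path
    high = np.mean((pd - gd) ** 2)   # error on jitter and sharp moves
    return w_low * low + w_high * high

# Example: 16-frame trajectories of 3D camera positions
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(16, 3)) * 0.1, axis=0)
pred = gt + rng.normal(size=(16, 3)) * 0.02
print(wavelet_regularization_loss(pred, gt))
```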
What carries the argument
The CT-1 model, which uses vision-language modules and a Wavelet-based Regularization Loss to estimate camera trajectories for integration into video diffusion models.
Load-bearing premise
That trajectory estimates produced by the vision-language model under wavelet regularization, once inserted into the diffusion model, will translate into video outputs whose camera movements match the user's intended spatial paths.
What would settle it
A benchmark evaluation on videos with complex or out-of-distribution camera movements in which the generated results show no accuracy gain, or show visible misalignment with the requested paths, would falsify the claim that the transfer produces reliable control.
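To make that test concrete: one plausible way to quantify "accuracy gain" or "misalignment" (assumed here, not specified by the paper) is to recover the camera path from each generated video with an off-the-shelf pose estimator and compare it to the requested path via translation RMSE and a rotation geodesic angle, as sketched below.

```python
import numpy as np

def translation_rmse(req_pos, est_pos):
    """RMSE between requested and recovered camera positions, shape (T, 3)."""
    return float(np.sqrt(np.mean(np.sum((req_pos - est_pos) ** 2, axis=1))))

def rotation_geodesic_deg(req_rot, est_rot):
    """Mean geodesic angle (degrees) between requested and recovered
    camera rotation matrices, shape (T, 3, 3)."""
    # Relative rotation R_rel = R_req^T @ R_est; its angle follows from the trace.
    rel = np.einsum('tij,tik->tjk', req_rot, est_rot)
    cos = (np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0
    ang = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(ang.mean())

# Identical paths give zero error on both measures
I = np.tile(np.eye(3), (8, 1, 1))
p = np.zeros((8, 3))
print(translation_rmse(p, p), rotation_geodesic_deg(I, I))
```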
Original abstract
Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CT-1, a Vision-Language-Camera model built on vision-language modules and a Diffusion Transformer that employs a Wavelet-based Regularization Loss in the frequency domain to estimate camera trajectories. These trajectories are integrated into a video diffusion model for camera-controllable video generation. The work introduces a data curation pipeline and the CT-200K dataset (>47M frames) to train the model, claiming that the framework bridges spatial reasoning and video synthesis to produce faithful videos with a 25.7% improvement in camera control accuracy over prior methods.
Significance. If the experimental results hold, the work could meaningfully advance camera-controllable video generation by enabling automated, intention-aligned control without manual trajectory inputs, leveraging spatial knowledge from vision-language models. The large-scale CT-200K dataset and frequency-domain regularization for trajectory learning are concrete contributions that could support future research in this area.
major comments (3)
- [Abstract] Abstract: The central claim of a 25.7% improvement in camera control accuracy is presented without any description of the accuracy metric, the specific prior methods used as baselines, statistical tests, ablation studies, or quantitative trajectory-level results (e.g., translation/rotation error). This detail is load-bearing for attributing the gain to the claimed spatial-reasoning transfer rather than other model changes.
- [Method] Method section (trajectory integration): The description states that CT-1 trajectories are 'integrated' into the video diffusion model but provides no architectural specifics on the conditioning mechanism (cross-attention, feature concatenation, or otherwise). Without this and supporting trajectory RMSE metrics on held-out data, the transfer of spatial accuracy cannot be verified.
- [Dataset] Dataset section: CT-200K is asserted to contain >47M frames, yet no quantitative statistics on trajectory diversity, camera-motion distribution, or train/test split are supplied. This information is required to evaluate whether the learned distribution supports the claim of alignment with user intentions; an illustrative sketch of such motion statistics follows this list.
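For the dataset comment above, the kind of summary the referee asks for could be produced along these lines. The function and field names are hypothetical and only illustrate motion-magnitude statistics over a collection of (frames, 3) camera-position arrays; they are not the authors' pipeline.

```python
import numpy as np

def motion_stats(trajectories):
    """Summarize camera-motion magnitude over a list of (T, 3) position arrays.

    Returns per-clip total path length and mean per-frame speed, plus a coarse
    histogram of path lengths as a quick view of trajectory diversity.
    """
    lengths, speeds = [], []
    for traj in trajectories:
        step = np.linalg.norm(np.diff(traj, axis=0), axis=1)  # per-frame displacement
        lengths.append(step.sum())
        speeds.append(step.mean())
    hist, edges = np.histogram(lengths, bins=10)
    return {"path_length": np.array(lengths),
            "mean_speed": np.array(speeds),
            "length_histogram": (hist, edges)}

# Example with synthetic clips of varying motion scale
rng = np.random.default_rng(1)
clips = [np.cumsum(rng.normal(scale=s, size=(32, 3)), axis=0) for s in (0.01, 0.05, 0.2)]
print(motion_stats(clips)["path_length"])
```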
minor comments (1)
- [Abstract] Abstract: The acronym expansion 'CT-1 (Camera Transformer 1)' is given but its precise architectural relation to standard Diffusion Transformers is not elaborated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the manuscript would benefit from greater clarity and detail in the areas highlighted and will revise accordingly. Point-by-point responses follow.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim of a 25.7% improvement in camera control accuracy is presented without any description of the accuracy metric, the specific prior methods used as baselines, statistical tests, ablation studies, or quantitative trajectory-level results (e.g., translation/rotation error). This detail is load-bearing for attributing the gain to the claimed spatial-reasoning transfer rather than other model changes.
  Authors: We agree the abstract is too concise on this point. In the revised manuscript we will expand it to briefly define the camera control accuracy metric (combined translation/rotation trajectory error), name the primary baselines, and note that ablations, statistical tests, and component-wise trajectory results appear in the experiments section. This will better substantiate that the reported gain arises from the spatial-reasoning transfer. revision: yes
- Referee: [Method] Method section (trajectory integration): The description states that CT-1 trajectories are 'integrated' into the video diffusion model but provides no architectural specifics on the conditioning mechanism (cross-attention, feature concatenation, or otherwise). Without this and supporting trajectory RMSE metrics on held-out data, the transfer of spatial accuracy cannot be verified.
  Authors: We will add the requested architectural details: camera trajectories predicted by CT-1 are encoded into tokens and injected into the video diffusion model's Diffusion Transformer via cross-attention at multiple layers (a minimal illustration of this conditioning pattern appears after these responses). We will also report trajectory RMSE (translation and rotation separately) on held-out data to directly demonstrate the accuracy of the transferred spatial reasoning. revision: yes
- Referee: [Dataset] Dataset section: CT-200K is asserted to contain >47M frames, yet no quantitative statistics on trajectory diversity, camera-motion distribution, or train/test split are supplied. This information is required to evaluate whether the learned distribution supports the claim of alignment with user intentions.
  Authors: We will augment the dataset section with the missing statistics: histograms and summary measures of translation/rotation distributions, quantitative diversity metrics across motion types, and explicit train/test split information (including scene-level separation). These additions will allow readers to assess coverage of user-intention-aligned trajectories. revision: yes
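The rebuttal describes injecting trajectory tokens via cross-attention but gives no equations. The sketch below is a minimal NumPy rendering of that conditioning pattern, assuming per-frame camera poses are linearly embedded into tokens that serve as keys/values for the video tokens' queries; all dimensions and projection names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def camera_cross_attention(video_tokens, cam_poses, d_model=64, seed=0):
    """Illustrative cross-attention from video tokens to camera-trajectory tokens.

    video_tokens: (N, d_model) latent video tokens (queries)
    cam_poses:    (T, 12) per-frame camera pose (flattened 3x4 extrinsics),
                  embedded into trajectory tokens that act as keys/values.
    """
    rng = np.random.default_rng(seed)
    W_embed = rng.normal(scale=0.02, size=(cam_poses.shape[1], d_model))
    W_q = rng.normal(scale=0.02, size=(d_model, d_model))
    W_k = rng.normal(scale=0.02, size=(d_model, d_model))
    W_v = rng.normal(scale=0.02, size=(d_model, d_model))

    traj_tokens = cam_poses @ W_embed           # (T, d_model) trajectory tokens
    q = video_tokens @ W_q                      # (N, d_model)
    k = traj_tokens @ W_k                       # (T, d_model)
    v = traj_tokens @ W_v                       # (T, d_model)
    attn = softmax(q @ k.T / np.sqrt(d_model))  # (N, T) video-to-trajectory attention
    return video_tokens + attn @ v              # residual injection of camera context

# Example: 128 video tokens attending to a 16-frame camera path
out = camera_cross_attention(np.random.default_rng(1).normal(size=(128, 64)),
                             np.random.default_rng(2).normal(size=(16, 12)))
print(out.shape)  # (128, 64)
```

In a multi-layer Diffusion Transformer, a block like this would typically be repeated at several depths, which is consistent with the rebuttal's "cross-attention at multiple layers" phrasing.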
Circularity Check
No circularity: claims rest on experimental outcomes from trained models and curated data
Full rationale
The paper introduces CT-1 as a vision-language model that estimates camera trajectories via a Diffusion Transformer backbone and a wavelet regularization loss, then integrates those trajectories into a video diffusion model for controllable generation. All load-bearing claims (25.7% accuracy improvement, faithful camera control) are presented as results of training on the CT-200K dataset and empirical evaluation; no equations, derivations, or self-referential definitions appear that would make any prediction equivalent to its inputs by construction. The framework builds on standard vision-language and diffusion components without invoking uniqueness theorems, self-citations for ansatzes, or renaming of known results as novel organization. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Vision-language modules can accurately estimate complex camera trajectories from visual and textual inputs.
- ad hoc to paper: Wavelet-based regularization in the frequency domain effectively captures complex camera trajectory distributions.
invented entities (2)
- CT-1 (Camera Transformer 1) model: no independent evidence
- CT-200K dataset: no independent evidence
Reference graph
Works this paper leans on
- [1] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. ReCamMaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025.
- [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [4] Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025.
- [5] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. In CVPR, pages 7310–7320, 2024.
- [6] Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, and Ziwei Liu. Large language models are visual reasoning coordinators. NeurIPS, 36:70115–70140, 2023.
- [7] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR, pages 14084–14093, 2022.
- [8] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
- [9] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for video diffusion models. In ICLR, 2025.
- [10] Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. CameraCtrl II: Dynamic scene exploration via camera-controlled video diffusion models. arXiv preprint arXiv:2503.10592, 2025.
- [11] Yunzhong Hou, Liang Zheng, and Philip Torr. Learning camera movement control from real-world drone videos. arXiv preprint arXiv:2412.09620, 2024.
- [12] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In CVPR, pages 21807–21818, 2024.
- [13] jovianzm. Pexels-400k. https://huggingface.co/datasets/jovianzm/Pexels-400k, January 2025. Accessed: 2025-03-07.
- [14] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
- [15] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
- [16] Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, and Mohit Bansal. Training-free guidance in text-to-video generation via multimodal planning and structured noise initialization. arXiv preprint arXiv:2504.08641, 2025.
- [17] Shengyu Li, Xingxing Li, Shuolong Chen, Yuxuan Zhou, and Shiwen Wang. Two-step lidar/camera/IMU spatial and temporal calibration based on continuous-time trajectory estimation. IEEE Transactions on Industrial Electronics, 71(3):3182–3191, 2023.
- [18] Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Yu Tong Tiffany Ling, Yuhan Huang, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, and Deva Ramanan. Towards understanding camera motions in any video. In NeurIPS Datasets and Benchmarks Track, 2025.
- [19] Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. ChatCam: Empowering camera control through conversational AI. NeurIPS, 37:54483–54506, 2024.
- [20] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. NeurIPS, 36:46212–46244, 2023.
- [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [22] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.
- [23] Heqian Qiu, Zhaofeng Shi, Lanxiao Wang, Huiyu Xiong, Xiang Li, and Hongliang Li. EgoMe: A new dataset and challenge for following me via egocentric view in real world. arXiv preprint arXiv:2501.19061, 2025.
- [24] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. GEN3C: 3D-informed world-consistent video generation with precise camera control. In CVPR, pages 6121–6132, 2025.
- [25] Chris Rockwell, Joseph Tung, Tsung-Yi Lin, Ming-Yu Liu, David F. Fouhey, and Chen-Hsuan Lin. Dynamic camera poses and where to find them. In CVPR, pages 12444–12455, 2025.
- [26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [27] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [28] Cong Wang, Panwen Hu, Haoyu Zhao, Yuanfan Guo, Jiaxi Gu, Xiao Dong, Jianhua Han, Hang Xu, and Xiaodan Liang. UniAdapter: All-in-one control for flexible video generation. TCSVT, 2025.
- [29] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, pages 5294–5306, 2025.
- [30] Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. TRAM: Global trajectory and motion of 3D humans from in-the-wild videos. In ECCV, pages 467–487. Springer, 2024.
- [31]
- [32] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. MotionCtrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH, pages 1–11, 2024.
- [33] Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language model. In ECCV, pages 408–424. Springer, 2024.
- [34] Peiran Wu, Yunze Liu, Miao Liu, and Junxiao Shen. ST-Think: How multimodal large language models reason about 4D worlds from ego-centric videos. arXiv preprint arXiv:2503.12542, 2025.
- [35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [36] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [37] Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In SIGGRAPH Asia, pages 1–11, 2025.
- [38] Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. TrajectoryCrafter: Redirecting camera trajectory for monocular videos via diffusion models. In ICCV, pages 100–111, 2025.
- [39] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
- [40] Haoyu Zhao, Jiaxi Gu, Haoran Chen, Qingping Zheng, Yeying Jin, Hongyi Yang, Junqi Cheng, Yuang Zhang, Zenghui Lu, Huan Yu, Jie Jiang, Peng Shu, and Zuxuan Wu. CameraNoise: Learning precise camera control with video diffusion in noise space. https://openreview.net/forum?id=TT3gmYaqyc, 2025. Accessed: 2025-09-14.
- [41] Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Qingping Zheng, Zuxuan Wu, Hang Xu, and Yu-Gang Jiang. MagDiff: Multi-alignment diffusion for high-fidelity video generation and editing. In ECCV, pages 205–221, 2025.
- [42] Haoyu Zhao, Jiaxi Gu, Shicong Wang, Tianyi Lu, Xing Zhang, Zuxuan Wu, Hang Xu, and Yu-Gang Jiang. LSTD: Long short-term temporal diffusion for video generation. TMM, doi:10.1109/TMM.2026.3651052, 2026.
- [43] Qingping Zheng, Bo Huang, Yang Liu, Haoyu Zhao, Ling Zheng, Zengmao Wang, Ying Li, and Jiankang Deng. RefocusEraser: Refocusing for small object removal with robust context-shadow repair. In ICLR.
- [44] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.