pith. machine review for the scientific record. sign in

arxiv: 2604.09201 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera-controllable video generationvision-language modelscamera trajectory estimationvideo diffusion modelsspatial reasoningwavelet regularizationCT-200K dataset
0
0 comments X

The pith

CT-1 transfers spatial reasoning from vision-language inputs to predict camera trajectories for controlled video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CT-1, a model that estimates camera trajectories from vision and language inputs to guide video synthesis models. It trains on a curated dataset of over 47 million frames using a wavelet-based regularization loss in the frequency domain to capture realistic trajectory patterns. These trajectories integrate into a video diffusion model to produce camera movements that align with user intentions more accurately than text prompts or manual parameters allow. A reader would care because this removes the need for imprecise descriptions or labor-intensive setup in automated video creation. Experiments report a 25.7 percent gain in control accuracy along with more faithful and high-quality outputs.

Core claim

CT-1, built upon vision-language modules and a Diffusion Transformer model, employs a Wavelet-based Regularization Loss in the frequency domain to learn complex camera trajectory distributions from the CT-200K dataset. These estimated trajectories integrate into a video diffusion model to enable spatially aware camera control that aligns with user intentions, bridging spatial reasoning with video synthesis to yield faithful and high-quality camera-controllable videos.

What carries the argument

CT-1 model that uses vision-language modules and Wavelet-based Regularization Loss to estimate camera trajectories for integration into video diffusion models.

Load-bearing premise

That trajectory estimates produced by the vision-language model after wavelet regularization will translate into video outputs whose camera movements match the user's intended spatial paths when inserted into the diffusion model.

What would settle it

A benchmark evaluation on videos with complex or out-of-distribution camera movements where the generated results show no accuracy gain or visible misalignment with the requested paths would falsify the claim that the transfer produces reliable control.

read the original abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions. To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes CT-1, a Vision-Language-Camera model built on vision-language modules and a Diffusion Transformer that employs a Wavelet-based Regularization Loss in the frequency domain to estimate camera trajectories. These trajectories are integrated into a video diffusion model for camera-controllable video generation. The work introduces a data curation pipeline and the CT-200K dataset (>47M frames) to train the model, claiming that the framework bridges spatial reasoning and video synthesis to produce faithful videos with a 25.7% improvement in camera control accuracy over prior methods.

Significance. If the experimental results hold, the work could meaningfully advance camera-controllable video generation by enabling automated, intention-aligned control without manual trajectory inputs, leveraging spatial knowledge from vision-language models. The large-scale CT-200K dataset and frequency-domain regularization for trajectory learning are concrete contributions that could support future research in this area.

major comments (3)
  1. [Abstract] Abstract: The central claim of a 25.7% improvement in camera control accuracy is presented without any description of the accuracy metric, the specific prior methods used as baselines, statistical tests, ablation studies, or quantitative trajectory-level results (e.g., translation/rotation error). This detail is load-bearing for attributing the gain to the claimed spatial-reasoning transfer rather than other model changes.
  2. [Method] Method section (trajectory integration): The description states that CT-1 trajectories are 'integrated' into the video diffusion model but provides no architectural specifics on the conditioning mechanism (cross-attention, feature concatenation, or otherwise). Without this and supporting trajectory RMSE metrics on held-out data, the transfer of spatial accuracy cannot be verified.
  3. [Dataset] Dataset section: CT-200K is asserted to contain >47M frames, yet no quantitative statistics on trajectory diversity, camera-motion distribution, or train/test split are supplied. This information is required to evaluate whether the learned distribution supports the claim of alignment with user intentions.
minor comments (1)
  1. [Abstract] Abstract: The acronym expansion 'CT-1 (Camera Transformer 1)' is given but its precise architectural relation to standard Diffusion Transformers is not elaborated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript would benefit from greater clarity and detail in the areas highlighted and will revise accordingly. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of a 25.7% improvement in camera control accuracy is presented without any description of the accuracy metric, the specific prior methods used as baselines, statistical tests, ablation studies, or quantitative trajectory-level results (e.g., translation/rotation error). This detail is load-bearing for attributing the gain to the claimed spatial-reasoning transfer rather than other model changes.

    Authors: We agree the abstract is too concise on this point. In the revised manuscript we will expand it to briefly define the camera control accuracy metric (combined translation/rotation trajectory error), name the primary baselines, and note that ablations, statistical tests, and component-wise trajectory results appear in the experiments section. This will better substantiate that the reported gain arises from the spatial-reasoning transfer. revision: yes

  2. Referee: [Method] Method section (trajectory integration): The description states that CT-1 trajectories are 'integrated' into the video diffusion model but provides no architectural specifics on the conditioning mechanism (cross-attention, feature concatenation, or otherwise). Without this and supporting trajectory RMSE metrics on held-out data, the transfer of spatial accuracy cannot be verified.

    Authors: We will add the requested architectural details: camera trajectories predicted by CT-1 are encoded into tokens and injected into the video diffusion model's Diffusion Transformer via cross-attention at multiple layers. We will also report trajectory RMSE (translation and rotation separately) on held-out data to directly demonstrate the accuracy of the transferred spatial reasoning. revision: yes

  3. Referee: [Dataset] Dataset section: CT-200K is asserted to contain >47M frames, yet no quantitative statistics on trajectory diversity, camera-motion distribution, or train/test split are supplied. This information is required to evaluate whether the learned distribution supports the claim of alignment with user intentions.

    Authors: We will augment the dataset section with the missing statistics: histograms and summary measures of translation/rotation distributions, quantitative diversity metrics across motion types, and explicit train/test split information (including scene-level separation). These additions will allow readers to assess coverage of user-intention-aligned trajectories. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on experimental outcomes from trained models and curated data

full rationale

The paper introduces CT-1 as a vision-language model that estimates camera trajectories via a Diffusion Transformer backbone and a wavelet regularization loss, then integrates those trajectories into a video diffusion model for controllable generation. All load-bearing claims (25.7% accuracy improvement, faithful camera control) are presented as results of training on the CT-200K dataset and empirical evaluation; no equations, derivations, or self-referential definitions appear that would make any prediction equivalent to its inputs by construction. The framework builds on standard vision-language and diffusion components without invoking uniqueness theorems, self-citations for ansatzes, or renaming of known results as novel organization. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review provides limited visibility into technical details; the central claim rests on the assumption that vision-language modules can reliably extract camera trajectories and that the wavelet loss helps model their distributions.

axioms (2)
  • domain assumption Vision-language modules can accurately estimate complex camera trajectories from visual and textual inputs.
    This is the core transfer mechanism stated in the abstract.
  • ad hoc to paper Wavelet-based regularization in the frequency domain effectively captures complex camera trajectory distributions.
    Introduced as a key training component without prior justification in the abstract.
invented entities (2)
  • CT-1 (Camera Transformer 1) model no independent evidence
    purpose: Specialized architecture to transfer spatial reasoning to camera trajectory prediction for video generation.
    New model proposed and built upon vision-language modules and Diffusion Transformer.
  • CT-200K dataset no independent evidence
    purpose: Large-scale training data containing over 47M frames for CT-1.
    Constructed via a dedicated data curation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5541 in / 1432 out tokens · 44997 ms · 2026-05-10T16:49:23.738739+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 20 canonical work pages · 12 internal anchors

  1. [1]

    Recammaster: Camera-controlled generative rendering from a single video.arXiv preprint arXiv:2503.11647, 2025

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video.arXiv preprint arXiv:2503.11647, 2025

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXivpreprintarXiv:2511.21631, 2025

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXivpreprintarXiv:2502.13923, 2025

  4. [4]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXivpreprintarXiv:2511.22699, 2025

  5. [5]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InCVPR, pages 7310–7320, 2024

  6. [6]

    Large language models are visual reasoning coordinators.NeurIPS, 36:70115–70140, 2023

    Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, and Ziwei Liu. Large language models are visual reasoning coordinators.NeurIPS, 36:70115–70140, 2023

  7. [7]

    Learning to prompt for open-vocabulary object detection with vision-language model

    Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. InCVPR, pages 14084–14093, 2022

  8. [8]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  9. [9]

    Cameractrl: Enabling camera control for video diffusion models

    HaoHe,YinghaoXu,YuweiGuo,GordonWetzstein,BoDai,HongshengLi,andCeyuanYang. Cameractrl: Enabling camera control for video diffusion models. InICLR, 2025

  10. [10]

    Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025

    Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models.arXiv preprint arXiv:2503.10592, 2025

  11. [11]

    Learning camera movement control from real-world drone videos

    Yunzhong Hou, Liang Zheng, and Philip Torr. Learning camera movement control from real-world drone videos. arXiv preprintarXiv:2412.09620, 2024

  12. [12]

    Vbench: Comprehensive benchmark suite for video generative models

    ZiqiHuang, YinanHe, JiashuoYu, FanZhang, ChenyangSi, YumingJiang, YuanhanZhang, TianxingWu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, pages 21807–21818, 2024

  13. [13]

    Pexels-400k

    jovianzm. Pexels-400k. https://huggingface.co/datasets/jovianzm/Pexels-400k, jan 2025. Accessed: 2025-03- 07

  14. [14]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXivpreprint arXiv:2406.09246, 2024

  15. [15]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXivpreprintarXiv:2502.19645, 2025

  16. [16]

    Training-free guidance in text-to-video generation via multimodal planning and structured noise initialization.arXivpreprint arXiv:2504.08641, 2025

    Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, and Mohit Bansal. Training-free guidance in text-to-video generation via multimodal planning and structured noise initialization.arXivpreprint arXiv:2504.08641, 2025

  17. [17]

    Two-step lidar/camera/imu spatial and temporal calibration based on continuous-time trajectory estimation.IEEETransactionsonIndustrialElectronics, 71 (3):3182–3191, 2023

    Shengyu Li, Xingxing Li, Shuolong Chen, Yuxuan Zhou, and Shiwen Wang. Two-step lidar/camera/imu spatial and temporal calibration based on continuous-time trajectory estimation.IEEETransactionsonIndustrialElectronics, 71 (3):3182–3191, 2023

  18. [18]

    Towards understanding camera motions in any video

    Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Yu Tong Tiffany Ling, Yuhan Huang, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, and Deva Ramanan. Towards understanding camera motions in any video. InNeurIPSDBTrack, 2025. 13

  19. [19]

    Chatcam: Empowering camera control through conversational ai

    Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. Chatcam: Empowering camera control through conversational ai. NeurIPS, 37:54483–54506, 2024

  20. [20]

    Egoschema: A diagnostic benchmark for very long-form video language understanding.NeurIPS, 36:46212–46244, 2023

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding.NeurIPS, 36:46212–46244, 2023

  21. [21]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprintarXiv:2304.07193, 2023

  22. [22]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4195–4205, 2023

  23. [23]

    arXiv preprint arXiv:2501.19061 (2025)

    Heqian Qiu, Zhaofeng Shi, Lanxiao Wang, Huiyu Xiong, Xiang Li, and Hongliang Li. Egome: A new dataset and challenge for following me via egocentric view in real world.arXivpreprintarXiv:2501.19061, 2025

  24. [24]

    Gen3c: 3d-informedworld-consistentvideogenerationwithprecisecameracontrol

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller,SanjaFidler,andJunGao. Gen3c: 3d-informedworld-consistentvideogenerationwithprecisecameracontrol. In CVPR, pages 6121–6132, 2025

  25. [25]

    Dynamic camera poses and where to find them

    Chris Rockwell, Joseph Tung, Tsung-Yi Lin, Ming-Yu Liu, David F Fouhey, and Chen-Hsuan Lin. Dynamic camera poses and where to find them. InCVPR, pages 12444–12455, 2025

  26. [26]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, SoumyaBatra,PrajjwalBhargava,ShrutiBhosale,etal. Llama2: Openfoundationandfine-tunedchatmodels. arXiv preprint arXiv:2307.09288, 2023

  27. [27]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprintarXiv:2503.20314, 2025

  28. [28]

    Uniadapter: All-in-one control for flexible video generation.TCSVT, 2025

    Cong Wang, Panwen Hu, Haoyu Zhao, Yuanfan Guo, Jiaxi Gu, Xiao Dong, Jianhua Han, Hang Xu, and Xiaodan Liang. Uniadapter: All-in-one control for flexible video generation.TCSVT, 2025

  29. [29]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, pages 5294–5306, 2025

  30. [30]

    Tram: Global trajectory and motion of 3d humans from in-the-wild videos

    Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos. InECCV, pages 467–487. Springer, 2024

  31. [31]

    YuqiWang,KeCheng,JiaweiHe,QitaiWang,HengchenDai,YuntaoChen,FeiXia,andZhaoxiangZhang.Drivingdojo dataset: Advancing interactive and knowledge-enriched driving world model.arXiv preprint arXiv:2410.10738, 2024

  32. [32]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACMSIGGRAPH, pages 1–11, 2024

  33. [33]

    Vary: Scaling up the vision vocabulary for large vision-language model

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language model. InECCV, pages 408–424. Springer, 2024

  34. [34]

    St-think: How multimodal large language models reason about 4d worlds from ego-centric videos.arXiv preprint arXiv:2503.12542, 2025

    Peiran Wu, Yunze Liu, Miao Liu, and Junxiao Shen. St-think: How multimodal large language models reason about 4d worlds from ego-centric videos.arXiv preprintarXiv:2503.12542, 2025

  35. [35]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXivpreprint arXiv:2505.09388, 2025

  36. [36]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  37. [37]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval

    Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. InSIGGRAPH Asia, pages 1–11, 2025

  38. [38]

    Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. InICCV, pages 100–111, 2025. 14

  39. [39]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023

  40. [40]

    Cameranoise: Learning precise camera control with video diffusion in noise space.https://openreview.net/forum?id=TT3gmYaqyc, 2025

    Haoyu Zhao, Jiaxi Gu, Haoran Chen, Qingping Zheng, Yeying Jin, Hongyi Yang, Junqi Cheng, Yuang Zhang, Zenghui Lu, Huan Yu, Jie Jiang, Peng Shu, and Zuxuan Wu. Cameranoise: Learning precise camera control with video diffusion in noise space.https://openreview.net/forum?id=TT3gmYaqyc, 2025. Accessed: 2025-09-14

  41. [41]

    Magdiff: Multi-alignment diffusion for high-fidelity video generation and editing

    Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Qingping Zheng, Zuxuan Wu, Hang Xu, and Yu-Gang Jiang. Magdiff: Multi-alignment diffusion for high-fidelity video generation and editing. InECCV, pages 205–221, 2025

  42. [42]

    Lstd: Long short-term temporal diffusion for video generation.TMM, page doi: 10.1109/TMM.2026.3651052, 2026

    Haoyu Zhao, Jiaxi Gu, Shicong Wang, Tianyi Lu, Xing Zhang, Zuxuan Wu, Hang Xu, and Yu-Gang Jiang. Lstd: Long short-term temporal diffusion for video generation.TMM, page doi: 10.1109/TMM.2026.3651052, 2026

  43. [43]

    Refocuseraser: Refocusing for small object removal with robust context-shadow repair

    Qingping Zheng, Bo Huang, Yang Liu, Haoyu Zhao, Ling Zheng, Zengmao Wang, Ying Li, and Jiankang Deng. Refocuseraser: Refocusing for small object removal with robust context-shadow repair. InICLR

  44. [44]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprintarXiv:1805.09817, 2018. 15 6 Appendix 6.1 More Implementation Details Evaluationmetrics.ThecameratrajectoriespredictedbyourproposedCT-1modelareinherentlynon-unique. This means that multiple trajector...