pith. machine review for the scientific record.

arxiv: 2604.07990 · v2 · submitted 2026-04-09 · 💻 cs.CV

Recognition: unknown

SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Christian Rupprecht, David Novotny, Jianyuan Wang, Kecheng Zheng, Minghao Chen, Wenjun Zeng, Xing Zhu, Xin Jin, Yinghao Xu, Yujun Shen, Yunnan Wang

Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords large-scale video dataset · geometric annotations · depth maps · camera parameters · 3D point tracking · text-to-video synthesis · scene reconstruction · multi-modal video data

The pith

SceneScribe-1M supplies one million videos with text descriptions, camera parameters, dense depth maps, and consistent 3D point tracks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SceneScribe-1M to fill the gap between video datasets that support only 3D geometric tasks and those that support only semantic or generative tasks. It consists of one million real-world videos, each carrying detailed text captions along with geometric labels that include camera intrinsics and extrinsics, per-frame depth, and temporally consistent 3D point tracks. These annotations let the same data serve as training and test material for perception problems such as depth estimation, scene reconstruction, and point tracking, as well as for generative problems such as text-to-video synthesis that can be conditioned on camera motion. A reader would care because a single high-quality source could remove the need to stitch together mismatched datasets when building models that must both understand and create dynamic 3D video.
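To make the multi-modal structure concrete, here is a minimal sketch of what a single record might look like and how the same sample feeds both task families. All field names, shapes, and conventions are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical sketch of one SceneScribe-1M record; every field name and
# shape here is an assumption for illustration, not the released format.
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneSample:
    frames: np.ndarray        # (T, H, W, 3) uint8 RGB video frames
    caption: str              # detailed text description
    intrinsics: np.ndarray    # (T, 3, 3) per-frame camera intrinsics K
    extrinsics: np.ndarray    # (T, 4, 4) world-to-camera poses [R|t]
    depth: np.ndarray         # (T, H, W) per-frame dense depth, in meters
    tracks_3d: np.ndarray     # (N, T, 3) world-space 3D point tracks
    track_vis: np.ndarray     # (N, T) per-frame visibility flags

T, H, W, N = 8, 48, 64, 32
sample = SceneSample(
    frames=np.zeros((T, H, W, 3), dtype=np.uint8),
    caption="a person walks past a parked car",
    intrinsics=np.tile(np.eye(3), (T, 1, 1)),
    extrinsics=np.tile(np.eye(4), (T, 1, 1)),
    depth=np.ones((T, H, W)),
    tracks_3d=np.zeros((N, T, 3)),
    track_vis=np.ones((N, T), dtype=bool),
)

# The same record supervises perception (depth and tracks as targets) and
# generation (caption plus camera poses as conditioning signals).
depth_target = sample.depth                          # e.g., monocular depth estimation
conditioning = (sample.caption, sample.extrinsics)   # e.g., camera-controlled T2V
```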

Core claim

We introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control.

What carries the argument

The SceneScribe-1M dataset itself, which pairs each of one million videos with a complete set of semantic text descriptions and geometric annotations consisting of camera parameters, dense depth maps, and temporally consistent 3D point tracks.

If this is right

  • Monocular depth estimation models can be trained and tested on a much larger and more diverse collection of real video sequences than before.
  • Scene reconstruction methods gain direct access to dense per-frame depth and long-term 3D point correspondences for improved accuracy.
  • Dynamic point tracking algorithms receive consistent 3D tracks that span entire videos rather than short clips.
  • Text-to-video generators can be conditioned on explicit camera trajectories derived from the provided parameters (a minimal encoding sketch follows this list).
  • Perception and generation research can share the same data source and therefore compare results on identical video content.
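As a concrete instance of the camera-conditioning bullet above, the sketch below normalizes per-frame extrinsics to the first frame and flattens each relative pose into a per-frame vector. The world-to-camera convention and the flat [R|t] encoding are assumptions; published methods such as CameraCtrl [20] use richer embeddings (e.g., Plücker rays).

```python
# Minimal sketch: turning per-frame extrinsics into a camera-trajectory
# conditioning signal for a text-to-video model. The normalization to the
# first frame and the flattened [R|t] encoding are illustrative choices.
import numpy as np

def camera_condition(extrinsics: np.ndarray) -> np.ndarray:
    """extrinsics: (T, 4, 4) world-to-camera poses -> (T, 12) conditioning."""
    w2c0_inv = np.linalg.inv(extrinsics[0])
    rel = extrinsics @ w2c0_inv          # pose of each frame relative to frame 0
    return rel[:, :3, :].reshape(len(extrinsics), 12)  # flatten [R|t] per frame

T = 16
poses = np.tile(np.eye(4), (T, 1, 1))
poses[:, 0, 3] = np.linspace(0.0, 1.0, T)    # a simple 1 m lateral dolly
cond = camera_condition(poses)               # (16, 12), fed alongside the text prompt
print(cond.shape)
```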

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The joint availability of semantic and geometric labels may encourage training regimes that enforce geometric consistency inside video generation networks.
  • Researchers could derive additional self-supervision signals by checking that generated videos respect the same camera and depth constraints present in the dataset (see the reprojection sketch after this list).
  • The scale of consistent 3D tracks could support new forms of long-range video understanding that current short-clip datasets cannot address.
  • Similar annotation pipelines might later be applied to other large video collections to expand coverage beyond the current one million sequences.
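A minimal sketch of the consistency check suggested above: unproject a pixel with its annotated depth and camera, reproject it into a second frame, and measure the residual against the observed location. The pinhole model and world-to-camera extrinsics are assumed conventions, not the paper's stated ones.

```python
# Sketch of a depth/camera reprojection consistency check. Conventions
# (world-to-camera extrinsics, pinhole intrinsics) are assumptions.
import numpy as np

def reproject(uv, depth, K, w2c_src, w2c_dst):
    """Map pixel uv (2,) with depth in the source frame into the destination frame."""
    x_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0]) * depth   # source camera coords
    x_world = np.linalg.inv(w2c_src) @ np.append(x_cam, 1.0)           # world coords
    x_dst = (w2c_dst @ x_world)[:3]                                    # destination camera coords
    p = K @ (x_dst / x_dst[2])                                         # perspective projection
    return p[:2]

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
w2c_a, w2c_b = np.eye(4), np.eye(4)
w2c_b[0, 3] = -0.1   # destination camera translated along x
uv_b = reproject(np.array([320., 240.]), depth=2.0, K=K, w2c_src=w2c_a, w2c_dst=w2c_b)
# The residual between uv_b and the tracked 2D location in the destination
# frame flags annotations (or generated frames) that violate the geometry.
print(uv_b)
```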

Load-bearing premise

The camera parameters, depth maps, and 3D point tracks supplied for every video are accurate and consistent enough to serve as reliable supervision and evaluation targets.

What would settle it

A measurement showing that the supplied depth maps or camera parameters deviate substantially from independent ground-truth sensors on a held-out set of videos, or a finding that models trained on these annotations show no improvement over models trained on existing smaller datasets.
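For the depth half of that test, the standard metrics are easy to pin down. This sketch computes AbsRel and the δ < 1.25 inlier ratio between supplied depth and an independent sensor; the metric definitions are standard, and the data here is a synthetic stand-in.

```python
# Standard monocular depth metrics between supplied depth and an
# independent sensor on held-out videos; synthetic stand-in data.
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    mask = gt > eps                      # evaluate only where sensor depth is valid
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    return abs_rel, delta1

rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 10.0, size=(240, 320))
pred = gt * rng.normal(1.0, 0.05, size=gt.shape)   # 5% multiplicative noise
print(depth_metrics(pred, gt))   # small AbsRel, delta1 near 1.0 -> annotations usable
```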

Figures

Figures reproduced from arXiv: 2604.07990 by Christian Rupprecht, David Novotny, Jianyuan Wang, Kecheng Zheng, Minghao Chen, Wenjun Zeng, Xing Zhu, Xin Jin, Yinghao Xu, Yujun Shen, Yunnan Wang.

Figure 1
Figure 1: SceneScribe-1M offers more than one million dynamic scenes spanning over 4,000 hours, featuring comprehensive semantic and geometric annotations (i.e., detailed description, motion masks, camera poses, continuous video depths, and dynamic tracks). It supports diverse downstream tasks (i.e., monocular depth estimation, scene reconstruction, dynamic point tracking, and pose/text-to-video generation). view at source ↗
Figure 2
Figure 2: The curation pipeline for SceneScribe-1M consists of: (a) We begin by collecting large-scale videos from various sources; (b) Raw videos undergo specification and content inspection, with temporal segmentation models employed to ensure continuity; and (c) We integrate Qwen2.5-VL-72B [6], MegaSaM [33], and TAPIP3D [68] to perform comprehensive geometric and semantic annotations. view at source ↗
Figure 3
Figure 3: Statistics of Raw Video Specification after filtering, including Resolution, Frames Per Second (FPS), and Duration. view at source ↗
Figure 4
Figure 4: Statistics of Raw Video Content after filtering. These charts demonstrate that the raw videos exhibit sufficient diversity of motion while eliminating lighting interference. view at source ↗
Figure 6
Figure 6: Statistics of Object Motion Metrics. Both object motion metrics in SceneScribe-MVS after applying the sampling strategy exhibit a greater static degree than the thresholds, demonstrating that the sampling not only facilitates effective dynamic mask generation within SceneScribe-1M but also improves control over the proportion of dynamics. Panels: (a) Distance, (b) Rotation, (c) Turn Coun… view at source ↗
Figure 7
Figure 7: Statistics of Camera Motion Metrics. The similar distributions of camera motion metrics in SceneScribe-1M and SceneScribe-MVS indicate that camera and object motion are disentangled, enabling control over object dynamics while preserving camera diversity. view at source ↗
Figure 8
Figure 8: Visualization Results of Downstream Tasks. We conduct various downstream tasks on SceneScribe-1M, i.e., MoGe [54] (monocular depth estimation), VGGT [50] (3D reconstruction), MonST3R [69] (4D reconstruction), CoTracker3 [26] (2D point tracking), SpatialTrackerV2 [59] (3D point tracking), and AC3D [3]. These results highlight the robust applicability of SceneScribe-1M in 3D perception and video generation. view at source ↗
read the original abstract

The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SceneScribe-1M, a new large-scale dataset of one million in-the-wild videos, each annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. It establishes benchmarks for downstream tasks including monocular depth estimation, scene reconstruction, dynamic point tracking, and text-to-video synthesis with or without camera control, and releases the dataset openly.

Significance. If the geometric annotations prove sufficiently accurate and consistent, the dataset would be a significant contribution by providing a unified, large-scale resource that bridges semantic video understanding and 3D geometric perception, enabling new work on controllable video generation and dynamic scene modeling. The open release of the full dataset and annotations is a clear strength that supports reproducibility.

major comments (2)
  1. [§3] §3 (Dataset Construction): The automated pipelines used to derive camera parameters, dense depth maps, and 3D point tracks from in-the-wild videos are described at a high level, but no quantitative validation protocol, error distributions, or comparison against independent ground-truth references on a held-out diverse subset is reported. This directly affects the load-bearing claim that the annotations are 'precise' and 'consistent' enough to support the listed benchmarks.
  2. [§4] §4 (Benchmarks): The reported results for monocular depth estimation, scene reconstruction, and dynamic point tracking assume the provided annotations serve as reliable supervision or evaluation targets, yet no ablation or sensitivity analysis quantifies how annotation noise (from SfM, monocular depth models, or trackers) affects the observed performance gaps versus prior datasets.
minor comments (2)
  1. [Abstract] Abstract: The phrasing 'precise camera parameters' and 'meticulously annotated' should be qualified to reflect that these quantities are estimated rather than directly measured, to avoid overstating annotation fidelity.
  2. [§2] §2 (Related Work): A more explicit comparison table contrasting SceneScribe-1M against existing video datasets (e.g., in scale, annotation types, and validation) would help readers assess the claimed gap.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of the dataset construction and benchmark evaluations.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The automated pipelines used to derive camera parameters, dense depth maps, and 3D point tracks from in-the-wild videos are described at a high level, but no quantitative validation protocol, error distributions, or comparison against independent ground-truth references on a held-out diverse subset is reported. This directly affects the load-bearing claim that the annotations are 'precise' and 'consistent' enough to support the listed benchmarks.

    Authors: We agree that the current description in §3 is high-level and that explicit quantitative validation would better support the claims of precision and consistency. The pipelines rely on established components (MegaSaM [33] for structure and motion, monocular depth estimators, and TAPIP3D [68] for 3D tracking), each of which has been validated in the literature. To address the referee's concern directly, the revised manuscript will add a new subsection in §3 that reports (1) a quantitative validation protocol, (2) error distributions (e.g., camera pose error, depth MAE, track consistency) on a held-out diverse subset of 5,000 videos, and (3) cross-validation against alternative pipelines and publicly available datasets that contain partial ground-truth geometry. We note that fully independent 3D ground truth does not exist for the majority of in-the-wild videos; therefore the added analysis will emphasize consistency metrics and proxy evaluations rather than claiming absolute ground-truth accuracy. revision: yes

  2. Referee: [§4] §4 (Benchmarks): The reported results for monocular depth estimation, scene reconstruction, and dynamic point tracking assume the provided annotations serve as reliable supervision or evaluation targets, yet no ablation or sensitivity analysis quantifies how annotation noise (from SfM, monocular depth models, or trackers) affects the observed performance gaps versus prior datasets.

    Authors: We concur that a sensitivity analysis would improve the interpretability of the benchmark results. In the revised §4 we will add an ablation study that (1) varies annotation confidence thresholds and noise injection levels, (2) measures the resulting changes in downstream task performance, and (3) compares the magnitude of these effects against the performance gaps reported versus prior datasets. This analysis will be presented alongside the existing benchmark tables to clarify the robustness of the observed improvements. revision: yes
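A rough sketch of what such a sensitivity analysis could look like (our framing, not the authors' code): inject multiplicative noise into the depth annotations at increasing levels and track how the evaluation metric drifts; reported gaps versus prior datasets that exceed this drift would survive the noise concern.

```python
# Sketch of the rebuttal's proposed noise-injection ablation; the noise
# model and levels are illustrative assumptions.
import numpy as np

def evaluate_with_noise(depth_gt: np.ndarray, sigma: float, rng) -> float:
    """Return AbsRel of annotations corrupted by multiplicative noise sigma."""
    noisy = depth_gt * rng.normal(1.0, sigma, size=depth_gt.shape)
    return float(np.mean(np.abs(noisy - depth_gt) / depth_gt))

rng = np.random.default_rng(1)
depth_gt = rng.uniform(0.5, 10.0, size=(100, 100))
for sigma in [0.0, 0.02, 0.05, 0.10]:
    err = evaluate_with_noise(depth_gt, sigma, rng)
    print(f"noise sigma={sigma:.2f} -> AbsRel={err:.3f}")
# If the cross-dataset performance gaps exceed the drift seen here, the
# benchmark conclusions are robust to annotation noise of that magnitude.
```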

Circularity Check

0 steps flagged

Dataset release paper exhibits no circularity in claims

full rationale

The paper introduces SceneScribe-1M as a new annotated video dataset without any mathematical derivations, model predictions, or first-principles results. Its claims center on the dataset's scale, annotation types (text, cameras, depth, tracks), and downstream benchmarks, none of which reduce by construction to fitted parameters or self-citations. Annotation pipelines are external tools applied to data; the paper does not claim to derive or predict the annotations from the dataset itself. Concerns about annotation accuracy are validation issues, not circularity. The work is self-contained as a data contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a dataset release paper, the work involves no mathematical derivations or fitted parameters; its claim rests on the assumption that high-quality annotations can be produced at scale.

axioms (1)
  • domain assumption High-quality, consistent annotations for camera parameters, dense depth maps, and 3D point tracks can be reliably produced for one million in-the-wild videos
    The dataset's utility depends on annotation accuracy, which is asserted but not demonstrated in the abstract.

pith-pipeline@v0.9.0 · 5536 in / 1188 out tokens · 72438 ms · 2026-05-10T17:12:39.639792+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 28 canonical work pages · 12 internal anchors

  1. [1] OpenVideo. https://github.com/UmiMarch/OpenVideo, 2023.

  2. [2] Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062, 2025.

  3. [3] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, and Sergey Tulyakov. AC3D: Analyzing and improving 3D camera control in video diffusion transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22875–22889, 2025.

  4. [4] Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, and Di Zhang. SynCamMaster: Synchronizing multi-camera video generation from diverse viewpoints. arXiv preprint arXiv:2412.07760, 2024.

  5. [5] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. ReCamMaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025.

  6. [6] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  7. [7] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  8. [8] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1:1, 2024.

  9. [9] Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 611–625, 2012.

  10. [10] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020.

  11. [11] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.

  12. [12] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70M: Captioning 70M videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13320–13331, 2024.

  13. [13] DeepMind. Genie 3: A new frontier for world models. https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models, 2024.

  14. [14] Google DeepMind. Gemini 2.0 Flash. https://deepmind.google/technologies/gemini/flash/, 2024.

  15. [15] Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. TAP-Vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems (NeurIPS), pages 13610–13626, 2022.

  16. [16] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In Proceedings of the International Conference on Robotics and Automation (ICRA), pages 2553–2560, 2022.

  17. [17] Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control. arXiv preprint arXiv:2506.01943, 2025.

  18. [18] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3749–3761, 2022.

  19. [19] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2485–2494, 2020.

  20. [20] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. CameraCtrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.

  21. [21] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.

  22. [22] Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson W. H. Lau, Wangmeng Zuo, and Chunchao Guo. Voyager: Long-range and world-consistent video diffusion for explorable 3D scene generation. arXiv preprint arXiv:2506.04225, 2025.

  23. [23] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  24. [24] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024.

  25. [25] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13229–13239, 2023.

  26. [26] Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6013–6022, 2025.

  27. [27] Tobias Koch, Lukas Liebel, Marco Körner, and Friedrich Fraundorfer. Comparison of monocular depth estimation methods using geometrically relevant metrics on the iBims-1 dataset. Computer Vision and Image Understanding (CVIU), 191:102877, 2020.

  28. [28] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  29. [29] Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, Joao Carreira, Andrew Zisserman, Gabriel Brostow, and Carl Doersch. TAPVid-3D: A benchmark for tracking any point in 3D. Advances in Neural Information Processing Systems (NeurIPS), 37:82149–82165, 2024.

  30. [30] Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. OmniNWM: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025.

  31. [31] Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. DriveVLA-W0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025.

  32. [32] Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025.

  33. [33] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10486–10496, 2025.

  34. [34] Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363, 2024.

  35. [35] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics (TRO), 31:1147–1163, 2015.

  36. [36] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. OpenVid-1M: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024.

  37. [37] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10106–10116, 2024.

  38. [38] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10901–10911, 2021.

  39. [39] Chris Rockwell, Joseph Tung, Tsung-Yi Lin, Ming-Yu Liu, David F. Fouhey, and Chen-Hsuan Lin. Dynamic camera poses and where to find them. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12444–12455, 2025.

  40. [40] Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.

  41. [41] Thomas Schops, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle adjusted direct RGB-D SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 134–144, 2019.

  42. [42] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 746–760, 2012.

  43. [43] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

  44. [44] Tomás Soucek and Jakub Lokoc. TransNet V2: An effective deep network architecture for fast shot transition detection. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 11218–11221, 2024.

  45. [45] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Advances in Neural Information Processing Systems (NeurIPS), pages 16558–16569, 2021.

  46. [46] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. Advances in Neural Information Processing Systems (NeurIPS), pages 39033–39051, 2023.

  47. [47] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In Proceedings of the International Conference on 3D Vision (3DV), pages 11–20, 2017.

  48. [48] Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, et al. DIODE: A dense indoor and outdoor depth dataset. arXiv preprint arXiv:1908.00463, 2019.

  49. [49] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models, 2025.

  50. [50] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025.

  51. [51] Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. SpatialVID: A large-scale video dataset with spatial annotations. arXiv preprint arXiv:2509.09676, 2025.

  52. [52] Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, et al. Koala-36M: A large-scale video dataset improving consistency between fine-grained conditions and video content. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8428–8437, 2025.

  53. [53] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10510–10522, 2025.

  54. [54] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5261–5271, 2025.

  55. [55] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.

  56. [56] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916, 2020.

  57. [57] Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, and Xin Jin. Scene graph disentanglement and composition for generalizable complex image generation. Advances in Neural Information Processing Systems (NeurIPS), 37:98478–98504, 2024.

  58. [58] Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. SpatialTracker: Tracking any 2D pixels in 3D space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20406–20417, 2024.

  59. [59] Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. SpatialTrackerV2: 3D point tracking made easy. arXiv preprint arXiv:2507.12462, 2025.

  60. [60] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. DynamiCrafter: Animating open-domain images with video diffusion priors. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2024.

  61. [61] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5288–5296, 2016.

  62. [62] Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, and Jun Huang. EasyAnimate: A high-performance long video generation method based on transformer architecture. arXiv preprint arXiv:2405.18991, 2024.

  63. [63] Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5036–5045, 2022.

  64. [64] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21924–21935, 2025.

  65. [65] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10371–10381, 2024.

  66. [66] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1790–1799, 2020.

  67. [67] Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Rui-Jie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. ChronoMagic-Bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation. Advances in Neural Information Processing Systems (NeurIPS), pages 21236–21270, 2024.

  68. [68] Bowei Zhang, Lei Ke, Adam W. Harley, and Katerina Fragkiadaki. TAPIP3D: Tracking any point in persistent 3D geometry. arXiv preprint arXiv:2504.14717, 2025.

  69. [69] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.

  70. [70] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. FLARE: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21936–21947, 2025.

  71. [71] Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, and Lijuan Wang. GenXD: Generating any 3D and 4D scenes. arXiv preprint arXiv:2411.02319, 2024.

  72. [72] Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J. Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19855–19865, 2023.

  73. [73] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG), 37:1–12, 2018.

  74. [74] Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. CelebV-HQ: A large-scale video facial attributes dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 650–667, 2022.