ViPE: Video Pose Engine for 3D Geometric Perception
Pith reviewed 2026-05-16 16:36 UTC · model grok-4.3
The pith
ViPE estimates camera poses and near-metric depth maps from any raw video without calibration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViPE is a versatile video processing engine that efficiently estimates camera intrinsics, camera motion, and dense near-metric depth maps from unconstrained raw videos. It remains robust across dynamic selfie videos, cinematic shots, and dashcams while supporting pinhole, wide-angle, and 360-degree panorama camera models. On standard benchmarks, ViPE outperforms existing uncalibrated pose estimation baselines by 18 percent on TUM sequences and 50 percent on KITTI sequences.
What carries the argument
The ViPE engine, a unified pipeline that jointly solves for intrinsics, motion, and dense depth from uncalibrated video input.
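To make the object of this claim concrete, the sketch below shows the generic pinhole reprojection residual that any joint intrinsics/pose/depth solver of this kind drives toward zero: a pixel is back-projected with its depth, moved by the relative camera pose, and reprojected with the shared intrinsics. This is not the paper's actual formulation (ViPE's pipeline details are not reproduced in this review); the intrinsics values, pose convention, and function names are illustrative only.

```python
# Minimal sketch of the pinhole reprojection residual a joint
# intrinsics/pose/depth solver minimizes (generic geometry, not ViPE's code).
import numpy as np

def reprojection_residual(K, T_i_to_j, pixel_i, depth_i, pixel_j_observed):
    """Residual (in pixels) between the predicted and observed match in frame j.

    K               : (3, 3) shared camera intrinsics
    T_i_to_j        : (4, 4) rigid transform taking frame-i camera coords to frame j
    pixel_i         : (2,)  pixel (u, v) in frame i
    depth_i         : float depth of that pixel along the camera-i z axis
    pixel_j_observed: (2,)  matched pixel (u, v) in frame j
    """
    # Back-project the pixel into a 3D point in camera-i coordinates.
    uv1 = np.array([pixel_i[0], pixel_i[1], 1.0])
    point_i = depth_i * (np.linalg.inv(K) @ uv1)

    # Move the point into camera-j coordinates.
    point_j = T_i_to_j[:3, :3] @ point_i + T_i_to_j[:3, 3]

    # Project into frame j with the same intrinsics.
    proj = K @ point_j
    pixel_j_pred = proj[:2] / proj[2]

    return pixel_j_pred - pixel_j_observed

# Toy usage: a 500 px focal-length camera with a 10 cm sideways baseline.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.1
r = reprojection_residual(K, T, pixel_i=(320.0, 240.0), depth_i=2.0,
                          pixel_j_observed=(345.0, 240.0))
print(r)  # ~[0, 0] when the observed match is consistent with the geometry
```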
If this is right
- Outperforms uncalibrated baselines by 18 percent on TUM and 50 percent on KITTI pose estimation.
- Annotates approximately 96 million frames with camera poses and dense depth maps.
- Supports pinhole, wide-angle, and 360-degree camera models in a single pipeline.
- Runs at 3-5 frames per second on one GPU for standard resolutions.
- Supplies large-scale annotated data from real internet videos, AI-generated content, and panoramas for spatial AI training.
Where Pith is reading between the lines
- Existing video archives could be automatically converted into training resources for many 3D vision models without manual labeling.
- Robotics and augmented-reality applications might gain easier access to metric-scale geometry from ordinary consumer footage.
- Outputs from the engine could serve as direct supervision signals inside larger end-to-end reconstruction networks.
Load-bearing premise
The engine produces reliable near-metric depth and accurate poses on diverse in-the-wild videos without per-video calibration or ground-truth supervision.
What would settle it
Independent ground-truth measurement of camera poses and depths on a fresh collection of in-the-wild videos, checked against ViPE outputs for error rates below the reported benchmark levels.
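A minimal sketch of what such a check could look like in practice, assuming the standard TUM-style protocol: rigidly align the estimated trajectory to independently measured ground truth and report the absolute trajectory error (ATE) RMSE. The paper's exact evaluation protocol and timestamp association are not reproduced here; positions are assumed to be already paired.

```python
# Sketch of an ATE check against independent ground truth (TUM-style).
import numpy as np

def align_rigid(est, gt):
    """Least-squares SE(3) alignment (Horn/Umeyama without scale) of est onto gt.

    est, gt: (N, 3) arrays of associated camera positions.
    Returns (R, t) such that R @ est[i] + t best matches gt[i].
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """Root-mean-square position error after rigid alignment."""
    R, t = align_rigid(est, gt)
    err = (est @ R.T + t) - gt
    return np.sqrt((err ** 2).sum(axis=1).mean())

# Toy usage with a synthetic trajectory and a slightly perturbed estimate.
gt = np.cumsum(np.random.default_rng(0).normal(size=(100, 3)) * 0.05, axis=0)
est = gt + np.random.default_rng(1).normal(size=gt.shape) * 0.01
print(f"ATE RMSE: {ate_rmse(est, gt):.4f} m")
```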
Original abstract
Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames -- all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ViPE, a video processing engine that estimates camera intrinsics, poses, and dense near-metric depth maps from unconstrained raw videos. It claims robustness across dynamic selfie videos, cinematic shots, dashcams, and camera models including pinhole, wide-angle, and 360° panoramas. ViPE reportedly outperforms uncalibrated pose estimation baselines by 18% on TUM and 50% on KITTI sequences, runs at 3-5 FPS on a single GPU, and is used to annotate ~96M frames from 100K internet videos, 1M AI-generated videos, and 2K panoramic videos with accurate poses and depth maps. The code and annotated dataset are open-sourced.
Significance. If the performance and annotation accuracy claims hold, ViPE would provide a practical tool and large-scale resource for 3D geometric perception in spatial AI, addressing the scarcity of consistent in-the-wild annotations. The benchmark gains on standard datasets, efficiency, and support for diverse camera models are concrete strengths; open-sourcing the engine and dataset further enhances potential impact.
major comments (2)
- [Large-scale annotation and results section] The central claim that ViPE annotates ~96M frames from unconstrained videos (internet, AI-generated, panoramic) with 'accurate' camera poses and dense depth maps lacks any reported quantitative metrics, error analysis, consistency checks, or failure-case evaluation on those videos. Only TUM and KITTI results with ground truth are quantified; this generalization is load-bearing for the dataset contribution and requires direct evidence.
- [Method section] The method for producing near-metric depth and reliable poses without per-video calibration or ground-truth supervision is not supported by derivations, equations, or ablations in the manuscript. The abstract asserts robustness to dynamic content, lighting, and non-pinhole models, but without concrete pipeline details or assumptions, the reliability on in-the-wild data cannot be assessed.
minor comments (2)
- [Abstract] The abstract uses informal phrasing such as 'handy and versatile'; replace with more formal terms like 'practical and versatile' for journal style.
- [Experiments section] Benchmark comparisons should explicitly state the uncalibrated baselines used and include error metrics (e.g., absolute trajectory error) with standard deviations for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, acknowledging where additional evidence and details are warranted, and outline the revisions we will incorporate.
Point-by-point responses
-
Referee: [Large-scale annotation and results section] The central claim that ViPE annotates ~96M frames from unconstrained videos (internet, AI-generated, panoramic) with 'accurate' camera poses and dense depth maps lacks any reported quantitative metrics, error analysis, consistency checks, or failure-case evaluation on those videos. Only TUM and KITTI results with ground truth are quantified; this generalization is load-bearing for the dataset contribution and requires direct evidence.
Authors: We agree that the manuscript would be strengthened by direct quantitative evidence on the large-scale annotations. The TUM and KITTI benchmarks demonstrate the core method's accuracy where ground truth exists, while the 96M-frame collection was validated through extensive visual inspection and temporal consistency checks that were not quantified in the text. In the revised manuscript we will add a dedicated subsection reporting quantitative consistency metrics (e.g., average pose drift over long sequences, inter-frame depth map variance, and reprojection error statistics) computed on a stratified sample of the annotated videos, together with a failure-case analysis. This will provide the requested direct evidence for the dataset contribution. revision: yes
-
Referee: [Method section] The method for producing near-metric depth and reliable poses without per-video calibration or ground-truth supervision is not supported by derivations, equations, or ablations in the manuscript. The abstract asserts robustness to dynamic content, lighting, and non-pinhole models, but without concrete pipeline details or assumptions, the reliability on in-the-wild data cannot be assessed.
Authors: We acknowledge that the current method description is high-level and would benefit from explicit technical support. The manuscript outlines the joint optimization pipeline, but we will expand the method section with the full set of equations for intrinsics estimation, pose optimization, and near-metric depth regression, including the loss terms and the key assumptions (e.g., scale anchoring via learned priors). We will also add targeted ablations that isolate performance under dynamic content, varying illumination, and non-pinhole camera models. These additions will make the reliability claims on unconstrained videos directly verifiable. revision: yes
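As one illustration of the consistency metrics promised in the first response (inter-frame depth agreement), the sketch below warps the depth map of frame t into frame t+1 using the estimated intrinsics and relative pose and measures how often the two depth estimates agree. This is a generic check under simplifying assumptions (no occlusion or dynamic-object handling), not the authors' validation code.

```python
# Sketch of an inter-frame depth consistency statistic (no ground truth needed).
import numpy as np

def depth_consistency(depth_t, depth_t1, K, T_t_to_t1, rel_thresh=0.05):
    """Fraction of valid pixels whose warped depth agrees within rel_thresh."""
    H, W = depth_t.shape
    v, u = np.mgrid[0:H, 0:W]
    ones = np.ones_like(u, dtype=np.float64)

    # Back-project every pixel of frame t to a 3D point in camera-t coords.
    pix = np.stack([u, v, ones], axis=0).reshape(3, -1)           # (3, H*W)
    pts_t = np.linalg.inv(K) @ pix * depth_t.reshape(1, -1)       # (3, H*W)

    # Move into camera-(t+1) coordinates and project.
    pts_t1 = T_t_to_t1[:3, :3] @ pts_t + T_t_to_t1[:3, 3:4]
    proj = K @ pts_t1
    z_pred = proj[2]
    u1 = np.round(proj[0] / z_pred).astype(int)
    v1 = np.round(proj[1] / z_pred).astype(int)

    # Keep projections that land inside the image with positive depth.
    valid = (z_pred > 0) & (u1 >= 0) & (u1 < W) & (v1 >= 0) & (v1 < H)
    z_obs = depth_t1[v1[valid], u1[valid]]
    rel_err = np.abs(z_pred[valid] - z_obs) / np.maximum(z_obs, 1e-6)
    return float((rel_err < rel_thresh).mean())

# Toy usage: a fronto-parallel plane at 2 m seen by a camera translating 5 cm.
K = np.array([[500.0, 0.0, 160.0], [0.0, 500.0, 120.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = -0.05                          # camera moves +5 cm along x
depth_t = np.full((240, 320), 2.0)
depth_t1 = np.full((240, 320), 2.0)      # plane depth is unchanged
print(depth_consistency(depth_t, depth_t1, K, T))  # ~1.0 for consistent maps
```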
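The second response mentions scale anchoring via learned priors without detailing the mechanism. One common way to realize such anchoring is a least-squares scale-and-shift fit of an up-to-scale depth map to a metric depth prior (e.g. the output of a monocular metric-depth network). The sketch below shows that standard fit as an assumed illustration, not the paper's actual scheme.

```python
# Sketch of scale-and-shift anchoring of relative depth to a metric prior.
import numpy as np

def fit_scale_shift(relative_depth, metric_prior, mask):
    """Solve min_{s,b} || s * d_rel + b - d_metric ||^2 over masked pixels."""
    d = relative_depth[mask].ravel()
    m = metric_prior[mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)         # (N, 2) design matrix
    (s, b), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s, b

# Toy usage: the "relative" depth is a scaled/shifted copy of the metric one.
rng = np.random.default_rng(0)
metric = rng.uniform(1.0, 10.0, size=(240, 320))
relative = (metric - 0.5) / 2.0 + rng.normal(scale=0.01, size=metric.shape)
mask = metric < 8.0                                    # e.g. trust nearby pixels
s, b = fit_scale_shift(relative, metric, mask)
print(s, b)                                            # ~2.0 and ~0.5 recovered
aligned = s * relative + b                             # near-metric depth map
```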
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents ViPE as an engineering system for video-based pose and depth estimation, with performance claims supported by benchmarks on external datasets (TUM, KITTI) possessing independent ground truth. No mathematical derivations, equations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-generated labels. The large-scale annotation of 96M frames is presented as an output application rather than a self-referential prediction, and no load-bearing self-citations, uniqueness theorems, or ansatz smuggling are invoked in the provided text. The central claims remain externally falsifiable via the cited benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 18 Pith papers
-
CalibAnyView: Beyond Single-View Camera Calibration in the Wild
A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph opti...
-
Geometric Context Transformer for Streaming 3D Reconstruction
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
-
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation ...
-
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.