pith. machine review for the scientific record.

arxiv: 2605.14615 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera calibration · multi-view · transformer · geometric consistency · perspective fields · intrinsics estimation · in-the-wild perception

The pith

CalibAnyView enables camera calibration from any number of views in the wild by enforcing cross-view geometric consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CalibAnyView, a method that handles camera calibration for one or more images from everyday, uncontrolled settings. It builds a large dataset of multi-view videos with varied cameras and scenes, then uses a transformer to predict perspective fields that feed into an optimization step for intrinsics and gravity. A sympathetic reader would care because traditional calibration requires special patterns or controlled conditions, while existing learning methods ignore consistency between multiple views, making them less reliable for real videos or robot vision. If successful, this provides a foundation for better 3D perception in natural environments.
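
To make the pipeline concrete, the sketch below shows a minimal single-view version of the final optimization stage, assuming a perspective-field representation in the style of Jin et al. (a per-pixel latitude map), a centered principal point, and no lens distortion. It is an editorial reconstruction for illustration, not the paper's actual solver.

    import numpy as np
    from scipy.optimize import least_squares

    def latitude_residuals(params, uv, lat_obs, cx, cy):
        """Difference between observed per-pixel latitudes and those implied by a
        pinhole camera with focal length f and gravity given by (pitch, roll)."""
        f, pitch, roll = params
        # Back-project pixels to unit rays in the camera frame.
        x = (uv[:, 0] - cx) / f
        y = (uv[:, 1] - cy) / f
        rays = np.stack([x, y, np.ones_like(x)], axis=1)
        rays /= np.linalg.norm(rays, axis=1, keepdims=True)
        # World "up" direction in the camera frame (image y points down at zero tilt).
        up = np.array([np.sin(roll) * np.cos(pitch),
                       -np.cos(roll) * np.cos(pitch),
                       np.sin(pitch)])
        lat_pred = np.arcsin(np.clip(rays @ up, -1.0, 1.0))
        return lat_pred - lat_obs

    def calibrate_single_view(uv, lat_obs, width, height):
        # uv: (N, 2) pixel coordinates sampled from the predicted field;
        # lat_obs: (N,) predicted latitudes in radians.
        x0 = np.array([0.8 * max(width, height), 0.0, 0.0])  # rough focal-length initialization
        result = least_squares(latitude_residuals, x0,
                               args=(uv, lat_obs, width / 2.0, height / 2.0))
        f, pitch, roll = result.x
        return f, pitch, roll

The multi-view case would add cross-view consistency terms across such per-view residuals; the referee report below presses on exactly how that coupling is specified.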

Core claim

CalibAnyView is a unified formulation that supports an arbitrary number of input views by explicitly modeling cross-view geometric consistency. It is built on a large-scale multi-view video dataset and a multi-view transformer that predicts dense perspective fields, which feed into a geometric optimization to estimate camera intrinsics and gravity direction.

What carries the argument

The multi-view transformer predicting dense perspective fields, combined with a geometric optimization framework for joint intrinsics and gravity estimation.

If this is right

  • Outperforms state-of-the-art single-view calibration methods even in single-view mode.
  • Performance improves further when multiple views are provided.
  • Supports reliable 3D reconstruction and robotic perception from in-the-wild imagery.
  • Works robustly across diverse camera models and lens distortions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Could be integrated into video-based SLAM systems for continuous calibration.
  • May enable calibration from casual phone videos without special setups.
  • Opens the door to learning-based calibration for other sensors like depth cameras if similar datasets are built.

Load-bearing premise

The constructed large-scale multi-view video dataset covers sufficient diversity of real-world camera models, dynamic scenes, motion trajectories, and lens distortions to allow generalization to new inputs.

What would settle it

Evaluating the model on a held-out test set featuring camera models or distortion patterns entirely absent from the training dataset, and measuring whether intrinsic estimates remain accurate.
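
As a sketch of how such a held-out evaluation could be scored (the metrics below are standard calibration error measures; their use here is an editorial assumption, not a protocol taken from the paper):

    import numpy as np

    def relative_focal_error(f_pred, f_gt):
        # Relative focal-length error, a common calibration accuracy metric.
        return np.abs(np.asarray(f_pred) - np.asarray(f_gt)) / np.asarray(f_gt)

    def gravity_angular_error_deg(g_pred, g_gt):
        # Angle in degrees between predicted and ground-truth gravity directions.
        g_pred = np.asarray(g_pred) / np.linalg.norm(g_pred, axis=-1, keepdims=True)
        g_gt = np.asarray(g_gt) / np.linalg.norm(g_gt, axis=-1, keepdims=True)
        cos = np.clip(np.sum(g_pred * g_gt, axis=-1), -1.0, 1.0)
        return np.degrees(np.arccos(cos))

Reporting these on cameras and distortion patterns absent from training would directly test the load-bearing premise above.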

Figures

Figures reproduced from arXiv: 2605.14615 by Boying Li, Cheng Zhang, Daniel Cremers, Hamid Rezatofighi, Ian Reid, Weirong Chen.

Figure 1: CalibAnyView calibrates cameras from arbitrary views in the wild.
Figure 2: Overview of the proposed CalibAnyView framework.
Figure 3: Overview of the dataset construction pipeline.
Figure 4: Calibration performance vs. number of views.
Figure 5: Qualitative results on Stanford2D3D (multi-view).
Figure 6: Qualitative results on the TartanAir dataset.
Figure 7: Qualitative results on the Stanford2D3D dataset.
Figure 8: Qualitative results on the MegaDepth dataset.
Figure 9: Qualitative results on the LaMAR dataset.
Figures 10–22: Qualitative results on our proposed dataset test set.
Figure 23: Prompt used for video filtering.
Figure 24: Synthesized Projected Video Frame Sequences (Part 1).
Figure 25: Synthesized Projected Video Frame Sequences (Part 2).
read the original abstract

Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CalibAnyView, a unified formulation for camera calibration supporting an arbitrary number of input views (N ≥ 1). It constructs a large-scale multi-view video dataset covering diverse camera models, dynamic scenes, motion trajectories, and lens distortions; trains a multi-view transformer to predict dense perspective fields; and integrates these into a geometric optimization framework to jointly recover intrinsics and gravity direction. The central claim is that the method consistently outperforms prior single-view state-of-the-art approaches, remains robust in single-view settings, and yields further gains under multi-view inference.

Significance. If the quantitative claims hold, the work is significant because it extends learning-based single-view calibration to enforce explicit cross-view geometric consistency without requiring controlled capture setups. The new dataset and the perspective-field-plus-optimization pipeline could serve as a practical foundation for downstream tasks such as 3D reconstruction and robotic perception in unconstrained environments.

major comments (3)
  1. [Dataset Construction] Dataset section: the assertion that the constructed multi-view video dataset sufficiently covers real-world diversity (camera models, lens distortions, trajectories) is load-bearing for the generalization claim, yet no statistical comparison (histograms, divergence metrics, or coverage statistics) is provided against independent corpora such as KITTI, nuScenes, or existing calibration benchmarks.
  2. [Experiments] Experiments section: the abstract states that CalibAnyView 'consistently outperforms state-of-the-art methods' and 'further improves with multi-view inference,' but the manuscript supplies no numerical tables, baseline implementations, ablation results on view count, or error metrics (e.g., focal-length error, distortion coefficient error, gravity angular error). Without these, the central performance claim cannot be verified.
  3. [Method] Geometric optimization: the integration step that converts predicted perspective fields into joint intrinsics and gravity estimates is described at a high level; the precise objective function, handling of N > 2 views, and any weighting between single-view and cross-view terms are not specified, making it impossible to assess whether reported multi-view gains arise from genuine consistency enforcement or from simple averaging.
minor comments (2)
  1. [Abstract] Abstract: reporting 'consistent outperformance' without any numerical values is atypical and reduces immediate readability; a single sentence summarizing key error reductions would help.
  2. [Method] Notation: the term 'perspective fields' is introduced without an early formal definition or diagram; a short equation or figure in §3 would clarify the representation before the transformer architecture is described.
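
For concreteness, minor comment 2 and major comment 3 could be addressed with an explicit formulation along the following lines; this is an editorial sketch of one plausible objective, not the paper's actual equations. With per-view latitude predictions $\hat{\ell}_i(p)$ (the perspective-field scalar at pixel $p$ of view $i$), back-projected unit rays $r(p; f_i)$ for focal length $f_i$, known relative rotations $R_i$ from the video, a shared world gravity direction $g_w$, and a robust loss $\rho$:

    $$
    E\bigl(\{f_i\}, g_w\bigr) = \sum_{i=1}^{N} \sum_{p} \rho\!\Bigl(\hat{\ell}_i(p) - \arcsin\bigl(r(p; f_i)^{\top} R_i\, g_w\bigr)\Bigr) + \lambda \sum_{i<j} \bigl(f_i - f_j\bigr)^{2}
    $$

The first term fits each view to its predicted field; the second is a cross-view consistency term (here a shared-focal-length prior) weighted by $\lambda$. Whether multi-view gains reflect genuine consistency enforcement rather than simple averaging would show up in whether this coupling term changes the minimizer.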

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript to improve clarity, completeness, and verifiability of the claims.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset section: the assertion that the constructed multi-view video dataset sufficiently covers real-world diversity (camera models, lens distortions, trajectories) is load-bearing for the generalization claim, yet no statistical comparison (histograms, divergence metrics, or coverage statistics) is provided against independent corpora such as KITTI, nuScenes, or existing calibration benchmarks.

    Authors: We agree that quantitative coverage analysis would strengthen the generalization argument. In the revised manuscript we will add histograms of focal lengths, distortion coefficients, scene dynamics, and motion trajectories, along with divergence metrics (e.g., KL divergence or Wasserstein distance) comparing our dataset against KITTI, nuScenes, and standard calibration benchmarks. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract states that CalibAnyView 'consistently outperforms state-of-the-art methods' and 'further improves with multi-view inference,' but the manuscript supplies no numerical tables, baseline implementations, ablation results on view count, or error metrics (e.g., focal-length error, distortion coefficient error, gravity angular error). Without these, the central performance claim cannot be verified.

    Authors: We will expand the experiments section to include complete numerical tables with all requested error metrics (focal-length error, distortion coefficients, gravity angular error), explicit baseline implementations, and ablation studies that vary the number of input views (N=1 to N=8). These additions will make the performance claims directly verifiable. revision: yes

  3. Referee: [Method] Geometric optimization: the integration step that converts predicted perspective fields into joint intrinsics and gravity estimates is described at a high level; the precise objective function, handling of N > 2 views, and any weighting between single-view and cross-view terms are not specified, making it impossible to assess whether reported multi-view gains arise from genuine consistency enforcement or from simple averaging.

    Authors: We will provide the exact objective function (including the formulation for arbitrary N>2), the weighting coefficients between single-view and cross-view terms, and a clear description of how perspective-field residuals are aggregated across views. This will demonstrate that the reported gains result from explicit geometric consistency rather than averaging. revision: yes
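
As an illustration of the coverage analysis promised in response 1, here is a minimal sketch of one such comparison, assuming per-clip normalized focal lengths are available for both corpora; the function and variable names are hypothetical, not from the paper.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def focal_coverage_gap(focals_ours, focals_reference):
        # 1-D Wasserstein distance between two distributions of normalized
        # focal lengths (focal length divided by image width), one per clip.
        return wasserstein_distance(np.asarray(focals_ours), np.asarray(focals_reference))

    # Hypothetical usage against an external benchmark:
    # gap = focal_coverage_gap(ours_focal_over_width, kitti_focal_over_width)

Analogous comparisons over distortion coefficients and trajectory statistics would quantify how far the training distribution sits from independent corpora such as KITTI or nuScenes.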

Circularity Check

0 steps flagged

No significant circularity: standard supervised pipeline with independent geometric optimization

full rationale

The paper constructs a multi-view video dataset, trains a transformer to predict dense perspective fields from it, and feeds those predictions into a separate geometric optimization step that enforces cross-view consistency to recover intrinsics and gravity. This chain follows conventional supervised learning plus post-hoc optimization; the optimization equations operate on the model's outputs rather than re-fitting parameters from the same training data, and no load-bearing self-citation or self-definitional loop reduces the claimed predictions to the inputs by construction. Dataset creation and model training introduce ordinary distribution dependence but do not create the circular reductions enumerated in the analysis criteria.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions plus one domain-specific modeling choice; no new physical entities are postulated.

free parameters (1)
  • Transformer hyperparameters and training schedule
    Typical learned weights and architectural choices tuned on the new dataset to produce perspective fields.
axioms (1)
  • domain assumption: Perspective fields can encode sufficient cross-view geometric consistency for joint intrinsics and gravity estimation
    Invoked in the unified formulation and optimization framework described in the abstract.

pith-pipeline@v0.9.0 · 5494 in / 1218 out tokens · 38134 ms · 2026-05-15T05:56:58.310994+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 4 internal anchors
