pith. machine review for the scientific record.

arxiv: 2605.14615 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CalibAnyView: Beyond Single-View Camera Calibration in the Wild

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera calibration · multi-view · transformer · geometric consistency · perspective fields · intrinsics estimation · in-the-wild perception

The pith

CalibAnyView enables camera calibration from any number of views in the wild by enforcing cross-view geometric consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CalibAnyView, a method that handles camera calibration for one or more images from everyday, uncontrolled settings. It builds a large dataset of multi-view videos with varied cameras and scenes, then uses a transformer to predict perspective fields that feed into an optimization step for intrinsics and gravity. A sympathetic reader would care because traditional calibration requires special patterns or controlled conditions, while existing learning methods ignore consistency between multiple views, making them less reliable for real videos or robot vision. If successful, this provides a foundation for better 3D perception in natural environments.
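
To make the pipeline concrete, the sketch below shows a minimal single-view version of the final optimization stage, assuming a perspective-field representation in the style of Jin et al. (a per-pixel latitude map), a centered principal point, and no lens distortion. It is an editorial reconstruction for illustration, not the paper's actual solver.

    import numpy as np
    from scipy.optimize import least_squares

    def latitude_residuals(params, uv, lat_obs, cx, cy):
        """Difference between observed per-pixel latitudes and those implied by a
        pinhole camera with focal length f and gravity given by (pitch, roll)."""
        f, pitch, roll = params
        # Back-project pixels to unit rays in the camera frame.
        x = (uv[:, 0] - cx) / f
        y = (uv[:, 1] - cy) / f
        rays = np.stack([x, y, np.ones_like(x)], axis=1)
        rays /= np.linalg.norm(rays, axis=1, keepdims=True)
        # World "up" direction in the camera frame (image y points down at zero tilt).
        up = np.array([np.sin(roll) * np.cos(pitch),
                       -np.cos(roll) * np.cos(pitch),
                       np.sin(pitch)])
        lat_pred = np.arcsin(np.clip(rays @ up, -1.0, 1.0))
        return lat_pred - lat_obs

    def calibrate_single_view(uv, lat_obs, width, height):
        # uv: (N, 2) pixel coordinates sampled from the predicted field;
        # lat_obs: (N,) predicted latitudes in radians.
        x0 = np.array([0.8 * max(width, height), 0.0, 0.0])  # rough focal-length initialization
        result = least_squares(latitude_residuals, x0,
                               args=(uv, lat_obs, width / 2.0, height / 2.0))
        f, pitch, roll = result.x
        return f, pitch, roll

The multi-view case would add cross-view consistency terms across such per-view residuals; the referee report below presses on exactly how that coupling is specified.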

Core claim

CalibAnyView is a unified formulation that supports an arbitrary number of input views by explicitly modeling cross-view geometric consistency. It is built on a large-scale multi-view video dataset and a multi-view transformer that predicts dense perspective fields, which feed into a geometric optimization to estimate camera intrinsics and gravity direction.

What carries the argument

The multi-view transformer predicting dense perspective fields, combined with a geometric optimization framework for joint intrinsics and gravity estimation.

If this is right

  • Outperforms state-of-the-art single-view calibration methods even in single-view mode.
  • Performance improves further when multiple views are provided.
  • Supports reliable 3D reconstruction and robotic perception from in-the-wild imagery.
  • Works robustly across diverse camera models and lens distortions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Could be integrated into video-based SLAM systems for continuous calibration.
  • May enable calibration from casual phone videos without special setups.
  • Opens the door to learning-based calibration for other sensors like depth cameras if similar datasets are built.

Load-bearing premise

The constructed large-scale multi-view video dataset covers sufficient diversity of real-world camera models, dynamic scenes, motion trajectories, and lens distortions to allow generalization to new inputs.

What would settle it

Evaluating the model on a held-out test set featuring camera models or distortion patterns entirely absent from the training dataset, and measuring whether intrinsic estimates remain accurate.
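
As a sketch of how such a held-out evaluation could be scored (the metrics below are standard calibration error measures; their use here is an editorial assumption, not a protocol taken from the paper):

    import numpy as np

    def relative_focal_error(f_pred, f_gt):
        # Relative focal-length error, a common calibration accuracy metric.
        return np.abs(np.asarray(f_pred) - np.asarray(f_gt)) / np.asarray(f_gt)

    def gravity_angular_error_deg(g_pred, g_gt):
        # Angle in degrees between predicted and ground-truth gravity directions.
        g_pred = np.asarray(g_pred) / np.linalg.norm(g_pred, axis=-1, keepdims=True)
        g_gt = np.asarray(g_gt) / np.linalg.norm(g_gt, axis=-1, keepdims=True)
        cos = np.clip(np.sum(g_pred * g_gt, axis=-1), -1.0, 1.0)
        return np.degrees(np.arccos(cos))

Reporting these on cameras and distortion patterns absent from training would directly test the load-bearing premise above.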

Figures

Figures reproduced from arXiv: 2605.14615 by Boying Li, Cheng Zhang, Daniel Cremers, Hamid Rezatofighi, Ian Reid, Weirong Chen.

Figure 1: CalibAnyView calibrates cameras from arbitrary views in the wild.
Figure 2: Overview of the proposed CalibAnyView framework.
Figure 3: Overview of the dataset construction pipeline.
Figure 4: Calibration performance vs. number of views.
Figure 5: Qualitative results on Stanford2D3D (multi-view).
Figure 6: Qualitative results on the TartanAir dataset.
Figure 7: Qualitative results on the Stanford2D3D dataset.
Figure 8: Qualitative results on the MegaDepth dataset.
Figure 9: Qualitative results on the LaMAR dataset.
Figures 10–22: Qualitative results on our proposed dataset test set.
Figure 23: Prompt used for video filtering.
Figure 24: Synthesized Projected Video Frame Sequences (Part 1).
Figure 25: Synthesized Projected Video Frame Sequences (Part 2).
read the original abstract

Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CalibAnyView, a unified formulation for camera calibration supporting an arbitrary number of input views (N ≥ 1). It constructs a large-scale multi-view video dataset covering diverse camera models, dynamic scenes, motion trajectories, and lens distortions; trains a multi-view transformer to predict dense perspective fields; and integrates these into a geometric optimization framework to jointly recover intrinsics and gravity direction. The central claim is that the method consistently outperforms prior single-view state-of-the-art approaches, remains robust in single-view settings, and yields further gains under multi-view inference.

Significance. If the quantitative claims hold, the work is significant because it extends learning-based single-view calibration to enforce explicit cross-view geometric consistency without requiring controlled capture setups. The new dataset and the perspective-field-plus-optimization pipeline could serve as a practical foundation for downstream tasks such as 3D reconstruction and robotic perception in unconstrained environments.

major comments (3)
  1. [Dataset Construction] Dataset section: the assertion that the constructed multi-view video dataset sufficiently covers real-world diversity (camera models, lens distortions, trajectories) is load-bearing for the generalization claim, yet no statistical comparison (histograms, divergence metrics, or coverage statistics) is provided against independent corpora such as KITTI, nuScenes, or existing calibration benchmarks.
  2. [Experiments] Experiments section: the abstract states that CalibAnyView 'consistently outperforms state-of-the-art methods' and 'further improves with multi-view inference,' but the manuscript supplies no numerical tables, baseline implementations, ablation results on view count, or error metrics (e.g., focal-length error, distortion coefficient error, gravity angular error). Without these, the central performance claim cannot be verified.
  3. [Method] Geometric optimization: the integration step that converts predicted perspective fields into joint intrinsics and gravity estimates is described at a high level; the precise objective function, handling of N > 2 views, and any weighting between single-view and cross-view terms are not specified, making it impossible to assess whether reported multi-view gains arise from genuine consistency enforcement or from simple averaging.
minor comments (2)
  1. [Abstract] Abstract: reporting 'consistent outperformance' without any numerical values is atypical and reduces immediate readability; a single sentence summarizing key error reductions would help.
  2. [Method] Notation: the term 'perspective fields' is introduced without an early formal definition or diagram; a short equation or figure in §3 would clarify the representation before the transformer architecture is described.
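
For concreteness, minor comment 2 and major comment 3 could be addressed with an explicit formulation along the following lines; this is an editorial sketch of one plausible objective, not the paper's actual equations. With per-view latitude predictions $\hat{\ell}_i(p)$ (the perspective-field scalar at pixel $p$ of view $i$), back-projected unit rays $r(p; f_i)$ for focal length $f_i$, known relative rotations $R_i$ from the video, a shared world gravity direction $g_w$, and a robust loss $\rho$:

    $$
    E\bigl(\{f_i\}, g_w\bigr) = \sum_{i=1}^{N} \sum_{p} \rho\!\Bigl(\hat{\ell}_i(p) - \arcsin\bigl(r(p; f_i)^{\top} R_i\, g_w\bigr)\Bigr) + \lambda \sum_{i<j} \bigl(f_i - f_j\bigr)^{2}
    $$

The first term fits each view to its predicted field; the second is a cross-view consistency term (here a shared-focal-length prior) weighted by $\lambda$. Whether multi-view gains reflect genuine consistency enforcement rather than simple averaging would show up in whether this coupling term changes the minimizer.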

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript to improve clarity, completeness, and verifiability of the claims.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset section: the assertion that the constructed multi-view video dataset sufficiently covers real-world diversity (camera models, lens distortions, trajectories) is load-bearing for the generalization claim, yet no statistical comparison (histograms, divergence metrics, or coverage statistics) is provided against independent corpora such as KITTI, nuScenes, or existing calibration benchmarks.

    Authors: We agree that quantitative coverage analysis would strengthen the generalization argument. In the revised manuscript we will add histograms of focal lengths, distortion coefficients, scene dynamics, and motion trajectories, along with divergence metrics (e.g., KL divergence or Wasserstein distance) comparing our dataset against KITTI, nuScenes, and standard calibration benchmarks. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract states that CalibAnyView 'consistently outperforms state-of-the-art methods' and 'further improves with multi-view inference,' but the manuscript supplies no numerical tables, baseline implementations, ablation results on view count, or error metrics (e.g., focal-length error, distortion coefficient error, gravity angular error). Without these, the central performance claim cannot be verified.

    Authors: We will expand the experiments section to include complete numerical tables with all requested error metrics (focal-length error, distortion coefficients, gravity angular error), explicit baseline implementations, and ablation studies that vary the number of input views (N=1 to N=8). These additions will make the performance claims directly verifiable. revision: yes

  3. Referee: [Method] Geometric optimization: the integration step that converts predicted perspective fields into joint intrinsics and gravity estimates is described at a high level; the precise objective function, handling of N > 2 views, and any weighting between single-view and cross-view terms are not specified, making it impossible to assess whether reported multi-view gains arise from genuine consistency enforcement or from simple averaging.

    Authors: We will provide the exact objective function (including the formulation for arbitrary N>2), the weighting coefficients between single-view and cross-view terms, and a clear description of how perspective-field residuals are aggregated across views. This will demonstrate that the reported gains result from explicit geometric consistency rather than averaging. revision: yes
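
As an illustration of the coverage analysis promised in response 1, here is a minimal sketch of one such comparison, assuming per-clip normalized focal lengths are available for both corpora; the function and variable names are hypothetical, not from the paper.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def focal_coverage_gap(focals_ours, focals_reference):
        # 1-D Wasserstein distance between two distributions of normalized
        # focal lengths (focal length divided by image width), one per clip.
        return wasserstein_distance(np.asarray(focals_ours), np.asarray(focals_reference))

    # Hypothetical usage against an external benchmark:
    # gap = focal_coverage_gap(ours_focal_over_width, kitti_focal_over_width)

Analogous comparisons over distortion coefficients and trajectory statistics would quantify how far the training distribution sits from independent corpora such as KITTI or nuScenes.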

Circularity Check

0 steps flagged

No significant circularity: standard supervised pipeline with independent geometric optimization

full rationale

The paper constructs a multi-view video dataset, trains a transformer to predict dense perspective fields from it, and feeds those predictions into a separate geometric optimization step that enforces cross-view consistency to recover intrinsics and gravity. This chain follows conventional supervised learning plus post-hoc optimization; the optimization equations operate on the model's outputs rather than re-fitting parameters from the same training data, and no load-bearing self-citation or self-definitional loop reduces the claimed predictions to the inputs by construction. Dataset creation and model training introduce ordinary distribution dependence but do not create the circular reductions enumerated in the analysis criteria.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions plus one domain-specific modeling choice; no new physical entities are postulated.

free parameters (1)
  • Transformer hyperparameters and training schedule
    Typical learned weights and architectural choices tuned on the new dataset to produce perspective fields.
axioms (1)
  • domain assumption: Perspective fields can encode sufficient cross-view geometric consistency for joint intrinsics and gravity estimation
    Invoked in the unified formulation and optimization framework described in the abstract.

pith-pipeline@v0.9.0 · 5494 in / 1218 out tokens · 38134 ms · 2026-05-15T05:56:58.310994+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 4 internal anchors
