
arxiv: 2605.24074 · v1 · pith:WUON5G3Wnew · submitted 2026-05-22 · 💻 cs.CV · cs.RO

WideDepth: Millimeter-Accurate Benchmark for Fisheye Depth Estimation

Pith reviewed 2026-06-30 15:41 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords fisheye depth estimation · indoor benchmark · LiDAR ground truth · stereo dataset · robotics perception · depth completion · model adaptation · millimeter accuracy

The pith

WideDepth supplies the first indoor fisheye depth benchmark with millimeter-accurate LiDAR labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WideDepth as the first indoor dataset for fisheye depth estimation in robotics settings. It contains 101 scenes with 5,000 high-resolution stereo pairs carrying millimeter-level ground truth depth and disparity, plus paired pinhole images and multiple stereo baselines and orientations. The authors also describe a LiDAR-based pipeline to generate the fisheye images and a method for adapting pinhole-trained models to fisheye data. They evaluate current monocular, stereo, and depth-completion models on the benchmark and show that fine-tuning with 18,000 additional sparse LiDAR samples improves performance by up to 62 percent.

Core claim

WideDepth is the first indoor dataset for fisheye depth estimation, featuring 101 scenes containing 5K high-resolution stereo pairs labeled with millimeter-level ground truth depth and disparity. The dataset also includes paired pinhole and fisheye samples across varying fields of view and baselines in both horizontal and vertical stereo setups. A method to adapt pinhole-trained stereo models to fisheye images is proposed together with a novel stereo fisheye image generation pipeline based on high-resolution LiDAR scans. State-of-the-art models are evaluated on the benchmark, and 18K LiDAR-derived sparse depth training samples are released that achieve up to a 62 percent performance boost when fine-tuning pinhole-based stereo models on fisheye data.

What carries the argument

The WideDepth dataset together with its LiDAR-based stereo fisheye image generation and labeling pipeline that supplies millimeter-accurate ground truth.
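
To make the projection step concrete, here is a minimal sketch of how LiDAR points could be rendered into a fisheye depth map. It is not the authors' pipeline. It assumes an ideal equidistant fisheye model (r = f·θ), points already expressed in the camera frame, and a nearest-point buffer for occlusion; the function names, focal length, and image size are hypothetical, and the paper's actual camera model and occlusion handling may differ.

```python
import math

def project_equidistant(point, f, cx, cy):
    """Project a camera-frame point (x right, y down, z forward) with the
    equidistant model r = f * theta. Returns (u, v, range, theta) or None."""
    x, y, z = point
    rng = math.sqrt(x * x + y * y + z * z)
    if rng == 0.0:
        return None  # the camera center itself has no projection
    theta = math.atan2(math.hypot(x, y), z)  # angle from the optical axis
    phi = math.atan2(y, x)                   # azimuth in the image plane
    r = f * theta                            # radial distance in pixels
    return cx + r * math.cos(phi), cy + r * math.sin(phi), rng, theta

def render_depth(points, width, height, f, fov_deg):
    """Rasterize points into a range map, keeping the nearest point per pixel.
    Pixels that receive no point stay at math.inf."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    max_theta = math.radians(fov_deg) / 2.0
    depth = [[math.inf] * width for _ in range(height)]
    for p in points:
        proj = project_equidistant(p, f, cx, cy)
        if proj is None:
            continue
        u, v, rng, theta = proj
        if theta > max_theta:
            continue  # outside the lens field of view
        col, row = int(round(u)), int(round(v))
        if 0 <= col < width and 0 <= row < height:
            depth[row][col] = min(depth[row][col], rng)
    return depth

# Hypothetical points in meters: two on the optical axis, one at 90 degrees,
# one directly behind the camera (rejected by the field-of-view check).
cloud = [(0.0, 0.0, 2.0), (0.0, 0.0, 5.0), (1.0, 0.0, 0.0), (0.0, 0.0, -1.0)]
depth_map = render_depth(cloud, width=101, height=101, f=30.0, fov_deg=195.0)
print(depth_map[50][50], depth_map[50][97])
```

The sketch stores Euclidean range rather than z-depth because z turns negative beyond 90 degrees from the optical axis, a region that a 195-degree field of view (as in Figure 7) would include.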

If this is right

  • State-of-the-art monocular, stereo matching, and depth completion models can be evaluated on fisheye data with precise quantitative metrics.
  • Pinhole-trained stereo models can be adapted to fisheye images using the supplied adaptation method and paired samples.
  • Fine-tuning pinhole-based models with the 18K LiDAR-derived sparse depth samples yields up to 62 percent performance improvement on fisheye data.
  • The benchmark supports research on varying fields of view, baselines, and horizontal versus vertical stereo configurations for indoor robotics.
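
As a reading aid for the 62 percent figure, the sketch below shows how a relative improvement is typically computed from an error metric where lower is better. It assumes the paper's "performance boost" is a relative error reduction, which the abstract does not state; the function name and the error values are hypothetical placeholders, not numbers from the paper.

```python
def relative_improvement(baseline_error, finetuned_error):
    """Percentage reduction of an error metric where lower is better."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - finetuned_error) / baseline_error

# Hypothetical (before, after) error pairs, NOT taken from the paper.
pairs = {"model_a": (2.00, 0.76), "model_b": (0.10, 0.038)}
for name, (before, after) in pairs.items():
    gain = relative_improvement(before, after)
    print(f"{name}: {before} -> {after}, {gain:.1f}% lower, "
          f"{before - after:.3f} absolute reduction")
```

Both placeholder pairs yield the same relative gain while the absolute reductions differ by a factor of twenty, which is why a percentage alone says little without the baseline values.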

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could support development of fisheye-specific network architectures instead of relying only on adaptation from pinhole models.
  • Millimeter accuracy enables testing of depth methods for near-field robotic manipulation tasks where centimeter-scale errors are unacceptable.
  • The LiDAR-to-fisheye generation pipeline could be reused to create training data for other wide-angle camera models or outdoor settings.
  • Direct comparisons across the dataset's horizontal and vertical stereo setups could identify configurations that minimize distortion effects.

Load-bearing premise

The LiDAR scanning and labeling pipeline produces true millimeter-level accuracy for all fisheye views and stereo configurations without systematic bias from calibration, occlusion, or surface properties.

What would settle it

Independent high-precision measurements on the same scenes that show average depth label errors larger than a few millimeters.

Figures

Figures reproduced from arXiv: 2605.24074 by Aleksei Valenkov, Ignat Penshin, Ilia Indyk, Ilya Makarov, Ivan Sosin, Maxim Monastyrny.

Figure 1: Benchmark scene presented in the considered projections: Cassini, Pinhole, Equirectangular, Cubemap, Fisheye.

Figure 2: Intuition for the proposed Depth-Disparity Conversion formulations

Figure 4: Key dataset statistics. The depth range histogram shows a long right tail, which can challenge near-range indoor models. Overall, the primary depth bins are: 0–1 m (6.9%), 2–5 m (74.7%), 5–10 m (9.6%), and 10+ m (1.7%), aligning with typical indoor datasets. To evaluate scene complexity, we calculated local entropy over 11 neighboring pixels. The majority of samples in our benchmark exhibit high …

Figure 5: WideDepth: A high-resolution benchmark for fisheye depth estimation in indoor environments. From lidar-generated point clouds (1), we create …

Figure 6: All observed monocular depth models perform well on pinhole …

Figure 7: Qualitative results for stereo with FOV 195 using the StereoBase model. The model shows no degradation from geometric distortion, demonstrating …

Figure 8: Impact of baseline and FOV variations on RelEPE (lower is …
Original abstract

Fisheye cameras are increasingly adopted in robotics for near-field manipulation, navigation, and immersive perception, yet indoor depth benchmarks with accurate ground truth are still missing. To address this, we introduce WideDepth - the first indoor dataset for fisheye depth estimation, featuring 101 scenes containing 5K high-resolution stereo pairs labeled with millimeter-level ground truth depth and disparity. Our dataset also includes paired pinhole and fisheye samples across varying fields of view and baselines in both horizontal and vertical stereo setups. We further propose a method to adapt pinhole-trained stereo models to fisheye images and introduce a novel stereo fisheye image generation pipeline based on high-resolution LiDAR scans. Leveraging these methods, we thoroughly evaluate state-of-the-art monocular depth, stereo matching, and depth completion models on our benchmark. Additionally, we provide 18K LiDAR-derived sparse depth training samples, achieving up to a 62% performance boost on fisheye data when fine-tuning pinhole-based stereo models. In summary, the high precision and versatility of our benchmark set a strong foundation for advancing research in fisheye depth estimation and robotics perception. Project page: https://ilyaind.github.io/WideDepth

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces WideDepth as the first indoor fisheye depth estimation benchmark, comprising 101 scenes with 5K high-resolution stereo pairs that include millimeter-level ground truth depth and disparity labels generated via a high-resolution LiDAR scanning and projection pipeline. It provides paired pinhole and fisheye imagery across multiple fields of view and stereo baselines, proposes an adaptation method for pinhole-trained stereo models to fisheye data, describes a LiDAR-based stereo fisheye image generation pipeline, evaluates state-of-the-art monocular, stereo, and depth completion models, and supplies 18K sparse depth samples that reportedly yield up to a 62% performance boost when fine-tuning pinhole models on fisheye data.

Significance. If the millimeter-level ground truth accuracy can be independently validated, the dataset would address a clear gap by supplying the first dedicated high-precision indoor benchmark for fisheye depth estimation, directly supporting robotics applications that rely on fisheye cameras for near-field manipulation and navigation.

major comments (2)
  1. [Abstract] The central claim of 'millimeter-level ground truth depth and disparity' for all fisheye views rests on an unvalidated LiDAR-to-image projection pipeline; no error histograms, calibration target comparisons, or stratified accuracy metrics (by range or angle) are supplied to confirm sub-millimeter extrinsic calibration and distortion model fidelity, which directly undermines the title and dataset utility.
  2. [Abstract] The reported 'up to a 62% performance boost' from fine-tuning with the 18K LiDAR-derived sparse depth samples is presented without baseline values, absolute error metrics, or per-model breakdowns, rendering the quantitative claim impossible to evaluate against the stated evaluation of SOTA models.
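
The stratified validation requested in the first comment could be computed along the following lines. This is a hedged sketch, not a procedure from the paper: it assumes paired samples of dataset depth labels and independent reference measurements in meters, and the function name, bin edges, and sample values are hypothetical.

```python
import math

def stratified_abs_error(labels, references, keys, edges):
    """Mean absolute label error per bin of `keys`.

    `keys` holds the quantity to stratify by (for example reference range in
    meters, or angle from the optical axis). Bin i covers
    edges[i] <= key < edges[i + 1]; empty bins yield None.
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for label, ref, key in zip(labels, references, keys):
        for i in range(n_bins):
            if edges[i] <= key < edges[i + 1]:
                sums[i] += abs(label - ref)
                counts[i] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical samples in meters, stratified by reference range.
labels = [0.5, 1.0, 3.5, 6.0]
references = [0.5, 0.75, 3.0, 7.0]
edges = [0.0, 1.0, 5.0, 10.0, math.inf]
print(stratified_abs_error(labels, references, references, edges))
```

Passing per-sample angles as `keys` with angular bin edges would give the by-angle breakdown with the same function.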

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our claims.

  1. Referee: [Abstract] The central claim of 'millimeter-level ground truth depth and disparity' for all fisheye views rests on an unvalidated LiDAR-to-image projection pipeline; no error histograms, calibration target comparisons, or stratified accuracy metrics (by range or angle) are supplied to confirm sub-millimeter extrinsic calibration and distortion model fidelity, which directly undermines the title and dataset utility.

    Authors: We agree that the current manuscript does not supply the requested validation artifacts (error histograms, calibration target comparisons, or stratified metrics) to fully substantiate the millimeter-level claim across fisheye views. The LiDAR projection pipeline is described in Section 3, but explicit quantitative validation of extrinsic calibration and distortion fidelity is missing. We will add these analyses in the revised version, including error distributions, target-based comparisons, and breakdowns by range and angle, to support the title and dataset claims. revision: yes

  2. Referee: [Abstract] The reported 'up to a 62% performance boost' from fine-tuning with the 18K LiDAR-derived sparse depth samples is presented without baseline values, absolute error metrics, or per-model breakdowns, rendering the quantitative claim impossible to evaluate against the stated evaluation of SOTA models.

    Authors: We acknowledge that the abstract presents the 62% figure without accompanying baseline values, absolute metrics, or per-model details, which prevents direct evaluation. This result originates from the fine-tuning experiments in Section 5.3 comparing adapted models to their pinhole baselines on fisheye data. In the revision we will expand both the abstract and the experimental section to report the necessary baselines, absolute errors, and per-model breakdowns so the claim can be assessed alongside the SOTA evaluations. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical dataset contribution with no derivation chain


The paper presents a new fisheye depth dataset generated from LiDAR scans, an adaptation method for stereo models, and empirical evaluations of existing models. No equations, predictions, fitted parameters, or self-citations form a load-bearing derivation that reduces to its own inputs by construction. The millimeter accuracy claim is an assertion about the data collection pipeline rather than a derived result. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified assumption that LiDAR provides millimeter ground truth for fisheye geometry; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Standard stereo calibration and LiDAR-to-camera registration yield millimeter-accurate depth labels for fisheye images.
    Required for the ground-truth claim but not demonstrated in the abstract.

pith-pipeline@v0.9.1-grok · 5771 in / 1091 out tokens · 41279 ms · 2026-06-30T15:41:17.192018+00:00 · methodology

