Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

Xiangzhong Liu

arxiv: 2606.08680 · v1 · pith:YZY57753new · submitted 2026-06-07 · 💻 cs.CV · cs.RO

Pith reviewed 2026-06-27 18:30 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords BEV detection · fisheye cameras · distortion awareness · positional embedding · 3D object detection · autonomous driving · mixed cameras

The pith

DAPETR introduces distortion-aware positional embeddings and co-modulation to enable effective BEV detection with mixed pinhole-fisheye cameras.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a projection-free detector can handle the radial distortion of fisheye cameras in BEV object detection with learned adaptive modules rather than image rectification. It proposes two modules: a unified distortion-aware positional embedding and a bidirectional feature-geometry co-modulation module. On a converted KITTI-360 benchmark, these modules outperform both the PETR baseline and a polar-coordinate variant (PolarPETR), but combining the learned modules with the polar reparameterization produces a negative interaction. This matters because fisheye cameras offer wide coverage at low cost, yet their distortion breaks the uniform-sampling assumption that most BEV detectors rely on.

Core claim

DAPETR advances fisheye BEV detection with two modules: a unified distortion-aware positional embedding that harmonizes positional encodings with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. On the converted KITTI-360 benchmark it outperforms PETR and PolarPETR, and the experiments reveal a negative interaction between learned adaptation and explicit geometric reparameterization.
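
To make the positional-embedding idea concrete, the sketch below unprojects a pixel to a viewing ray under two camera models and then lifts it to 3D points along that ray, which is the kind of geometry a PETR-style 3D positional embedding is built from. It is a minimal illustration, not the paper's implementation: the equidistant fisheye model, the function names, and the intrinsics `f`, `cx`, `cy` are assumptions, since the abstract does not state which camera model DAPETR uses.

```python
import math

def pinhole_ray(u, v, f, cx, cy):
    """Unit viewing ray for a pixel under an ideal pinhole model."""
    x, y, z = (u - cx) / f, (v - cy) / f, 1.0
    n = math.sqrt(x * x + y * y + z * z)
    return (x / n, y / n, z / n)

def equidistant_ray(u, v, f, cx, cy):
    """Unit viewing ray under an equidistant fisheye model (r = f * theta).
    Assumed model, used for illustration only."""
    dx, dy = (u - cx) / f, (v - cy) / f
    theta = math.hypot(dx, dy)  # angle from the optical axis, in radians
    if theta == 0.0:
        return (0.0, 0.0, 1.0)
    s = math.sin(theta) / theta
    return (s * dx, s * dy, math.cos(theta))

def lift(ray, depths):
    """3D points along a ray at the given depths (hypothetical depth bins)."""
    return [(d * ray[0], d * ray[1], d * ray[2]) for d in depths]

f, cx, cy = 300.0, 320.0, 240.0  # hypothetical intrinsics
centre = equidistant_ray(cx, cy, f, cx, cy)
side = equidistant_ray(cx + f * math.pi / 2, cy, f, cx, cy)  # 90 degrees off-axis
points = lift(centre, [1.0, 2.0, 4.0])
```

A ray 90 degrees off the optical axis has no finite pinhole pixel, which is one reason a single pinhole-style encoding does not carry over unchanged to fisheye views.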

What carries the argument

The unified distortion-aware positional embedding and the bidirectional feature-geometry co-modulation module, which together allow adaptation to fisheye geometry without projection or rectification.
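
The paper's Figure 3 caption refers to a spatial FiLM module. As a rough picture of what feature-wise linear modulation does, and of how it could run in both directions between image features and positional embeddings, here is a minimal sketch. The linear `condition` layer, the weight names, and the two-channel vectors are hypothetical; the actual co-modulation architecture is not described in the text reviewed here.

```python
def condition(source, w_gamma, w_beta):
    """Predict per-channel scale and shift from a conditioning vector.
    A single linear layer stands in for whatever network the paper uses."""
    gamma = [1.0 + sum(w * s for w, s in zip(row, source)) for row in w_gamma]
    beta = [sum(w * s for w, s in zip(row, source)) for row in w_beta]
    return gamma, beta

def film(target, gamma, beta):
    """Feature-wise linear modulation: gamma * x + beta, channel by channel."""
    return [g * x + b for g, x, b in zip(gamma, target, beta)]

def co_modulate(feature, pos_embed, weights):
    """Modulate features by geometry and geometry by features (one step each)."""
    g_f, b_f = condition(pos_embed, weights["pe_to_feat_gamma"], weights["pe_to_feat_beta"])
    g_p, b_p = condition(feature, weights["feat_to_pe_gamma"], weights["feat_to_pe_beta"])
    return film(feature, g_f, b_f), film(pos_embed, g_p, b_p)

# With all-zero weights the scale is 1 and the shift is 0, so nothing changes.
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity_weights = {k: zeros for k in (
    "pe_to_feat_gamma", "pe_to_feat_beta", "feat_to_pe_gamma", "feat_to_pe_beta")}
feat_out, pe_out = co_modulate([0.5, -1.0], [0.2, 0.3], identity_weights)
```

Initialising the scale around 1 and the shift around 0, as done here, is a common design choice because the module then starts as an identity mapping; whether DAPETR does this is not stated.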

If this is right

Both learned adaptive modules and polar reparameterization improve over the PETR baseline for fisheye cameras.
The learned modules in DAPETR achieve better performance than PolarPETR.
Combining the learned adaptation with explicit geometric reparameterization leads to a negative interaction and reduced performance.
The approach provides insights for distortion-aware 3D perception designs that avoid image rectification.
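
For readers unfamiliar with the polar alternative, the sketch below converts a 3D point between Cartesian and cylindrical-polar coordinates in the BEV plane. The abstract describes PolarPETR only as PETR in polar coordinates, so the exact parameterization used there (for example, whether height is kept as `z`) is an assumption here.

```python
import math

def cartesian_to_polar(x, y, z):
    """(x, y, z) -> (range, azimuth, z), with azimuth in radians."""
    return math.hypot(x, y), math.atan2(y, x), z

def polar_to_cartesian(r, azimuth, z):
    """Inverse of cartesian_to_polar."""
    return r * math.cos(azimuth), r * math.sin(azimuth), z

r, azimuth, z = cartesian_to_polar(3.0, 4.0, 1.0)
x, y, z_back = polar_to_cartesian(r, azimuth, z)
```

The conversion is an explicit, fixed change of coordinates, in contrast to the learned modules, which adjust the encoding from data.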

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adaptive learning approaches may generalize better to real mixed camera setups than fixed geometric transformations.
Future detectors could integrate these modules to handle a wider variety of camera distortions without preprocessing steps.
Testing on diverse real-world data could confirm if the benchmark results hold beyond the converted KITTI-360 dataset.

Load-bearing premise

The converted KITTI-360 benchmark faithfully represents real mixed pinhole-fisheye camera setups, and the performance differences are due to the proposed modules.

What would settle it

Running DAPETR and PolarPETR on raw data from an actual vehicle equipped with both pinhole and fisheye cameras to check if the performance advantage persists.

Figures

Figures reproduced from arXiv: 2606.08680 by Xiangzhong Liu.

**Figure 2.** An overview of the Distortion-Aware PETR pipeline. Multi-view images from mixed pinhole and fisheye cameras are fed into an image backbone.

**Figure 3.** The architecture of our spatial FiLM module. Unprojected rays …

**Figure 5.** Qualitative comparison between the baseline PETR (left) and …

**Figure 4.** Model robustness under camera failure scenarios. FL, FB and SL …

Original abstract

Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains underleveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric reparameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.
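
The uniform-sampling point in the abstract can be illustrated numerically. The sketch compares how many pixels of image radius one degree of viewing angle occupies under a pinhole model (r = f tan θ) and under an equidistant fisheye model (r = f θ). The equidistant model and the focal length are illustrative assumptions; the abstract does not specify the fisheye model.

```python
import math

def pinhole_radius(theta, f):
    """Image radius of a ray at angle theta (radians) for a pinhole camera."""
    return f * math.tan(theta)

def equidistant_radius(theta, f):
    """Image radius for an equidistant fisheye camera (assumed model)."""
    return f * theta

def pixels_per_degree(radius_fn, theta_deg, f, step_deg=0.1):
    """Finite-difference estimate of radial pixels spent per degree of angle."""
    t0 = math.radians(theta_deg)
    t1 = math.radians(theta_deg + step_deg)
    return (radius_fn(t1, f) - radius_fn(t0, f)) / step_deg

f = 300.0  # hypothetical focal length in pixels
pin_ratio = pixels_per_degree(pinhole_radius, 60.0, f) / pixels_per_degree(pinhole_radius, 0.0, f)
fish_ratio = pixels_per_degree(equidistant_radius, 60.0, f) / pixels_per_degree(equidistant_radius, 0.0, f)
```

Under these assumptions the fisheye mapping spends the same number of pixels on every degree, while the pinhole mapping spends roughly four times as many at 60 degrees as on the axis, so an encoding tuned to one sampling pattern meets a different one in the other camera type.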

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DAPETR adds two learned modules for fisheye distortion to PETR on mixed-camera rigs and reports that they beat PolarPETR and that the learned and polar strategies interact negatively when combined, but the construction of the converted KITTI-360 benchmark is not described.

The key takeaway is that this paper introduces two specific learned modules for PETR to handle mixed pinhole-fisheye cameras in BEV detection and reports that they outperform polar coordinate reparameterization on a converted benchmark while also showing a negative interaction between the two strategies.

What is new is the unified distortion-aware positional embedding and the bidirectional co-modulation module, both designed to adapt to fisheye radial distortion without rectification. The paper does a solid job of setting up the practical problem and directly comparing the learned approach to PolarPETR to highlight the design choice.

The soft spot is the benchmark. All the performance claims and the interaction insight depend on a "converted" KITTI-360 dataset, yet the abstract provides no description of the conversion method, how fisheye parameters were mixed in, or any checks against real fisheye sequences. This makes it hard to rule out that the gains come from how the data was prepared rather than the modules themselves. The negative interaction finding is interesting but shares the same weakness.

This paper is aimed at people working on BEV detectors for autonomous driving who need to incorporate fisheye cameras. A reader in that area might pick up ideas from the modules, but anyone planning to use the results would want to see the full benchmark construction and ablations.

I would send it for peer review because the topic matters and the modules are a reasonable attempt, provided the full paper supplies the missing benchmark details and stronger evidence.

Referee Report

2 major / 2 minor

Summary. The paper proposes Distortion-Aware PETR (DAPETR) for BEV 3D object detection on mixed pinhole-fisheye camera rigs. It introduces two learned modules that work without explicit rectification or polar reparameterization: a unified distortion-aware positional embedding that adapts encodings to fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually refines image features and 3D embeddings. On a converted KITTI-360 benchmark the authors report that DAPETR outperforms both standard PETR and PolarPETR, and that learned adaptation and polar reparameterization interact negatively when combined.

Significance. If the benchmark conversion faithfully reproduces real mixed-camera geometry and the reported deltas are attributable to the modules rather than conversion artifacts, the work supplies a concrete empirical comparison between learned adaptive positional encodings and explicit geometric reparameterization for fisheye BEV detection, together with the observation of negative interaction. This could inform design choices in distortion-aware perception pipelines.

major comments (2)

[Experiments / benchmark description] The central empirical claims rest on a “converted KITTI-360 benchmark” whose construction is not described in the abstract or visible experimental section. No procedure is given for embedding radial distortion into the original pinhole images, for maintaining consistent 3D-to-2D correspondences across camera types, or for cross-validating against native fisheye sequences. Without these details it is impossible to rule out that measured gains over PolarPETR arise from benchmark artifacts rather than the proposed modules (§ on experiments / benchmark construction).
[Ablation studies] The headline result that the two strategies “interact negatively” is presented as a design insight, yet no ablation table, interaction term, or statistical test is referenced that isolates the interaction from other factors (e.g., training schedule, embedding dimensionality). This interaction is load-bearing for the claim that learned adaptation and explicit reparameterization conflict.
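
The interaction the referee asks for can be computed directly from a 2x2 ablation. The sketch below uses the standard additive interaction contrast; the function name is hypothetical, and the scores passed in are placeholders chosen for readability, not results from the paper.

```python
def interaction(base, a_only, b_only, both):
    """Additive 2x2 interaction: gain of the combination minus the sum of
    the individual gains. A negative value means the two changes conflict."""
    gain_a = a_only - base
    gain_b = b_only - base
    gain_both = both - base
    return gain_both - (gain_a + gain_b)

# Placeholder scores, NOT taken from the paper.
value = interaction(base=0.0, a_only=1.0, b_only=1.0, both=1.5)
```

A single negative value does not by itself separate a real conflict from run-to-run noise, which is why the comment also asks for a statistical test.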

minor comments (2)

[Method] Notation for the distortion-aware positional embedding and the co-modulation module should be introduced with explicit equations rather than prose descriptions only.
[Abstract / Experiments] The abstract states that DAPETR “significantly advances the research and benchmark,” but no quantitative comparison to prior fisheye BEV detectors (outside PETR/PolarPETR) is mentioned; a broader baseline table would strengthen the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity on the benchmark and strengthen the ablation analysis.

Referee: [Experiments / benchmark description] The central empirical claims rest on a “converted KITTI-360 benchmark” whose construction is not described in the abstract or visible experimental section. No procedure is given for embedding radial distortion into the original pinhole images, for maintaining consistent 3D-to-2D correspondences across camera types, or for cross-validating against native fisheye sequences. Without these details it is impossible to rule out that measured gains over PolarPETR arise from benchmark artifacts rather than the proposed modules (§ on experiments / benchmark construction).

Authors: We acknowledge that the description of the benchmark conversion process requires expansion for full reproducibility. The experimental section does outline the conversion from KITTI-360 pinhole images, but we agree it lacks sufficient procedural detail. In the revised manuscript we will add a dedicated subsection describing the radial distortion embedding method, how 3D-to-2D correspondences are preserved across camera models, and any validation against native fisheye data. This will help demonstrate that performance differences are attributable to the proposed modules.

revision: yes

Referee: [Ablation studies] The headline result that the two strategies “interact negatively” is presented as a design insight, yet no ablation table, interaction term, or statistical test is referenced that isolates the interaction from other factors (e.g., training schedule, embedding dimensionality). This interaction is load-bearing for the claim that learned adaptation and explicit reparameterization conflict.

Authors: The negative interaction is shown via direct comparison of the four combinations (baseline, distortion-aware embedding only, co-modulation only, and both) in our ablation experiments. However, the referee is correct that an explicit interaction term or statistical test isolating confounding factors is not provided. We will add a more granular ablation table with additional controls and a brief statistical note in the revision to better substantiate the interaction claim.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical module comparison on external benchmark

The paper introduces two learned modules (distortion-aware positional embedding and bidirectional co-modulation) and reports performance gains versus PETR and PolarPETR on a converted KITTI-360 benchmark. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims are experimental outcomes rather than derivations that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated beyond the two learned modules.

axioms (1)

Domain assumption: The converted KITTI-360 benchmark is a valid proxy for real mixed-camera fisheye perception. This is central to all reported performance claims.

