pith. machine review for the scientific record.

arxiv: 2604.19411 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:17 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords bird's-eye-view · semantic mapping · aerial supervision · ego-centric sensors · dynamic scenes · pseudo-labeling · BEV segmentation · road scene understanding

The pith

GOLD-BEV learns dense BEV semantic maps of dynamic road scenes from ego-centric sensors by training with time-synchronized aerial imagery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GOLD-BEV, a framework that produces dense bird's-eye-view semantic maps including moving agents using only vehicle-mounted sensors at test time. It achieves this by restricting aerial imagery to the training stage, where strict time synchronization supplies dense targets that ground views alone cannot provide reliably. BEV-aligned aerial crops enable scalable pseudo-label generation via adapted teacher models, and the system further learns to synthesize aerial views from ego data for labeling drives outside aerial coverage. This setup addresses the difficulty of annotating dynamic elements and temporal inconsistencies that arise in ego-only BEV approaches. The result supports applications that require geometrically consistent scene representations for planning without ongoing access to overhead data.

Core claim

GOLD-BEV learns dense bird's-eye-view semantic environment maps, including dynamic agents, from ego-centric sensors by employing time-synchronized aerial imagery exclusively as supervision during training. The use of BEV-aligned aerial crops facilitates dense semantic annotation with little manual effort and resolves ambiguities present in ego-only labeling. Strict synchronization further enables supervision of moving traffic participants while mitigating temporal inconsistencies. Domain-adapted aerial teachers generate the dense targets, and joint training includes optional pseudo-aerial reconstruction for interpretability. The model additionally learns to synthesize pseudo-aerial BEV images from ego sensors, extending supervision and annotation to drives beyond aerial coverage.

What carries the argument

Strictly time-synchronized aerial-ground data pairs, from which domain-adapted aerial teacher models generate dense BEV pseudo-labels, extended by synthesis of pseudo-aerial BEV images from ego sensors for areas beyond aerial coverage.
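
To make that supervision path concrete, here is a minimal sketch of how a domain-adapted aerial teacher could turn a BEV-aligned aerial crop into a dense pseudo-label map. The confidence threshold, ignore index, and teacher interface are illustrative assumptions, not values from the paper.

```python
import torch

IGNORE_INDEX = 255  # hypothetical "unlabeled" id for low-confidence pixels

@torch.no_grad()
def pseudo_label_crop(teacher, bev_crop, conf_thresh=0.7):
    """Run a domain-adapted aerial teacher on a batch of BEV-aligned aerial
    crops and keep only confidently predicted pixels as dense pseudo-labels.

    teacher     : segmentation network returning (B, C, H, W) class logits
    bev_crop    : (B, 3, H, W) aerial RGB crops centered on the ego vehicle
    conf_thresh : assumed confidence cutoff, not a value from the paper
    """
    logits = teacher(bev_crop)                 # (B, C, H, W)
    probs = logits.softmax(dim=1)
    conf, labels = probs.max(dim=1)            # per-pixel confidence and class id
    labels[conf < conf_thresh] = IGNORE_INDEX  # discard uncertain supervision
    return labels                              # (B, H, W) pseudo-label map
```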

Load-bearing premise

Time-synchronized aerial-ground data pairs can be obtained at scale and domain-adapted aerial teachers produce reliable dense pseudo-labels for dynamic scenes without errors that propagate into the final BEV model.

What would settle it

Train two versions of the BEV segmentation model on the same ego data, one with and one without the synchronized aerial pseudo-labels, then measure IoU on dynamic agent classes in a test set that has independent ground-truth BEV annotations; a substantial drop without aerial supervision would support the central claim.
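
A hedged sketch of the evaluation that experiment implies: per-class IoU computed separately for the aerial-supervised and ego-only models, then compared on the dynamic classes. Class ids and variable names are placeholders.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU between predicted and ground-truth BEV label maps.

    pred, gt: integer arrays of shape (N, H, W) with class ids in [0, num_classes).
    Returns an array of length num_classes (NaN where a class never appears).
    """
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

# Hypothetical usage with assumed dynamic-class ids (e.g., vehicle=1, pedestrian=2):
# dynamic = [1, 2]
# gain = per_class_iou(pred_aerial_sup, gt, 10)[dynamic] \
#      - per_class_iou(pred_ego_only, gt, 10)[dynamic]
```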

Figures

Figures reproduced from arXiv: 2604.19411 by Alaa Eddine Ben Zekri, Franz Kurz, Houda Chaabouni-Chouayakh, Joshua Niemeijer, Philipp M. Schmälzle, Reza Bahmanyar.

Figure 1: Cross-view aerial supervision for BEV semantic mapping. A helicopter-mounted camera records high-resolution overhead RGB imagery (a), time-synchronized with an instrumented car that captures a forward-facing RGB view and LiDAR sweeps (b). By geo-aligning the aerial imagery to the vehicle frame, we obtain BEV-aligned crops and dense semantic targets that supervise BEV map prediction from ego sensors (c). De…
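
The geo-alignment step in (c) amounts to resampling the overhead image around the ego pose. A minimal sketch, assuming a metric (e.g., UTM) image origin, a known ground sampling distance, and OpenCV for the warp; the paper's actual alignment procedure may differ.

```python
import cv2
import numpy as np

def bev_aligned_crop(aerial_img, ego_xy_m, ego_yaw_rad, origin_xy_m, gsd_m,
                     crop_px=512):
    """Cut a BEV-aligned aerial crop centered on the ego vehicle.

    aerial_img  : geo-referenced aerial image, shape (H, W, 3)
    ego_xy_m    : ego position in the same metric frame as the image origin
    ego_yaw_rad : ego heading; the crop is rotated so "forward" points up
    origin_xy_m : metric coordinates of the image's top-left pixel (assumed known)
    gsd_m       : ground sampling distance in meters per pixel (assumed known)
    crop_px     : output crop size; 512 px is an illustrative choice
    """
    # Ego position in pixel coordinates (image rows grow downward, so y is flipped).
    px = (ego_xy_m[0] - origin_xy_m[0]) / gsd_m
    py = (origin_xy_m[1] - ego_xy_m[1]) / gsd_m

    # Rotate the aerial image about the ego position so the heading points up.
    # The angle offset assumes yaw measured counter-clockwise from east; the
    # sign convention depends on the dataset and is an assumption here.
    rot = cv2.getRotationMatrix2D((px, py), np.degrees(ego_yaw_rad) - 90.0, 1.0)
    rotated = cv2.warpAffine(aerial_img, rot,
                             (aerial_img.shape[1], aerial_img.shape[0]))

    # Crop a fixed-size window centered on the ego vehicle.
    half = crop_px // 2
    r0, c0 = int(py) - half, int(px) - half
    return rotated[r0:r0 + crop_px, c0:c0 + crop_px]
```
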
Figure 2: Vehicle setup used for the campaign and sensor installations (a). Trajectory of the data and how it is split into train/val/test (b). Examples of the input (camera and LiDAR), the corresponding aerial crops with the centered ego vehicle, and the derived labels from the aerial view (c).
Figure 3: Comparison of BEV semantic segmentation predictions. Each row shows a different scenario: a highway scene (top), a complex urban intersection (middle), and a zoomed-in crop highlighting a VRU prediction (bottom). From left to right: ground truth (GT), LiDAR-only prediction (L-only), camera+LiDAR prediction (C+L), and camera+LiDAR with additional sparse LiDAR fine-tuning (C+L + sparse). VRU IoU 0.084 → 0.29…
Figure 4: Views used for BEV annotation and validation. Left: real BEV image used to create the Gold manual GT. Middle/right: pseudo-aerial BEV reconstructions from SegFormer and diffusion. LiDAR returns are overlaid (red) to indicate cells with direct sensor support. Annotators label SegFormer/diffusion views independently; the final mask is a fusion that retains only pixels where both views agree on the class. aer…
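
The fusion rule described in the caption reduces to a per-pixel agreement check between the two annotated views; a minimal sketch, with the ignore index as an assumed convention.

```python
import numpy as np

IGNORE_INDEX = 255  # hypothetical id for pixels where the two views disagree

def fuse_annotations(mask_segformer, mask_diffusion):
    """Fuse the two independently annotated pseudo-aerial views: a pixel keeps
    its class only if the SegFormer-based and diffusion-based annotations agree,
    otherwise it is marked as ignore. Both inputs are (H, W) integer label maps.
    """
    return np.where(mask_segformer == mask_diffusion, mask_segformer, IGNORE_INDEX)
```
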
Figure 5: Overview of our student model. The input consists of (i) a LiDAR point cloud rasterized into a BEV representation with occupancy, height, and density channels, and (ii) a front-facing ground-view RGB image. Each modality is first encoded by a SegFormer backbone. The resulting multi-scale features are fused by cross-attention, allowing the model to combine complementary information despite the different vie…
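
A sketch of the LiDAR-to-BEV rasterization the caption describes (occupancy, height, and density channels). Grid extent, resolution, and the log compression of the density channel are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rasterize_lidar(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                    resolution=0.25):
    """Rasterize one LiDAR sweep into a BEV grid with occupancy, max-height,
    and density channels.

    points : (N, 3) array of x, y, z coordinates in the ego frame.
    Returns: float32 array of shape (3, H, W).
    """
    w = int((x_range[1] - x_range[0]) / resolution)
    h = int((y_range[1] - y_range[0]) / resolution)

    # Keep points inside the grid and convert metric coordinates to cell indices.
    cols = ((points[:, 0] - x_range[0]) / resolution).astype(int)
    rows = ((points[:, 1] - y_range[0]) / resolution).astype(int)
    keep = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    cols, rows, z = cols[keep], rows[keep], points[keep, 2]

    bev = np.zeros((3, h, w), dtype=np.float32)
    bev[0, rows, cols] = 1.0                       # occupancy

    heights = np.full((h, w), -np.inf, dtype=np.float32)
    np.maximum.at(heights, (rows, cols), z)        # max height per occupied cell
    bev[1] = np.where(np.isneginf(heights), 0.0, heights)

    np.add.at(bev[2], (rows, cols), 1.0)           # points per cell
    bev[2] = np.log1p(bev[2])                      # compress heavy-tailed density
    return bev
```
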
Figure 6: Starting from two supervision sources—a labeled target subset of BEV-aligned helicopter crops and external supervised data—we train two domain-adapted aerial teachers: a structural Mask2Former for coarse scene layout and a pedestrian-specific Mask2Former for dynamic agents. Applied to each BEV crop, the structural teacher predicts per-pixel semantic labels and confidences, while the pedestrian teacher pre…
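
One plausible way to combine the two teachers' outputs into a single pseudo-label map, under the assumption that the pedestrian-specific teacher overrides the structural layout wherever it is confident; the class id and threshold are placeholders, not the paper's choices.

```python
import torch

def merge_teacher_outputs(struct_labels, ped_conf, pedestrian_id=7, ped_thresh=0.5):
    """Combine the structural teacher's per-pixel labels with the pedestrian
    teacher's output: confident pedestrian pixels override the structural layout.

    struct_labels : (H, W) long tensor of per-pixel class ids
    ped_conf      : (H, W) tensor of pedestrian confidences in [0, 1]
    """
    merged = struct_labels.clone()
    merged[ped_conf >= ped_thresh] = pedestrian_id
    return merged
```
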
Figure 7: Qualitative results on the GOLD-BEV dataset. Each row shows one scene. From left to right: front camera, LiDAR BEV raster, ground-truth BEV aerial crop, reconstructed BEV image, pseudo-label BEV segmentation, and predicted segmentation.
Original abstract

Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird's-eye-view (BEV) semantic environment maps, including dynamic agents, from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces GOLD-BEV, a framework for learning dense bird's-eye-view (BEV) semantic environment maps—including dynamic agents—from ego-centric sensors. Supervision comes exclusively from time-synchronized aerial imagery during training; BEV-aligned aerial crops serve as dense targets, domain-adapted aerial teachers generate pseudo-labels, and the model is trained with a joint loss that optionally includes pseudo-aerial BEV reconstruction. The approach is extended to synthesize pseudo-aerial BEV images from ego sensors for annotation and uncertainty-aware labeling on unlabeled data.

Significance. If the quantitative claims hold, the work would offer a practical route to scalable dense BEV semantic mapping that handles moving traffic participants without requiring manual ego-only BEV labels. The emphasis on strict aerial-ground synchronization and the teacher-student setup with optional reconstruction loss addresses a recognized difficulty in dynamic-scene BEV segmentation.

major comments (3)
  1. [§3.2] §3.2: The domain-adapted aerial teacher (U-Net fine-tuned via aerial-to-ground adaptation) is presented without any per-class IoU, boundary-error, or motion-specific metrics on dynamic objects (vehicles, pedestrians). Because aerial imagery exhibits distinct motion blur, parallax, and occlusion statistics, residual domain gap directly affects the only dense supervision signal available for moving agents; this omission leaves the central claim about reliable dynamic-agent BEV mapping unsupported.
  2. [Eq. (4)] Eq. (4): The joint training loss combines BEV segmentation with optional pseudo-aerial reconstruction but contains no uncertainty weighting, consistency regularizer, or teacher-error isolation term. Consequently, label noise from the aerial teacher on dynamic classes propagates unchecked into the student BEV head, undermining the claim that ego-centric sensors alone suffice for accurate dense dynamic semantics.
  3. [§4] §4 / experimental section: No ablation studies, error analysis, or quantitative comparison against baselines (e.g., ego-only BEV models or non-synchronized aerial supervision) are reported for dynamic classes. Without these, it is impossible to determine whether the synchronized aerial supervision actually improves BEV accuracy on moving agents or merely reproduces the teacher’s errors.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a single sentence summarizing the key quantitative improvements (e.g., mIoU gains on dynamic classes) once the experiments are added.
  2. [§3.2] Notation for the aerial-to-ground domain adaptation step is introduced without an explicit diagram or equation reference, making the teacher pipeline harder to follow on first reading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which help clarify the validation needed for dynamic-agent mapping in GOLD-BEV. We address each major point below, providing clarifications from the manuscript and outlining targeted revisions where they strengthen the claims without altering the core contributions.

Point-by-point responses
  1. Referee: [§3.2] §3.2: The domain-adapted aerial teacher (U-Net fine-tuned via aerial-to-ground adaptation) is presented without any per-class IoU, boundary-error, or motion-specific metrics on dynamic objects (vehicles, pedestrians). Because aerial imagery exhibits distinct motion blur, parallax, and occlusion statistics, residual domain gap directly affects the only dense supervision signal available for moving agents; this omission leaves the central claim about reliable dynamic-agent BEV mapping unsupported.

    Authors: We agree that explicit per-class metrics on the aerial teacher would better quantify residual domain gap for dynamic objects. Section 3.2 describes the U-Net adaptation process and its use for pseudo-label generation, while Section 4 reports end-to-end BEV segmentation performance on dynamic agents that exceeds ego-only baselines. To directly address the concern, we will add in the revision per-class IoU, boundary F1, and motion-specific error metrics for vehicles and pedestrians on a held-out aerial validation set, along with qualitative examples of teacher predictions on moving agents. This will explicitly measure the supervision quality for dynamic classes. revision: yes

  2. Referee: [Eq. (4)] Eq. (4): The joint training loss combines BEV segmentation with optional pseudo-aerial reconstruction but contains no uncertainty weighting, consistency regularizer, or teacher-error isolation term. Consequently, label noise from the aerial teacher on dynamic classes propagates unchecked into the student BEV head, undermining the claim that ego-centric sensors alone suffice for accurate dense dynamic semantics.

    Authors: The joint loss in Eq. (4) uses the optional pseudo-aerial BEV reconstruction term precisely as a consistency regularizer: it forces the learned BEV features to reconstruct overhead views, providing implicit isolation of teacher errors through geometric consistency. Strict aerial-ground synchronization (emphasized in the abstract and Section 3) further reduces temporal noise for moving agents, which is the primary source of label error in non-synchronized settings. While explicit uncertainty weighting is absent, the design and reported results on dynamic classes support that noise does not propagate unchecked. We will add a short discussion of this regularization effect and a note on potential extensions with uncertainty in the revised manuscript; a schematic form of this combined objective is sketched after this exchange. revision: partial

  3. Referee: [§4] §4 / experimental section: No ablation studies, error analysis, or quantitative comparison against baselines (e.g., ego-only BEV models or non-synchronized aerial supervision) are reported for dynamic classes. Without these, it is impossible to determine whether the synchronized aerial supervision actually improves BEV accuracy on moving agents or merely reproduces the teacher’s errors.

    Authors: Section 4 presents quantitative BEV semantic segmentation results that include dynamic agents and demonstrates the benefit of the full GOLD-BEV pipeline. However, we acknowledge the value of explicit breakdowns. In the revision we will add: (i) direct comparison against an ego-only BEV baseline (trained without aerial supervision) with per-class metrics on dynamic objects, (ii) ablation removing the reconstruction loss, and (iii) error analysis stratified by dynamic vs. static classes, including qualitative failure cases. These additions will isolate the contribution of synchronized aerial supervision for moving agents. revision: yes
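
For concreteness, a schematic of the joint objective discussed around Eq. (4): a BEV segmentation loss against the teacher pseudo-labels plus an optional, weighted pseudo-aerial reconstruction term. The loss functions, the weight, and the ignore index are assumed placeholders; the paper's exact formulation is not reproduced here.

```python
import torch.nn.functional as F

def joint_loss(seg_logits, pseudo_labels, recon=None, aerial_target=None,
               recon_weight=0.1, ignore_index=255):
    """Schematic joint objective: BEV segmentation against teacher pseudo-labels,
    plus an optional pseudo-aerial reconstruction term acting as a consistency
    regularizer. recon_weight and ignore_index are illustrative choices.
    """
    loss = F.cross_entropy(seg_logits, pseudo_labels, ignore_index=ignore_index)
    if recon is not None and aerial_target is not None:
        # Reconstruct the time-synchronized aerial crop from the BEV features.
        loss = loss + recon_weight * F.l1_loss(recon, aerial_target)
    return loss
```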

Circularity Check

0 steps flagged

No circularity: high-level framework proposal without equations or self-referential reductions

Full rationale

The manuscript presents GOLD-BEV as a training-time supervision framework that uses synchronized aerial imagery to generate pseudo-labels for ego-centric BEV segmentation, followed by optional reconstruction and synthesis extensions. No derivation chain, fitted parameters renamed as predictions, or self-citations appear in the abstract or described pipeline. The approach relies on standard domain adaptation and pseudo-labeling steps whose validity is external to the paper's own outputs; the central claim (ego-only inference after aerial-only training) does not reduce to its inputs by construction and remains falsifiable via held-out evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract invokes standard computer-vision assumptions about multi-view alignment and domain adaptation but introduces no explicit free parameters, new axioms, or invented entities beyond the named framework components.

pith-pipeline@v0.9.0 · 5503 in / 1106 out tokens · 32684 ms · 2026-05-10T03:17:28.939205+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1] Agarwal, S., Vora, A., Pandey, G., Williams, W., Kourous, H., McBride, J.: Ford multi-AV seasonal dataset. The International Journal of Robotics Research 39(12), 1367–1376 (Sep 2020). https://doi.org/10.1177/0278364920961451

  2. [2] Azimi, S.M., Henry, C., Sommer, L., Schumann, A., Vig, E.: SkyScapes fine-grained semantic understanding of aerial scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

  3. [3] Bai, X., Hu, Z., Zhu, X., Huang, Q., Chen, Y., Fu, H., Tai, C.L.: Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. arXiv preprint arXiv:2203.11496 (2022)

  4. [4] Ben Zekri, A.E., Latrach, A., Bahmanyar, R., Chaabouni-Chouayakh, H.: Towards using synthetic data in aerial image segmentation. In: 2025 Joint Urban Remote Sensing Event (JURSE). pp. 1–4 (2025). https://doi.org/10.1109/JURSE60372.2025.11076036

  5. [5] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027 (2019)

  6. [6] Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  7. [7] Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., Schwing, A.G.: Mask2Former for video instance segmentation (2021). https://arxiv.org/abs/2112.10764

  8. [8] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11), 1231–1237 (2013). https://doi.org/10.1177/0278364913491297

  9. [9] Guan, Y., Wang, T., Cheng, Q., et al.: Crfusion: a novel lidar-camera fusion network for BEV map construction. Scientific Reports 16, 5169 (2026). https://doi.org/10.1038/s41598-026-35551-0

  10. [10] Hou, Y., Zou, B., Zhang, M., Chen, R., Yang, S., Zhang, Y., Zhuo, J., Chen, S., Chen, J., Ma, H.: AGC-Drive: A large-scale dataset for real-world aerial-ground collaboration in driving scenarios. arXiv preprint arXiv:2506.16371 (2025)

  11. [11] Hu, A., Murez, Z., Mohan, N., Dudas, S., Hawke, J., Badrinarayanan, V., Cipolla, R., Kendall, A.: FIERY: Future instance prediction in bird's-eye view from surround monocular cameras. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 15273–15282 (October 2021)

  12. [12] Hu, S., Chen, L., Wu, P., Li, H., Yan, J., Tao, D.: ST-P3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In: Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13698, pp. 533–549. Springer (2022). https://doi.org/10.1007/978-3-031-19839-7_31

  13. [13] Hu, S., Feng, M., Nguyen, R.M.H., Lee, G.H.: CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)

  14. [14] Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L.: Joint 3d proposal generation and object detection from view aggregation. arXiv preprint arXiv:1712.02294 (2017)

  15. [15] Li, Q., Wang, Y., Wang, Y., Zhao, H.: HDMapNet: An online HD map construction and evaluation framework. arXiv preprint arXiv:2107.06307 (2021)

  16. [16] Li, Y., Ge, Z., Yu, G., Yang, J., Wang, Z., Shi, Y., Sun, J., Li, Z.: BEVDepth: Acquisition of reliable depth for multi-view 3d object detection. arXiv preprint arXiv:2206.10092 (2022)

  17. [17] Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., Dai, J.: BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In: Computer Vision – ECCV 2022, Proceedings, Part IX. Lecture Notes in Computer Science, vol. 13669. Springer (2022)

  18. [18] Liao, B., Chen, S., Wang, X., Cheng, T., Zhang, Q., Liu, W., Huang, C.: MapTR: Structured modeling and learning for online vectorized HD map construction. In: The Eleventh International Conference on Learning Representations (ICLR). OpenReview.net (2023). https://openreview.net/forum?id=k7p_YAO7yE

  19. [19] Liu, L., Li, H.: Lending orientation to neural networks for cross-view geo-localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)

  20. [20] Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D.L., Han, S.: BEVFusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 2774–2781. IEEE (2023). https://doi.org/10.1109/ICRA48891.2023.10160968

  21. [21] Lyu, Y., Vosselman, G., Xia, G., Yilmaz, A., Yang, M.Y.: UAVid: A semantic segmentation dataset for UAV imagery. ISPRS Journal of Photogrammetry and Remote Sensing 165, 108–119 (2020). https://doi.org/10.1016/j.isprsjprs.2020.05.009

  22. [22] Mani, K., Daga, S., Garg, S., Shankar, N.S., Jatavallabhula, K.M., Krishna, K.M.: MonoLayout: Amodal scene layout from a single image. arXiv preprint arXiv:2002.08394 (2020)

  23. [23] Niemeijer, J., Schäfer, J.P.: Combining semantic self-supervision and self-training for domain adaptation in semantic segmentation. In: 2021 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops). pp. 364–371 (2021). https://doi.org/10.1109/IVWorkshops54471.2021.9669255

  24. [24] Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. arXiv preprint arXiv:2008.05711 (2020)

  25. [25] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10674–10685 (2022). https://doi.org/10.1109/CVPR52688.2022.01042

  26. [26] Schwonberg, M., Niemeijer, J., Termöhlen, J.A., Schäfer, J.P., Schmidt, N.M., Gottschalk, H., Fingscheidt, T.: Survey on unsupervised domain adaptation for semantic segmentation for visual perception in automated driving. IEEE Access 11, 54296–54336 (May 2023)

  27. [27] Sima, C., Tong, W., Wang, T., Chen, L., Wu, S., Deng, H., Gu, Y., Lu, L., Luo, P., Lin, D., Li, H.: Scene as occupancy (2023)

  28. [28] Tian, X., Jiang, T., Yun, L., Wang, Y., Wang, Y., Zhao, H.: Occ3D: A large-scale 3d occupancy prediction benchmark for autonomous driving. arXiv preprint arXiv:2304.14365 (2023)

  29. [29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)

  30. [30] Wang, S., Zhang, Y., Vora, A., Perincherry, A., Li, H.: Satellite image based cross-view localization for autonomous vehicle. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 3592–3599. IEEE (2023)

  31. [31] Wang, S., Bai, M., Mattyus, G., Chu, H., Luo, W., Yang, B., Liang, J., Cheverie, J., Fidler, S., Urtasun, R.: TorontoCity: Seeing the world with a million eyes. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 3009–3017 (Oct 2017)

  32. [32] Wang, X., Zhu, Z., Xu, W., Zhang, Y., Wei, Y., Chi, X., Ye, Y., Du, D., Lu, J., Wang, X.: OpenOccupancy: A large scale benchmark for surrounding semantic occupancy perception. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17804–17813. IEEE (2023). https://doi.org/10.1109/ICCV51070.2023.01636

  33. [33] Wei, Y., Zhao, L., Zheng, W., Zhu, Z., Zhou, J., Lu, J.: SurroundOcc: Multi-camera 3d occupancy prediction for autonomous driving. arXiv preprint arXiv:2303.09551 (2023)

  34. [34] Workman, S., Souvenir, R., Jacobs, N.: Wide-area image geolocalization with aerial reference imagery. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 3961–3969 (December 2015)

  35. [35] Workman, S., Souvenir, R., Jacobs, N.: Wide-area image geolocalization with aerial reference imagery. In: IEEE International Conference on Computer Vision (ICCV). pp. 1–9 (2015). https://doi.org/10.1109/ICCV.2015.451

  36. [36] Yuan, T., Liu, Y., Wang, Y., Wang, Y., Zhao, H.: StreamMapNet: Streaming mapping network for vectorized online HD map construction. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 7341–7350. IEEE (2024). https://doi.org/10.1109/WACV57701.2024.00719

  37. [37] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023)

  38. [38] Zhang, Y., Zhu, Z., Zheng, W., Huang, J., Huang, G., Zhou, J., Lu, J.: BEVerse: Unified perception and prediction in bird's-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743 (2022)