
arxiv: 2508.02127 · v3 · pith:HY5TVU3Mnew · submitted 2025-08-04 · 💻 cs.CV

Enhancing Event-based Object Detection with Monocular Normal Maps

Pith reviewed 2026-05-22 12:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-based object detection · surface normal maps · multimodal fusion · autonomous driving · geometric priors · trimodal network · NRE-Net

The pith

RGB-derived surface normal maps supply geometric priors that improve event-based object detection under difficult lighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Event cameras resist illumination changes but produce dense, misleading signals from reflections and sudden contrast shifts. The authors derive surface normal maps from RGB images to supply stable low-frequency structural information that remains available even when RGB quality drops. They build NRE-Net, a trimodal network that fuses these normals with RGB appearance and event dynamics through two dedicated fusion modules. Experiments on DSEC-Det-sub and PKU-DAVIS-SOD show the added priors deliver a 3.0% AP50 gain over dual-modal baselines and outperform prior fusion methods.
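To make the failure mode concrete, here is a minimal sketch of the standard contrast-threshold event model, approximated with two frames. It is an illustration only, not code from the paper: real sensors fire asynchronously per pixel, and the function name `events_from_frames` and the threshold value 0.2 are assumptions chosen for the example.

```python
import numpy as np

def events_from_frames(prev, curr, threshold=0.2, eps=1e-6):
    """Frame-difference approximation of an event camera.

    A pixel emits +1 or -1 when its log-intensity change reaches the
    contrast threshold, and 0 otherwise. The threshold is hypothetical.
    """
    delta = np.log(curr + eps) - np.log(prev + eps)
    polarity = np.zeros(delta.shape, dtype=np.int8)
    polarity[delta >= threshold] = 1
    polarity[delta <= -threshold] = -1
    return polarity

# A static 8x8 scene at uniform intensity: nothing moves.
prev = np.full((8, 8), 0.5)

# A reflection suddenly brightens a 4x4 patch, with no object motion at all.
curr = prev.copy()
curr[2:6, 2:6] = 0.9

ev = events_from_frames(prev, curr)
event_count = int(np.count_nonzero(ev))
print(event_count)  # every pixel in the brightened patch fires
```

The patch produces a dense block of positive events even though no object is present, which is the kind of signal a detector can mistake for structure.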

Core claim

Surface normal maps extracted from monocular RGB images act as explicit geometric constraints that assist event-based object detection. The NRE-Net framework first aligns geometric and appearance cues with the Adaptive Dual-stream Fusion Module, then selectively integrates high-frequency event dynamics with the Event-modality Aware Fusion Module. This trimodal integration yields a 3.0% AP50 improvement over dual-modal baselines and outperforms SFNet and SODFormer on DSEC-Det-sub and PKU-DAVIS-SOD.

What carries the argument

NRE-Net, a trimodal network that uses the Adaptive Dual-stream Fusion Module to align normal maps with RGB and the Event-modality Aware Fusion Module to incorporate event information. The normal maps provide the structural priors.
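The two-stage ordering can be sketched with a toy example. The page does not describe the internals of either module, so the code below is not ADFM or EAFM: it substitutes a simple sigmoid-gated blend for the first stage and a weighted additive update for the second. All function names, tensor shapes, and weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_blend(normal_feat, rgb_feat, gate_logits):
    """Stage 1 stand-in: convex blend of geometric and appearance features."""
    g = sigmoid(gate_logits)
    return g * normal_feat + (1.0 - g) * rgb_feat

def add_events(fused_feat, event_feat, event_weight):
    """Stage 2 stand-in: add event features scaled by a weight in [0, 1]."""
    w = np.clip(event_weight, 0.0, 1.0)
    return fused_feat + w * event_feat

rng = np.random.default_rng(0)
shape = (8, 16, 16)  # hypothetical (channels, height, width)
normal_feat = rng.normal(size=shape)
rgb_feat = rng.normal(size=shape)
event_feat = rng.normal(size=shape)

gate_logits = np.zeros(shape)          # untrained gate: equal blend
event_weight = np.zeros((1, 16, 16))   # 0 means events are fully suppressed
event_weight[:, 4:12, 4:12] = 1.0      # trust events only in a central window

stage1 = gated_blend(normal_feat, rgb_feat, gate_logits)
stage2 = add_events(stage1, event_feat, event_weight)
```

The point of the ordering is that the event stream is merged into a representation that already carries geometric structure, so a spatial weight can suppress events where they are unreliable without discarding the other two modalities.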

If this is right

  • Geometric priors from normals deliver an additional 3.0% AP50 over dual-modal event-plus-RGB baselines.
  • The trimodal system outperforms SFNet by 2.7% and SODFormer by 7.1% on the evaluated autonomous-driving datasets.
  • Normal maps help suppress misleading event signals triggered by reflections and contrast changes.
  • The approach remains effective when RGB quality is reduced because the normals retain low-frequency structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-frequency geometric priors could be tested in other event-based tasks such as segmentation or optical flow.
  • Deriving normals from sources other than RGB might further increase robustness when RGB is unavailable.
  • Real-time vehicle systems could use this fusion to maintain detection accuracy across wider ranges of lighting without requiring perfectly exposed RGB frames.

Load-bearing premise

RGB-derived surface normal maps preserve useful low-frequency structural information even when the source RGB image is degraded by illumination problems.

What would settle it

Measure AP50 on DSEC-Det-sub with and without the normal-map input branch; absence of a roughly 3% gain would contradict the central claim.
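The comparison proposed above depends on how AP50 is computed. The following is a simplified single-class, single-image sketch with greedy matching at IoU 0.5 and all-point interpolation. The official evaluation protocols for DSEC-Det-sub and PKU-DAVIS-SOD may differ in detail, and the boxes and scores are invented toy values, not results from the paper.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def ap50(detections, ground_truth):
    """detections: list of (score, box); ground_truth: list of boxes."""
    if not detections or not ground_truth:
        return 0.0
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched, tp = set(), []
    for _, box in detections:
        # Greedily match each detection to its best still-unmatched ground truth.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(ground_truth):
            if j not in matched:
                v = iou(box, gt)
                if v > best_iou:
                    best_iou, best_j = v, j
        if best_iou >= 0.5:
            matched.add(best_j)
            tp.append(1)
        else:
            tp.append(0)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    recall = cum_tp / len(ground_truth)
    # Precision envelope: make precision non-increasing in recall.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
without_normals = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60))]  # one false positive
with_normals = [(0.9, (0, 0, 10, 10)), (0.8, (20, 20, 30, 30))]     # both objects found

ap_without = ap50(without_normals, gt)
ap_with = ap50(with_normals, gt)
gain = ap_with - ap_without
```

In a real ablation the detections would be pooled over the whole test split and averaged over classes, with the normal-map branch as the only difference between the two runs.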

Figures

Figures reproduced from arXiv: 2508.02127 by Chuang Zhu, Hanqing Liu, Luoping Cui, Mingjie Liu.

Figure 1: Under adverse lighting conditions, distracting objects …
Figure 2: Pipeline of the proposed NRE-Net. (a) Three parallel branches extract complementary cues from RGB images, event streams, …
Figure 3: Qualitative comparison of detection results in challenging …
Original abstract

Object detection in autonomous driving is frequently compromised by complex illumination. While event cameras offer a robust solution, they are susceptible to sudden contrast changes such as reflections which often trigger dense, misleading event signals. To overcome this, we leverage RGB-derived surface normal maps as explicit geometric constraints. Crucially, even when RGB degrades, they preserve low-frequency structural priors that effectively assist in event-based detection. Consequently, we present NRE-Net, a trimodal framework that integrates structural priors from surface Normal maps, appearance context from RGB images, and high-frequency dynamics from Events. The Adaptive Dual-stream Fusion Module (ADFM) first aligns geometric and appearance cues, followed by the Event-modality Aware Fusion Module (EAFM) which selectively integrates event dynamics. Extensive evaluations on DSEC-Det-sub and PKU-DAVIS-SOD demonstrate that incorporating geometric priors yields an additional 3.0% AP50 gain over dual-modal baselines, while our approach consistently outperforms fusion methods such as SFNet (+2.7%) and SODFormer (+7.1%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes NRE-Net, a trimodal framework for object detection in challenging illumination that fuses RGB-derived monocular surface normal maps (as geometric priors), RGB appearance, and event data. It introduces an Adaptive Dual-stream Fusion Module (ADFM) to align geometric and appearance cues followed by an Event-modality Aware Fusion Module (EAFM) for selective event integration. Experiments on DSEC-Det-sub and PKU-DAVIS-SOD report a 3.0% AP50 gain over dual-modal baselines and consistent outperformance of SFNet (+2.7%) and SODFormer (+7.1%).

Significance. If the central assumption holds, the work offers a practical route to leverage geometric priors for robust event-based detection when RGB degrades. The reported empirical gains on named datasets are concrete and address a real autonomous-driving pain point; however, the absence of direct validation on normal-map fidelity under the same adverse conditions that produce dense misleading events limits how strongly the results can be interpreted as evidence for the geometric-prior mechanism.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the headline claim that 'even when RGB degrades, [normals] preserve low-frequency structural priors' is load-bearing for the entire contribution, yet the manuscript supplies no quantitative check (e.g., normal estimation error or cosine similarity) on the degraded illumination/reflection subsets of DSEC-Det-sub or PKU-DAVIS-SOD. Without this, it is impossible to attribute the 3.0% AP50 uplift specifically to the geometric signal rather than to other fusion effects.
  2. [§3.2] §3.2 (ADFM and EAFM): the modules are presented as selectively exploiting the normal priors, but no ablation or robustness analysis is given for the case in which the monocular estimator produces inaccurate normals under the same illumination changes that trigger dense events. This leaves open whether the reported gains would survive realistic normal-map noise.
minor comments (2)
  1. [Abstract] The abstract states concrete percentage improvements but omits any mention of the normal-map estimator used or the training protocol; adding one sentence would improve reproducibility assessment.
  2. [Figure 2] Figure captions for the fusion-module diagrams would benefit from explicit notation of input/output tensor shapes to match the text description.
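The quantitative check requested in major comment 1 could take the following form: a minimal sketch of per-pixel cosine similarity and mean angular error between an estimated normal map and a reference map. The function names and the synthetic 4x4 maps are assumptions for illustration. The page does not say which reference normals, if any, are available for the two datasets.

```python
import numpy as np

def normalize(n, eps=1e-8):
    """Scale each 3-vector along the last axis to unit length."""
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), eps)

def normal_agreement(pred, ref):
    """pred, ref: (H, W, 3) normal maps.

    Returns (mean cosine similarity, mean angular error in degrees).
    """
    cos = np.sum(normalize(pred) * normalize(ref), axis=-1)
    cos = np.clip(cos, -1.0, 1.0)  # guard arccos against rounding
    return float(cos.mean()), float(np.degrees(np.arccos(cos)).mean())

# Reference: a flat surface facing the camera.
ref = np.zeros((4, 4, 3))
ref[..., 2] = 1.0

# Prediction: top half tilted 45 degrees about the y-axis, bottom half correct.
pred = ref.copy()
pred[:2, :, :] = [1.0, 0.0, 1.0]

mean_cos, mean_deg = normal_agreement(pred, ref)
print(mean_cos, mean_deg)
```

Reporting these two numbers separately for well-lit and degraded subsets would show directly whether the normals hold up where the events become misleading.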

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical relevance of leveraging geometric priors in event-based detection under challenging illumination. We address each major comment below and will incorporate revisions to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claim that 'even when RGB degrades, [normals] preserve low-frequency structural priors' is load-bearing for the entire contribution, yet the manuscript supplies no quantitative check (e.g., normal estimation error or cosine similarity) on the degraded illumination/reflection subsets of DSEC-Det-sub or PKU-DAVIS-SOD. Without this, it is impossible to attribute the 3.0% AP50 uplift specifically to the geometric signal rather than to other fusion effects.

    Authors: We agree that a direct quantitative assessment of normal-map fidelity on the adverse subsets would allow stronger attribution of the observed gains to the geometric priors. In the revised manuscript we will add a new analysis (new table and discussion in §4) that reports proxy measures of normal quality—such as consistency with depth-derived normals where available and visual inspection of low-frequency structure preservation—on the illumination-degraded and reflection-heavy subsets of both datasets. We will also correlate these observations with the per-scene detection improvements to better isolate the contribution of the normal stream. revision: yes

  2. Referee: [§3.2] §3.2 (ADFM and EAFM): the modules are presented as selectively exploiting the normal priors, but no ablation or robustness analysis is given for the case in which the monocular estimator produces inaccurate normals under the same illumination changes that trigger dense events. This leaves open whether the reported gains would survive realistic normal-map noise.

    Authors: This is a fair point on the robustness of the proposed fusion modules. We will add a dedicated ablation study in the revised §4 that injects controlled noise (Gaussian perturbations at varying levels) into the input normal maps and re-evaluates ADFM and EAFM performance. The results will quantify how detection accuracy degrades as normal quality decreases and will demonstrate that the selective integration mechanisms in both modules retain benefit even under moderate normal-map inaccuracies. revision: yes
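The noise-injection ablation described in the second response can be sketched as follows: add Gaussian perturbations to each normal vector, renormalize to unit length, and record how far the result drifts from the clean map. The sigma levels, map size, and function name are hypothetical. The rebuttal does not specify them.

```python
import numpy as np

def perturb_normals(normals, sigma, rng):
    """Add isotropic Gaussian noise to (H, W, 3) normals and renormalize."""
    noisy = normals + rng.normal(0.0, sigma, size=normals.shape)
    norm = np.linalg.norm(noisy, axis=-1, keepdims=True)
    return noisy / np.maximum(norm, 1e-8)

def mean_angular_error_deg(a, b):
    """Mean angle in degrees between two maps of unit normals."""
    cos = np.clip(np.sum(a * b, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

rng = np.random.default_rng(0)

# Clean input: a flat surface facing the camera.
clean = np.zeros((16, 16, 3))
clean[..., 2] = 1.0

errors = {}
for sigma in (0.0, 0.1, 0.3):  # hypothetical noise levels
    noisy = perturb_normals(clean, sigma, rng)
    errors[sigma] = mean_angular_error_deg(noisy, clean)

print(errors)
```

In the actual study each noisy map would be fed to the detector, so that AP50 can be plotted against the induced angular error rather than against sigma alone.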

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations or self-referential predictions

Full rationale

The paper introduces NRE-Net as a trimodal fusion architecture (Normal + RGB + Events) with modules ADFM and EAFM, but presents no equations, first-principles derivations, or parameter-fitting steps that could reduce to inputs by construction. Performance claims (e.g., +3.0% AP50) are framed exclusively as outcomes of experiments on DSEC-Det-sub and PKU-DAVIS-SOD. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the provided text; the geometric-prior assumption is stated as a hypothesis validated by results rather than defined into existence. This is a standard empirical CV paper whose central claims remain externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that monocular normal maps supply useful structural priors under RGB degradation; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption: RGB-derived surface normal maps preserve low-frequency structural priors that effectively assist in event-based detection even when RGB degrades.
    Presented as the key reason normal maps help when event signals are misleading.




Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    RE-VLM is the first dual-stream VLM combining RGB and event data with a graph-based pipeline to generate training captions and QA pairs, showing gains over RGB-only and event-only models on new datasets for challengin...

  2. Sparse Hypergraph-Enhanced Frame-Event Object Detection with Fine-Grained MoE

    cs.CV 2026-04 unverdicted novelty 6.0

    Hyper-FEOD fuses RGB and event data via sparse hypergraph cross-modal fusion and region-specialized MoE experts to improve accuracy-efficiency in object detection.

  3. RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    RE-VLM fuses RGB and event data in a dual-stream VLM with a graph-based pipeline for generating training captions and QA pairs, plus two new datasets, showing gains over RGB-only and event-only baselines especially in...
