InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making
Pith reviewed 2026-05-08 16:57 UTC · model grok-4.3
The pith
Adding an event-camera modality to InterFuser's sensor fusion raises route completion to 100 percent on CARLA benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that integrating DVS data through a novel token-based fusion strategy into the transformer backbone of InterFuser produces a more robust driving policy. This policy leverages the complementary strengths of RGB, LiDAR, and asynchronous event streams to maintain performance where conventional sensors degrade. On the CARLA Leaderboard, the resulting agent records a competitive Driving Score of 77.2 and a perfect Route Completion of 100 percent, indicating that event-based vision supplies useful information precisely in adverse lighting and dynamic conditions.
What carries the argument
Token-based fusion strategy that converts accumulated event frames into tokens and merges them with RGB and LiDAR tokens inside the transformer backbone of the extended InterFuser model.
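To make that mechanism concrete, here is a minimal sketch of what such token-based fusion could look like: each modality is patchified into tokens, tagged with a learned modality embedding, and passed through a shared transformer encoder. The module names, patch size, channel counts, and embedding width are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of token-based multi-modal fusion (not the authors' code).
# Assumes RGB, LiDAR BEV, and accumulated DVS frames are already rasterized;
# each modality is patchified into tokens and fed to one shared transformer.
import torch
import torch.nn as nn

class TokenFusionSketch(nn.Module):
    def __init__(self, d_model=256, patch=16, n_layers=4, n_heads=8):
        super().__init__()
        # Per-modality patch embeddings (channel counts are illustrative).
        self.rgb_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.lidar_embed = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        self.dvs_embed = nn.Conv2d(2, d_model, kernel_size=patch, stride=patch)  # ON/OFF polarity channels
        # Learned modality embeddings so the backbone can tell token sources apart.
        self.modality_embed = nn.Parameter(torch.zeros(3, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def _tokens(self, x, embed, modality_idx):
        # (B, C, H, W) -> (B, N_patches, d_model), plus a modality offset.
        t = embed(x).flatten(2).transpose(1, 2)
        return t + self.modality_embed[modality_idx]

    def forward(self, rgb, lidar_bev, dvs_frame):
        tokens = torch.cat([
            self._tokens(rgb, self.rgb_embed, 0),
            self._tokens(lidar_bev, self.lidar_embed, 1),
            self._tokens(dvs_frame, self.dvs_embed, 2),
        ], dim=1)
        return self.backbone(tokens)  # fused token sequence for downstream heads

fused = TokenFusionSketch()(torch.randn(1, 3, 128, 128),
                            torch.randn(1, 1, 128, 128),
                            torch.randn(1, 2, 128, 128))
print(fused.shape)  # torch.Size([1, 192, 256]): 64 tokens per modality
```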
If this is right
- The agent maintains higher reliability when traditional cameras suffer motion blur at high speeds.
- Perception stays effective in scenes with extreme brightness changes where RGB saturates or loses detail.
- The same fusion approach can be applied to other reinforcement-learning driving policies that already use transformer backbones.
- Event data reduces dependence on perfect synchronization between RGB and LiDAR streams.
Where Pith is reading between the lines
- The method may generalize to real-vehicle hardware if event cameras are mounted alongside existing sensors.
- Further gains could appear if the fusion tokens are weighted dynamically according to scene conditions.
- Similar token fusion could be tested in other domains that need low-latency vision such as drone navigation.
- Longer evaluation runs on CARLA might reveal whether the 100 percent route completion holds across more weather and traffic variations.
Load-bearing premise
The token-based fusion method actually succeeds in letting the transformer use event information in a way that measurably improves robustness over RGB-plus-LiDAR alone.
What would settle it
Re-running the same CARLA evaluation with the DVS branch removed and observing that the Driving Score and Route Completion stay statistically unchanged or improve.
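A minimal sketch of how that settling experiment could be organized, assuming a hypothetical `evaluate_agent` entry point (stubbed here with placeholder numbers) that runs one seed on identical routes, weather, and traffic while toggling only the DVS branch:

```python
# Sketch of the settling experiment: matched CARLA runs with and without the
# DVS branch. `evaluate_agent` is a hypothetical stand-in for the evaluation
# entry point of the leaderboard harness; it is stubbed with placeholder
# numbers so the comparison loop is runnable.
import random

SEEDS = [0, 1, 2, 3, 4]

def evaluate_agent(seed: int, use_dvs: bool) -> dict:
    # Placeholder: in practice this would launch one CARLA evaluation per seed
    # on identical routes, weather, and traffic, toggling only the DVS branch.
    random.seed(seed + (1000 if use_dvs else 0))
    return {"driving_score": random.uniform(60, 80),
            "route_completion": random.uniform(90, 100)}

def run_condition(use_dvs: bool) -> list[float]:
    return [evaluate_agent(seed, use_dvs)["driving_score"] for seed in SEEDS]

with_dvs = run_condition(use_dvs=True)
without_dvs = run_condition(use_dvs=False)
print("Driving Score with DVS:   ", with_dvs)
print("Driving Score without DVS:", without_dvs)
```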
Original abstract
Autonomous driving systems rely heavily on robust sensor fusion to perceive complex environments. Traditional setups using RGB cameras and LiDAR often struggle in high-dynamic-range scenes or high-speed scenarios due to motion blur and latency. Dynamic Vision Sensors (DVS), or event cameras, offer a paradigm shift by capturing asynchronous brightness changes with microsecond temporal resolution and high dynamic range. In this paper, we propose an extended architecture of the state-of-the-art InterFuser model, integrating DVS as an additional modality to enhance perception reliability. We introduce a novel token-based fusion strategy that incorporates accumulated event frames into the transformer-based backbone of InterFuser. Our method leverages the complementary nature of RGB, LiDAR, and DVS data. We evaluate our approach on the Car Learning to Act (CARLA) Leaderboard benchmarks, demonstrating that the inclusion of DVS improves the robustness of the driving agent, achieving a competitive Driving Score of 77.2 and a superior Route Completion of 100%. The results indicate that event-based vision is a promising direction for improving safety and performance in adverse lighting and dynamic conditions.
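For readers unfamiliar with event accumulation, here is a minimal sketch of one common scheme for turning an asynchronous event stream into a fixed-rate, two-channel (ON/OFF polarity) frame. The resolution and window length are illustrative, and this is not necessarily the accumulation scheme the paper uses.

```python
# Sketch of accumulating asynchronous DVS events (x, y, timestamp, polarity)
# into a two-channel frame over a fixed time window. Resolution and window
# length are illustrative; the paper's exact accumulation scheme may differ.
import numpy as np

H, W = 128, 128
WINDOW_US = 20_000  # 20 ms accumulation window

def accumulate_events(events: np.ndarray, t_start: float) -> np.ndarray:
    """events: (N, 4) array of [x, y, t_us, polarity in {-1, +1}]."""
    frame = np.zeros((2, H, W), dtype=np.float32)
    t = events[:, 2]
    in_window = (t >= t_start) & (t < t_start + WINDOW_US)
    for x, y, _, p in events[in_window]:
        channel = 0 if p > 0 else 1          # separate ON / OFF polarities
        frame[channel, int(y), int(x)] += 1  # count events per pixel
    return frame

# Toy stream: 1000 random events within one window.
rng = np.random.default_rng(0)
toy_events = np.stack([rng.integers(0, W, 1000),
                       rng.integers(0, H, 1000),
                       rng.uniform(0, WINDOW_US, 1000),
                       rng.choice([-1, 1], 1000)], axis=1)
print(accumulate_events(toy_events, t_start=0).sum())  # all 1000 events binned
```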
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes InterFuserDVS, an extension of the InterFuser architecture that adds Dynamic Vision Sensors (DVS) as a third modality. It introduces a token-based fusion mechanism to incorporate accumulated event frames into the transformer backbone alongside RGB and LiDAR inputs. The central empirical claim is that this fusion improves robustness in high-dynamic-range and high-speed driving scenarios, as evidenced by a Driving Score of 77.2 and 100% Route Completion on the CARLA Leaderboard.
Significance. If the reported gains can be isolated to the DVS modality through controlled experiments, the work would provide concrete evidence that event-based sensing complements conventional cameras and LiDAR in RL-based autonomous driving, particularly under motion blur and lighting extremes. The absence of such controls currently prevents any assessment of whether the result advances the state of multi-modal fusion.
major comments (3)
- [Abstract] Abstract: The reported Driving Score of 77.2 and Route Completion of 100% are presented without any matched baseline run of the original InterFuser or an ablation that removes only the DVS branch while holding architecture, RL training procedure, and CARLA scenario distribution fixed. Given the high variance typical of RL driving agents, a single scalar score does not establish that the token-based fusion produces measurable robustness gains.
- [Methods] Methods / fusion description: The token-based fusion strategy is introduced at a conceptual level but supplies no equations, pseudocode, or architectural diagram showing how accumulated event frames are tokenized, embedded, and cross-attended with RGB and LiDAR tokens inside the transformer backbone. Without these details the claim that DVS data are “effectively incorporated” cannot be evaluated.
- [Experiments] Experiments: No implementation details, training hyperparameters, random seeds, error bars, statistical tests, or leaderboard comparison table are provided. The manuscript therefore offers no way to determine whether the 77.2 score differs from prior InterFuser results by more than training stochasticity.
minor comments (1)
- [Abstract] Abstract: The phrase “competitive Driving Score” is used without stating the current leaderboard rank or the exact metric definitions employed by CARLA.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We agree that the current manuscript lacks the necessary controls, technical details, and experimental rigor to substantiate the claimed benefits of DVS integration. We will perform a major revision to address all points raised.
Point-by-point responses
Referee: [Abstract] Abstract: The reported Driving Score of 77.2 and Route Completion of 100% are presented without any matched baseline run of the original InterFuser or an ablation that removes only the DVS branch while holding architecture, RL training procedure, and CARLA scenario distribution fixed. Given the high variance typical of RL driving agents, a single scalar score does not establish that the token-based fusion produces measurable robustness gains.
Authors: We acknowledge the validity of this criticism. The revised manuscript will include a matched baseline evaluation of the original InterFuser under identical training conditions, random seeds, and CARLA scenario distributions. We will also add an ablation study that isolates the DVS branch by removing it while keeping all other components fixed. These additions will enable direct quantification of the contribution of the token-based fusion beyond training stochasticity. revision: yes
Referee: [Methods] Methods / fusion description: The token-based fusion strategy is introduced at a conceptual level but supplies no equations, pseudocode, or architectural diagram showing how accumulated event frames are tokenized, embedded, and cross-attended with RGB and LiDAR tokens inside the transformer backbone. Without these details the claim that DVS data are “effectively incorporated” cannot be evaluated.
Authors: We agree that the fusion mechanism requires precise specification. In the revision we will add the full mathematical formulation for event-frame accumulation, tokenization, embedding, and the cross-attention equations within the transformer. We will also include pseudocode for the fusion module and a detailed architectural diagram that illustrates the data flow from DVS events through to the shared backbone. revision: yes
Referee: [Experiments] Experiments: No implementation details, training hyperparameters, random seeds, error bars, statistical tests, or leaderboard comparison table are provided. The manuscript therefore offers no way to determine whether the 77.2 score differs from prior InterFuser results by more than training stochasticity.
Authors: We accept this observation. The revised experiments section will report complete implementation details, the full set of training hyperparameters, the number of random seeds, standard deviations across runs, results of statistical significance tests, and a comparison table that places the 77.2 score against the original InterFuser and other published CARLA leaderboard entries under consistent evaluation protocols. revision: yes
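A minimal sketch of the kind of seed-level reporting the revision promises, using placeholder per-seed Driving Scores rather than results from the paper:

```python
# Sketch of seed-level reporting: mean, standard deviation, and a paired
# significance test over matched seeds. The per-seed Driving Scores below are
# placeholder numbers, not results from the paper.
from statistics import mean, stdev
from scipy.stats import ttest_rel

ds_with_dvs = [78.1, 76.4, 77.9, 75.8, 77.0]     # placeholder values
ds_without_dvs = [74.2, 75.1, 73.8, 76.0, 74.5]  # placeholder values

print(f"with DVS:    {mean(ds_with_dvs):.1f} +/- {stdev(ds_with_dvs):.1f}")
print(f"without DVS: {mean(ds_without_dvs):.1f} +/- {stdev(ds_without_dvs):.1f}")

# Paired test, since both conditions share seeds, routes, and scenarios.
res = ttest_rel(ds_with_dvs, ds_without_dvs)
print(f"paired t-test: t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```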
Circularity Check
No significant circularity; empirical simulation results with no derivations or self-referential reductions.
Full rationale
The manuscript describes an architectural extension of InterFuser that adds a DVS branch and a token-based fusion module, then reports CARLA leaderboard metrics (Driving Score 77.2, Route Completion 100%). No equations, parameter-fitting procedures, or derivation steps appear in the provided text. The central claim rests on direct simulation outcomes rather than any quantity that is defined in terms of itself or obtained by renaming a fitted input. Self-citations, if present, are not load-bearing for any mathematical result. The evaluation chain is therefore self-contained against the external CARLA benchmark and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: RGB, LiDAR, and DVS modalities are complementary for perception in adverse conditions.