pith. sign in

arxiv: 2606.18242 · v1 · pith:V5BVJYV3new · submitted 2026-06-16 · 💻 cs.CV

EventDrive: Event Cameras for Vision-Language Driving Intelligence

Pith reviewed 2026-06-27 01:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords event camerasvision-language modelsautonomous drivingbenchmarkevent streamstemporal precisionmotion awareness
0
0 comments X

The pith

Event cameras integrated with vision-language models deliver gains in temporal precision, motion awareness, and robustness for driving tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EventDrive as a benchmark and model suite that combines asynchronous event streams from event cameras with RGB frames and language supervision. It structures evaluation around four dimensions—Perception, Understanding, Prediction, and Planning—using tasks such as image captioning, question answering, grounding, motion recognition, trajectory forecasting, and planning. EventDrive-VLM adds a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to fuse the different data types. A sympathetic reader would care because the work positions event sensing as a practical addition that addresses failure modes of frame-based systems in fast motion, blur, and glare. If the gains hold, event data would shift from niche to core input for reliable driving intelligence.

Core claim

EventDrive unifies event streams, RGB frames, and language supervision across Perception, Understanding, Prediction, and Planning with tasks including captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning; the accompanying EventDrive-VLM uses a multi-horizon event pyramid and temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous event information with frame-based data, yielding substantial gains in temporal precision, motion awareness, and robustness.

What carries the argument

The multi-horizon event pyramid together with the temporal-horizon mixture-of-experts module that adaptively encodes asynchronous event data and fuses it with RGB frames for downstream reasoning.

If this is right

  • Event streams improve accuracy on motion-state recognition and trajectory forecasting through higher temporal resolution.
  • The same fusion approach increases robustness on perception and grounding tasks when conventional frames suffer from blur or glare.
  • Planning and prediction modules receive more reliable motion cues, reducing errors that arise from missed temporal structure in RGB data alone.
  • Event sensing moves from an optional complement to a central component in vision-language driving systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid event-RGB models could be tested in existing simulation environments to measure end-to-end latency reductions before hardware deployment.
  • The benchmark tasks could be extended to include multi-agent interactions or long-horizon route planning to check whether gains persist beyond single-vehicle perception.
  • Manufacturers might evaluate whether the added temporal precision justifies the cost and calibration overhead of adding event cameras to production sensor suites.

Load-bearing premise

The chosen tasks and four dimensions are assumed to serve as a representative proxy for the complete driving decision loop.

What would settle it

A closed-loop vehicle test in which EventDrive-VLM planning outputs are compared against RGB-only baselines on collision rate and task completion under rapid motion and varying illumination; absence of measurable improvement would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.18242 by Ao Liang, Benoit R. Cottereau, Camille Simon Chane, Dongyue Lu, Lai Xing Ng, Lingdong Kong, Rong Li, Wei Tsang Ooi, Wei Yin.

Figure 1
Figure 1. Figure 1: Overview of the EventDrive benchmark. The dataset contains 471k event–frame–language samples across four levels of driving reasoning spanning 17 subtasks: Perception evaluates scene-level context such as scene type, traffic, and illumination. Understanding assesses object-centric semantics, including presence, motion state, and grounding. Prediction infers short-horizon motion intent of surrounding agents.… view at source ↗
Figure 2
Figure 2. Figure 2: Annotation pipelines of EventDrive. Perception converts scene-level attributes into structured QA; Understanding gen￾erates object-level semantic captions and transforms them into QA; Prediction extracts trajectories, applies ego-frame transformation, and assigns motion labels; Planning constructs ego-centric waypoints and produces corresponding decision-oriented supervision. task that aligns event represe… view at source ↗
Figure 3
Figure 3. Figure 3: EventDrive-VLM Overview. We first convert asynchronous events into multi-horizon voxel tensors that capture motion at different temporal scales. A dynamic horizon event encoder then aggregates these representations through a Mixture-of-Experts gating mechanism (cf . Sec. 4.1). An Event Q-Former performs cross-attention to extract language-aligned, motion-aware tokens, enabling co￾herent fusion of multi-mod… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on EventDrive comparing EventDrive-VLM with Qwen. Events remain reliable under low light and motion, improving scene perception, object understand￾ing, motion prediction, and ego intent estimation. tinctions and yields the weakest results, confirming that in￾discriminate fusion fails to preserve horizon-specific infor￾mation. Weighted summation (“Wt.sum”) improves perfor￾mance by allowi… view at source ↗
Figure 5
Figure 5. Figure 5: Prompt used to generate a scene caption capturing six [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt used to assess the model’s Perception capa￾bility in driving scenes. and planning), we process agent and ego trajectories in the ego frame and derive high-level speed and path intents via kinematic rules. All annotations are formatted into consis￾tent two-turn Qwen-style conversations, ensuring that every modality, including event, image, and text, contributes a co￾herent supervisory signal for trai… view at source ↗
Figure 8
Figure 8. Figure 8: Prompt used to assess the model’s Understanding capability in driving scenes. excluded from final training data. ■ Stage 3: Multiple-Choice Construction and Format￾ting. The generated QA objects are then converted into the final format used by our training pipeline. Although the LLM provides candidate answers, we replace all op￾tions with a controlled set of answer choices for consistency across the datase… view at source ↗
Figure 9
Figure 9. Figure 9: Prompt used to assess the model’s object-awareness capability in driving scenes. sistant turn outputs the ground-truth choice in the required “letter text” format (e.g., “B Low light”). Once the QA pairs are constructed, we apply the prompt￾ing template illustrated in [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt used to generate QA pairs from a scene caption that encodes six essential attributes of the driving scene. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompt used to assess the model’s Prediction ca￾pability in driving scenes. ground-truth pose logs, where each record contains a times￾tamp, position (t, x, y, z) and orientation represented as a quaternion. Given the image timestamp, the nearest pose is identified, and a temporal window of the past 3 seconds and future 5 seconds is collected around this index. Let the global trajectory be: P = {pt = (xt,… view at source ↗
Figure 13
Figure 13. Figure 13: Prompt used to assess the model’s ego path and speed [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Prompt used to generate an object caption capturing essential object attributes of the driving scene. [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Prompt used to generate QA pairs from an object caption that encodes essential object attributes in the driving scene. [PITH_FULL_IMAGE:figures/full_fig_p030_15.png] view at source ↗
read the original abstract

Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces EventDrive, a large-scale benchmark and model suite unifying event streams, RGB frames, and language supervision for autonomous driving across four dimensions (Perception, Understanding, Prediction, Planning) with tasks including captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning. It proposes EventDrive-VLM featuring a multi-horizon event pyramid and temporal-horizon mixture-of-experts module for fusing asynchronous event data with frames, claiming that comprehensive evaluations demonstrate substantial gains from event streams in temporal precision, motion awareness, and robustness.

Significance. If the empirical gains hold under rigorous validation and the task suite adequately represents driving decision-making, the work could establish event cameras as a core modality in vision-language models for driving, extending their role from low-level perception to higher-level reasoning and planning under challenging conditions like blur and rapid motion.

major comments (1)
  1. [Abstract] Abstract and task selection section: the claim that event streams bring sensing 'into the center of driving intelligence' rests on the assumption that the four dimensions and six tasks (captions, QA, grounding, motion-state recognition, trajectory forecasting, planning) form a representative proxy for the full driving loop; however, these are open-loop proxies that omit closed-loop control, vehicle dynamics integration, and safety-critical decision metrics, so observed improvements in temporal precision do not establish the headline conclusion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the scope of our benchmark. We agree that the evaluated tasks are open-loop proxies and will revise the abstract and task selection section to reflect this limitation more precisely while preserving the core contribution.

read point-by-point responses
  1. Referee: [Abstract] Abstract and task selection section: the claim that event streams bring sensing 'into the center of driving intelligence' rests on the assumption that the four dimensions and six tasks (captions, QA, grounding, motion-state recognition, trajectory forecasting, planning) form a representative proxy for the full driving loop; however, these are open-loop proxies that omit closed-loop control, vehicle dynamics integration, and safety-critical decision metrics, so observed improvements in temporal precision do not establish the headline conclusion.

    Authors: We acknowledge the validity of this observation. Our benchmark explicitly frames the four dimensions (Perception, Understanding, Prediction, Planning) and six tasks as open-loop proxies that isolate the contribution of event data to temporal precision and motion reasoning without closed-loop simulation or vehicle dynamics. The headline phrasing in the abstract was intended to emphasize that event sensing enables higher-level reasoning within these core components of driving intelligence, which are necessary (though not sufficient) for the full loop. We agree the language risks overgeneralization. In revision we will (1) qualify the abstract claim to specify 'open-loop driving intelligence tasks' and (2) add a limitations paragraph clarifying the absence of closed-loop control and safety metrics. These changes preserve the empirical findings while aligning the narrative with the evaluated scope. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or fitted predictions

full rationale

The paper introduces a benchmark and model suite (EventDrive) and reports empirical gains from evaluations on perception, understanding, prediction, and planning tasks. No equations, first-principles derivations, parameter fits, or predictions are described that could reduce to inputs by construction. Claims rest on observed performance differences across tasks rather than any self-referential or fitted structure. This is a standard empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the described tasks and fusion module capture genuine event-driven improvements.

pith-pipeline@v0.9.1-grok · 5757 in / 1107 out tokens · 23396 ms · 2026-06-27T01:24:18.674588+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

75 extracted references · 23 canonical work pages · 15 internal anchors

  1. [1]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. LLaV A-OneVision- 1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

  4. [4]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, et al. InternLM2 technical report.arXiv preprint arXiv:2403.17297, 2024

  5. [5]

    Hu Cao, Guang Chen, Zhijun Li, Yingbai Hu, and Alois Knoll. NeuroGrasp: multimodal neural network with euler region regression for neuromorphic vision-based grasp pose estimation.IEEE Transactions on Instrumentation and Mea- surement, 71:1–11, 2022

  6. [6]

    Embracing events and frames with hierarchical feature refinement network for object detec- tion

    Hu Cao, Zehua Zhang, Yan Xia, Xinyi Li, Jiahao Xia, Guang Chen, and Alois Knoll. Embracing events and frames with hierarchical feature refinement network for object detec- tion. InEuropean Conference on Computer Vision. Springer, 2024

  7. [7]

    Recent event camera innovations: A survey

    Bharatesh Chakravarthi, Aayush Atul Verma, Kostas Dani- ilidis, Cornelia Fermuller, and Yezhou Yang. Recent event camera innovations: A survey. InEuropean Conference on Computer Vision Workshops. Springer, 2024

  8. [8]

    Ani Hsieh, Christopher Korpela, Vijay Ku- mar, Camillo J

    Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M. Ani Hsieh, Christopher Korpela, Vijay Ku- mar, Camillo J. Taylor, and Kostas Daniilidis. M3ED: Multi-robot, multi-sensor, multi-environment event dataset. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4016–4023, 2023

  9. [9]

    Multi-cue event information fu- sion for pedestrian detection with neuromorphic vision sen- sors.Frontiers in Neurorobotics, 13:10, 2019

    Guang Chen, Hu Cao, Canbo Ye, Zhenyan Zhang, Xingbo Liu, Xuhui Mo, Zhongnan Qu, J ¨org Conradt, Florian R¨ohrbein, and Alois Knoll. Multi-cue event information fu- sion for pedestrian detection with neuromorphic vision sen- sors.Frontiers in Neurorobotics, 13:10, 2019

  10. [10]

    Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion

    Nicholas FY Chen. Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 644–653, 2018

  11. [11]

    CLIP2Scene: Towards label-efficient 3D scene under- standing by CLIP

    Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. CLIP2Scene: Towards label-efficient 3D scene under- standing by CLIP. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7020–7030, 2023

  12. [12]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024

  13. [13]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024

  14. [14]

    InternVL: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  15. [15]

    Impromptu VLA: Open weights and open data for driving vision-language-action models

    Haohan Chi, Huan ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, and Hao Zhao. Impromptu VLA: Open weights and open data for driving vision-language-action models. arXiv preprint arXiv:2505.23757, 2025

  16. [16]

    Label-free event-based object recognition via joint learning with image reconstruction from events

    Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, and Kuk- Jin Yoon. Label-free event-based object recognition via joint learning with image reconstruction from events. In IEEE/CVF International Conference on Computer Vision, pages 19866–19877, 2023

  17. [17]

    Ob- ject detection with spiking neural networks on automotive event data

    Lo ¨ıc Cordone, Benoˆıt Miramond, and Philippe Thierion. Ob- ject detection with spiking neural networks on automotive event data. InInternational Joint Conference on Neural Net- works, pages 1–8, 2022

  18. [18]

    Cottereau, Fran- cisco Barranco, and Timoth ´ee Masquelier

    Javier Cuadrado, Ulysse Ranc ¸on, Benoit R. Cottereau, Fran- cisco Barranco, and Timoth ´ee Masquelier. Optical flow es- timation from event-based cameras and spiking neural net- works.Frontiers in Neuroscience, 17:1160034, 2023

  19. [19]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. FlashAttention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

  20. [20]

    Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022

    Guillermo Gallego, Tobi Delbr ¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022

  21. [21]

    arXiv preprint arXiv:2410.16261 (2024)

    Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-InternVL: A flexible-transfer pocket multimodal model with 5% parameters and 90% perfor- mance.arXiv preprint arXiv:2410.16261, 2024

  22. [22]

    Pushing the limits of asynchronous graph-based object detection with event cam- eras.arXiv preprint arXiv:2211.12324, 2022

    Daniel Gehrig and Davide Scaramuzza. Pushing the limits of asynchronous graph-based object detection with event cam- eras.arXiv preprint arXiv:2211.12324, 2022

  23. [23]

    Low-latency auto- motive vision with event cameras.Nature, 629(8014):1034– 1040, 2024

    Daniel Gehrig and Davide Scaramuzza. Low-latency auto- motive vision with event cameras.Nature, 629(8014):1034– 1040, 2024

  24. [24]

    Derpa- nis, and Davide Scaramuzza

    Daniel Gehrig, Antonio Loquercio, Konstantinos G. Derpa- nis, and Davide Scaramuzza. End-to-end learning of repre- sentations for asynchronous event-based data. InIEEE/CVF International Conference on Computer Vision, pages 5633– 5643, 2019

  25. [25]

    EKLT: asynchronous photometric feature tracking using events and frames.International Journal of Computer Vision, 128(3):601–618, 2020

    Daniel Gehrig, Henri Rebecq, Guillermo Gallego, and Da- vide Scaramuzza. EKLT: asynchronous photometric feature tracking using events and frames.International Journal of Computer Vision, 128(3):601–618, 2020

  26. [26]

    Combining events and frames using recurrent asynchronous multimodal net- works for monocular depth prediction.IEEE Robotics and Automation Letters, 6(2):2822–2829, 2021

    Daniel Gehrig, Michelle R ¨uegg, Mathias Gehrig, Javier Hidalgo-Carri´o, and Davide Scaramuzza. Combining events and frames using recurrent asynchronous multimodal net- works for monocular depth prediction.IEEE Robotics and Automation Letters, 6(2):2822–2829, 2021

  27. [27]

    Recurrent vi- sion transformers for object detection with event cameras

    Mathias Gehrig and Davide Scaramuzza. Recurrent vi- sion transformers for object detection with event cameras. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13884–13893, 2023

  28. [28]

    DSEC: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021

    Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. DSEC: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021

  29. [29]

    Hierarchical neural memory network for low latency event processing

    Ryuhei Hamaguchi, Yasutaka Furukawa, Masaki Onishi, and Ken Sakurada. Hierarchical neural memory network for low latency event processing. InIEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 22867–22876, 2023

  30. [30]

    Vision-language-action models for autonomous driving: Past, present, and future

    Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, and Junwei Liang. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.1...

  31. [31]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wen- hai Wang, et al. Planning-oriented autonomous driving. InIEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  32. [32]

    Towards event-driven object detection with off-the-shelf deep learning

    Massimiliano Iacono, Stefan Weber, Arren Glover, and Chiara Bartolozzi. Towards event-driven object detection with off-the-shelf deep learning. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1–9, 2018

  33. [33]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, et al. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

  34. [34]

    Mixed frame- /event-driven fast pedestrian detection

    Zhuangyi Jiang, Pengfei Xia, Kai Huang, Walter Stechele, Guang Chen, Zhenshan Bing, and Alois Knoll. Mixed frame- /event-driven fast pedestrian detection. InIEEE Interna- tional Conference on Robotics and Automation, pages 8332– 8338, 2019

  35. [35]

    HPL-ESS: Hybrid pseudo-labeling for unsupervised event-based semantic segmentation

    Linglin Jing, Yiming Ding, Yunpeng Gao, Zhigang Wang, Xu Yan, Dong Wang, Gerald Schaefer, Hui Fang, Bin Zhao, and Xuelong Li. HPL-ESS: Hybrid pseudo-labeling for unsupervised event-based semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23128–23137, 2024

  36. [36]

    N-ImageNet: Towards robust, fine-grained object recognition with event cameras

    Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and Young Min Kim. N-ImageNet: Towards robust, fine-grained object recognition with event cameras. InIEEE/CVF Inter- national Conference on Computer Vision, pages 2146–2156, 2021

  37. [37]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInternational Conference for Learning Representations, 2015

  38. [38]

    Cot- tereau, and Wei Tsang Ooi

    Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R. Cot- tereau, and Wei Tsang Ooi. OpenESS: Event-based semantic scene understanding with open vocabularies. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15686–15698, 2024

  39. [39]

    Talk2Event: Grounded understanding of dynamic scenes from event cameras

    Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, and Benoit R Cottereau. Talk2Event: Grounded understanding of dynamic scenes from event cameras. InAdvances in Neu- ral Information Processing Systems, 2025

  40. [40]

    Cottereau

    Lingdong Kong, Dongyue Lu, Xiang Xu, Lai Xing Ng, Wei Tsang Ooi, and Benoit R. Cottereau. EventFly: Event camera perception from ground to the sky. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1472–1484, 2025

  41. [41]

    Multi- modal data-efficient 3D scene understanding for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748–3765, 2025

    Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, and Ziwei Liu. Multi- modal data-efficient 3D scene understanding for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748–3765, 2025

  42. [42]

    SODFormer: Streaming object detection with transformer using events and frames.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):14020–14037, 2023

    Dianze Li, Yonghong Tian, and Jianing Li. SODFormer: Streaming object detection with transformer using events and frames.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):14020–14037, 2023

  43. [43]

    Event-based vision enhanced: A joint de- tection framework in autonomous driving

    Jianing Li, Siwei Dong, Zhaofei Yu, Yonghong Tian, and Tiejun Huang. Event-based vision enhanced: A joint de- tection framework in autonomous driving. InIEEE Inter- national Conference on Multimedia and Expo, pages 1396– 1401, 2019

  44. [44]

    Asynchronous spatio-temporal memory net- work for continuous event-based object detection.IEEE Transactions on Image Processing, 31:2975–2987, 2022

    Jianing Li, Jia Li, Lin Zhu, Xijie Xiang, Tiejun Huang, and Yonghong Tian. Asynchronous spatio-temporal memory net- work for continuous event-based object detection.IEEE Transactions on Image Processing, 31:2975–2987, 2022

  45. [45]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational Conference on Machine Learning, pages 19730– 19742. PMLR, 2023

  46. [46]

    EventVL: Understand event streams via multimodal large language model.arXiv preprint arXiv:2501.13707, 2025

    Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, and Hui Xiong. EventVL: Understand event streams via multimodal large language model.arXiv preprint arXiv:2501.13707, 2025

  47. [47]

    SeeGround: See and ground for zero-shot open- vocabulary 3D visual grounding

    Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, and Jun- wei Liang. SeeGround: See and ground for zero-shot open- vocabulary 3D visual grounding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3707– 3717, 2025

  48. [48]

    3EED: Ground everything everywhere in 3D

    Rong Li et al. 3EED: Ground everything everywhere in 3D. arXiv preprint arXiv:2511.01755, 2025

  49. [49]

    Perspective- invariant 3D object detection

    Ao Liang, Lingdong Kong, Dongyue Lu, Youquan Liu, Jian Fang, Huaici Zhao, and Wei Tsang Ooi. Perspective- invariant 3D object detection. InIEEE/CVF International Conference on Computer Vision, pages 27725–27738, 2025

  50. [50]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023

  51. [51]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916, 2023

  52. [52]

    Event- GPT: Event stream understanding with multimodal large lan- guage models.arXiv preprint arXiv:2412.00832, 2024

    Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, and Ming Li. Event- GPT: Event stream understanding with multimodal large lan- guage models.arXiv preprint arXiv:2412.00832, 2024

  53. [53]

    FlexEvent: Towards flexible event-frame object detection at varying operational frequen- cies

    Dongyue Lu, Lingdong Kong, Gim Hee Lee, Camille Simon Chane, and Wei Tsang Ooi. FlexEvent: Towards flexible event-frame object detection at varying operational frequen- cies. InAdvances in Neural Information Processing Systems, 2025

  54. [54]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. DeepSeek-VL: Towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525, 2024

  55. [55]

    Event-based asynchronous sparse con- volutional networks

    Nico Messikommer, Daniel Gehrig, Antonio Loquercio, and Davide Scaramuzza. Event-based asynchronous sparse con- volutional networks. InEuropean Conference on Computer Vision, pages 415–431. Springer, 2020

  56. [56]

    Scene adaptive sparse transformer for event-based object detection

    Yansong Peng, Hebei Li, Yueyi Zhang, Xiaoyan Sun, and Feng Wu. Scene adaptive sparse transformer for event-based object detection. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16794–16804, 2024

  57. [57]

    Learning to detect objects with a 1 megapixel event camera.Advances in Neural Information Processing Systems, 33:16639–16652, 2020

    Etienne Perot, Pierre De Tournemire, Davide Nitti, Jonathan Masci, and Amos Sironi. Learning to detect objects with a 1 megapixel event camera.Advances in Neural Information Processing Systems, 33:16639–16652, 2020

  58. [58]

    Qwen2.5 Technical Report

    Qwen, An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

  59. [59]

    AEGNN: Asynchronous event-based graph neural networks

    Simon Schaefer, Daniel Gehrig, and Davide Scaramuzza. AEGNN: Asynchronous event-based graph neural networks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12371–12381, 2022

  60. [60]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outra- geously large neural networks: The sparsely-gated mixture- of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  61. [61]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  62. [62]

    Neuromor- phic stereo vision: A survey of bio-inspired sensors and al- gorithms.Frontiers in Neuroscience, 13:28, 2019

    Lea Steffen, Daniel Reichard, Jakob Weinland, Jacques Kaiser, Arne Roennau, and R ¨udiger Dillmann. Neuromor- phic stereo vision: A survey of bio-inspired sensors and al- gorithms.Frontiers in Neuroscience, 13:28, 2019

  63. [63]

    Event-based object detection using graph neural networks

    Daobo Sun and Haibo Ji. Event-based object detection using graph neural networks. InIEEE Conference on Data Driven Control and Learning Systems, pages 1895–1900, 2023

  64. [64]

    Event-based fusion for motion deblurring with cross-modal attention

    Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Event-based fusion for motion deblurring with cross-modal attention. InEuropean Conference on Computer Vision, pages 412–428. Springer, 2022

  65. [65]

    ESS: Learning event-based semantic seg- mentation from still images

    Zhaoning Sun, Nico Messikommer, Daniel Gehrig, and Da- vide Scaramuzza. ESS: Learning event-based semantic seg- mentation from still images. InEuropean Conference on Computer Vision, pages 341–357. Springer, 2022

  66. [66]

    Fusing event- based and RGB camera for robust object detection in adverse conditions

    Abhishek Tomy, Anshul Paigwar, Khushdeep S Mann, Alessandro Renzaglia, and Christian Laugier. Fusing event- based and RGB camera for robust object detection in adverse conditions. InIEEE International Conference on Robotics and Automation, pages 933–939, 2022

  67. [67]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  68. [68]

    PARA-Drive: Parallelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. PARA-Drive: Parallelized architecture for real-time autonomous driving. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449– 15458, 2024

  69. [69]

    EventCLIP: Adapting CLIP for event-based object recognition.arXiv preprint arXiv:2306.06354, 2023

    Ziyi Wu, Xudong Liu, and Igor Gilitschenski. EventCLIP: Adapting CLIP for event-based object recognition.arXiv preprint arXiv:2306.06354, 2023

  70. [70]

    LEOD: Label-efficient object detection for event cameras

    Ziyi Wu, Mathias Gehrig, Qing Lyu, Xudong Liu, and Igor Gilitschenski. LEOD: Label-efficient object detection for event cameras. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16933–16943, 2024

  71. [71]

    Spiking transformers for event-based single object tracking

    Jiqing Zhang, Bo Dong, Haiwei Zhang, Jianchuan Ding, Fe- lix Heide, Baocai Yin, and Xin Yang. Spiking transformers for event-based single object tracking. InIEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 8801–8810, 2022

  72. [72]

    LLaFEA: Frame-event complementary fusion for fine-grained spatiotemporal un- derstanding in LMMs

    Hanyu Zhou and Gim Hee Lee. LLaFEA: Frame-event complementary fusion for fine-grained spatiotemporal un- derstanding in LMMs. InIEEE/CVF International Confer- ence on Computer Vision, pages 22294–22304, 2025

  73. [73]

    RGB-event fusion for moving object detection in autonomous driving

    Zhuyun Zhou, Zongwei Wu, R ´emi Boutteau, Fan Yang, C´edric Demonceaux, and Dominique Ginhac. RGB-event fusion for moving object detection in autonomous driving. InIEEE International Conference on Robotics and Automa- tion, pages 7808–7815, 2023

  74. [74]

    EV-FlowNet: Self- supervised optical flow estimation for event-based cameras

    Alex Zihao Zhu and Liangzhe Yuan. EV-FlowNet: Self- supervised optical flow estimation for event-based cameras. InRobotics: Science and Systems, 2018

  75. [75]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025