EventDrive: Event Cameras for Vision-Language Driving Intelligence

Ao Liang; Benoit R. Cottereau; Camille Simon Chane; Dongyue Lu; Lai Xing Ng; Lingdong Kong; Rong Li; Wei Tsang Ooi; Wei Yin

arxiv: 2606.18242 · v1 · pith:V5BVJYV3new · submitted 2026-06-16 · 💻 cs.CV

EventDrive: Event Cameras for Vision-Language Driving Intelligence

Dongyue Lu , Rong Li , Ao Liang , Lingdong Kong , Wei Yin , Lai Xing Ng , Benoit R. Cottereau , Camille Simon Chane

show 1 more author

Wei Tsang Ooi

This is my paper

Pith reviewed 2026-06-27 01:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords event camerasvision-language modelsautonomous drivingbenchmarkevent streamstemporal precisionmotion awareness

0 comments

The pith

Event cameras integrated with vision-language models deliver gains in temporal precision, motion awareness, and robustness for driving tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EventDrive as a benchmark and model suite that combines asynchronous event streams from event cameras with RGB frames and language supervision. It structures evaluation around four dimensions—Perception, Understanding, Prediction, and Planning—using tasks such as image captioning, question answering, grounding, motion recognition, trajectory forecasting, and planning. EventDrive-VLM adds a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to fuse the different data types. A sympathetic reader would care because the work positions event sensing as a practical addition that addresses failure modes of frame-based systems in fast motion, blur, and glare. If the gains hold, event data would shift from niche to core input for reliable driving intelligence.

Core claim

EventDrive unifies event streams, RGB frames, and language supervision across Perception, Understanding, Prediction, and Planning with tasks including captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning; the accompanying EventDrive-VLM uses a multi-horizon event pyramid and temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous event information with frame-based data, yielding substantial gains in temporal precision, motion awareness, and robustness.

What carries the argument

The multi-horizon event pyramid together with the temporal-horizon mixture-of-experts module that adaptively encodes asynchronous event data and fuses it with RGB frames for downstream reasoning.

If this is right

Event streams improve accuracy on motion-state recognition and trajectory forecasting through higher temporal resolution.
The same fusion approach increases robustness on perception and grounding tasks when conventional frames suffer from blur or glare.
Planning and prediction modules receive more reliable motion cues, reducing errors that arise from missed temporal structure in RGB data alone.
Event sensing moves from an optional complement to a central component in vision-language driving systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid event-RGB models could be tested in existing simulation environments to measure end-to-end latency reductions before hardware deployment.
The benchmark tasks could be extended to include multi-agent interactions or long-horizon route planning to check whether gains persist beyond single-vehicle perception.
Manufacturers might evaluate whether the added temporal precision justifies the cost and calibration overhead of adding event cameras to production sensor suites.

Load-bearing premise

The chosen tasks and four dimensions are assumed to serve as a representative proxy for the complete driving decision loop.

What would settle it

A closed-loop vehicle test in which EventDrive-VLM planning outputs are compared against RGB-only baselines on collision rate and task completion under rapid motion and varying illumination; absence of measurable improvement would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.18242 by Ao Liang, Benoit R. Cottereau, Camille Simon Chane, Dongyue Lu, Lai Xing Ng, Lingdong Kong, Rong Li, Wei Tsang Ooi, Wei Yin.

**Figure 1.** Figure 1: Overview of the EventDrive benchmark. The dataset contains 471k event–frame–language samples across four levels of driving reasoning spanning 17 subtasks: Perception evaluates scene-level context such as scene type, traffic, and illumination. Understanding assesses object-centric semantics, including presence, motion state, and grounding. Prediction infers short-horizon motion intent of surrounding agents.… view at source ↗

**Figure 2.** Figure 2: Annotation pipelines of EventDrive. Perception converts scene-level attributes into structured QA; Understanding generates object-level semantic captions and transforms them into QA; Prediction extracts trajectories, applies ego-frame transformation, and assigns motion labels; Planning constructs ego-centric waypoints and produces corresponding decision-oriented supervision. task that aligns event represe… view at source ↗

**Figure 3.** Figure 3: EventDrive-VLM Overview. We first convert asynchronous events into multi-horizon voxel tensors that capture motion at different temporal scales. A dynamic horizon event encoder then aggregates these representations through a Mixture-of-Experts gating mechanism (cf . Sec. 4.1). An Event Q-Former performs cross-attention to extract language-aligned, motion-aware tokens, enabling coherent fusion of multi-mod… view at source ↗

**Figure 4.** Figure 4: Qualitative results on EventDrive comparing EventDrive-VLM with Qwen. Events remain reliable under low light and motion, improving scene perception, object understanding, motion prediction, and ego intent estimation. tinctions and yields the weakest results, confirming that indiscriminate fusion fails to preserve horizon-specific information. Weighted summation (“Wt.sum”) improves performance by allowi… view at source ↗

**Figure 5.** Figure 5: Prompt used to generate a scene caption capturing six [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt used to assess the model’s Perception capability in driving scenes. and planning), we process agent and ego trajectories in the ego frame and derive high-level speed and path intents via kinematic rules. All annotations are formatted into consistent two-turn Qwen-style conversations, ensuring that every modality, including event, image, and text, contributes a coherent supervisory signal for trai… view at source ↗

**Figure 8.** Figure 8: Prompt used to assess the model’s Understanding capability in driving scenes. excluded from final training data. ■ Stage 3: Multiple-Choice Construction and Formatting. The generated QA objects are then converted into the final format used by our training pipeline. Although the LLM provides candidate answers, we replace all options with a controlled set of answer choices for consistency across the datase… view at source ↗

**Figure 9.** Figure 9: Prompt used to assess the model’s object-awareness capability in driving scenes. sistant turn outputs the ground-truth choice in the required “letter text” format (e.g., “B Low light”). Once the QA pairs are constructed, we apply the prompting template illustrated in [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Prompt used to generate QA pairs from a scene caption that encodes six essential attributes of the driving scene. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Prompt used to assess the model’s Prediction capability in driving scenes. ground-truth pose logs, where each record contains a timestamp, position (t, x, y, z) and orientation represented as a quaternion. Given the image timestamp, the nearest pose is identified, and a temporal window of the past 3 seconds and future 5 seconds is collected around this index. Let the global trajectory be: P = {pt = (xt,… view at source ↗

**Figure 13.** Figure 13: Prompt used to assess the model’s ego path and speed [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt used to generate an object caption capturing essential object attributes of the driving scene. [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

**Figure 15.** Figure 15: Prompt used to generate QA pairs from an object caption that encodes essential object attributes in the driving scene. [PITH_FULL_IMAGE:figures/full_fig_p030_15.png] view at source ↗

read the original abstract

Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EventDrive adds a benchmark and MoE architecture for event-based vision-language driving tasks, but the abstract gives no numbers or ablations to check the claimed gains.

read the letter

The paper's main move is to create EventDrive, a benchmark that ties event streams to RGB and language across perception, understanding, prediction, and planning, plus a model with a multi-horizon event pyramid and temporal-horizon mixture-of-experts to fuse the data. That setup is new on the surface and tries to push event cameras past basic motion detection into higher-level driving reasoning.

The work does a clean job of naming the gap: most event-aware VLMs stop at perception, while this one adds structured QA, grounding, trajectory forecasting, and planning. The pyramid and MoE look like a practical way to handle different time scales without forcing everything into a single frame rate.

The soft spot is the lack of any results. The abstract states substantial gains in temporal precision and robustness, yet supplies no tables, error bars, dataset sizes, or ablations. Without those, the gains stay unverified. The stress-test concern also lands: the listed tasks may not stand in for closed-loop control or safety-critical vehicle dynamics, so improvements on the benchmark do not automatically mean better real-world decisions.

This is for researchers already working on event sensors in autonomous driving or multimodal models who need a starting point for event-language benchmarks. A reader who wants concrete evidence before adopting the architecture will find the current version thin.

I would send the full paper to review if the experiments turn out to be properly reported and the task suite is defended against the proxy issue. Otherwise it stays preliminary.

Referee Report

1 major / 0 minor

Summary. The paper introduces EventDrive, a large-scale benchmark and model suite unifying event streams, RGB frames, and language supervision for autonomous driving across four dimensions (Perception, Understanding, Prediction, Planning) with tasks including captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning. It proposes EventDrive-VLM featuring a multi-horizon event pyramid and temporal-horizon mixture-of-experts module for fusing asynchronous event data with frames, claiming that comprehensive evaluations demonstrate substantial gains from event streams in temporal precision, motion awareness, and robustness.

Significance. If the empirical gains hold under rigorous validation and the task suite adequately represents driving decision-making, the work could establish event cameras as a core modality in vision-language models for driving, extending their role from low-level perception to higher-level reasoning and planning under challenging conditions like blur and rapid motion.

major comments (1)

[Abstract] Abstract and task selection section: the claim that event streams bring sensing 'into the center of driving intelligence' rests on the assumption that the four dimensions and six tasks (captions, QA, grounding, motion-state recognition, trajectory forecasting, planning) form a representative proxy for the full driving loop; however, these are open-loop proxies that omit closed-loop control, vehicle dynamics integration, and safety-critical decision metrics, so observed improvements in temporal precision do not establish the headline conclusion.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the scope of our benchmark. We agree that the evaluated tasks are open-loop proxies and will revise the abstract and task selection section to reflect this limitation more precisely while preserving the core contribution.

read point-by-point responses

Referee: [Abstract] Abstract and task selection section: the claim that event streams bring sensing 'into the center of driving intelligence' rests on the assumption that the four dimensions and six tasks (captions, QA, grounding, motion-state recognition, trajectory forecasting, planning) form a representative proxy for the full driving loop; however, these are open-loop proxies that omit closed-loop control, vehicle dynamics integration, and safety-critical decision metrics, so observed improvements in temporal precision do not establish the headline conclusion.

Authors: We acknowledge the validity of this observation. Our benchmark explicitly frames the four dimensions (Perception, Understanding, Prediction, Planning) and six tasks as open-loop proxies that isolate the contribution of event data to temporal precision and motion reasoning without closed-loop simulation or vehicle dynamics. The headline phrasing in the abstract was intended to emphasize that event sensing enables higher-level reasoning within these core components of driving intelligence, which are necessary (though not sufficient) for the full loop. We agree the language risks overgeneralization. In revision we will (1) qualify the abstract claim to specify 'open-loop driving intelligence tasks' and (2) add a limitations paragraph clarifying the absence of closed-loop control and safety metrics. These changes preserve the empirical findings while aligning the narrative with the evaluated scope. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or fitted predictions

full rationale

The paper introduces a benchmark and model suite (EventDrive) and reports empirical gains from evaluations on perception, understanding, prediction, and planning tasks. No equations, first-principles derivations, parameter fits, or predictions are described that could reduce to inputs by construction. Claims rest on observed performance differences across tasks rather than any self-referential or fitted structure. This is a standard empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the described tasks and fusion module capture genuine event-driven improvements.

pith-pipeline@v0.9.1-grok · 5757 in / 1107 out tokens · 23396 ms · 2026-06-27T01:24:18.674588+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

75 extracted references · 23 canonical work pages · 15 internal anchors

[1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. LLaV A-OneVision- 1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

InternLM2 Technical Report

Zheng Cai, Maosong Cao, Haojiong Chen, et al. InternLM2 technical report.arXiv preprint arXiv:2403.17297, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Hu Cao, Guang Chen, Zhijun Li, Yingbai Hu, and Alois Knoll. NeuroGrasp: multimodal neural network with euler region regression for neuromorphic vision-based grasp pose estimation.IEEE Transactions on Instrumentation and Mea- surement, 71:1–11, 2022

2022
[6]

Embracing events and frames with hierarchical feature refinement network for object detec- tion

Hu Cao, Zehua Zhang, Yan Xia, Xinyi Li, Jiahao Xia, Guang Chen, and Alois Knoll. Embracing events and frames with hierarchical feature refinement network for object detec- tion. InEuropean Conference on Computer Vision. Springer, 2024

2024
[7]

Recent event camera innovations: A survey

Bharatesh Chakravarthi, Aayush Atul Verma, Kostas Dani- ilidis, Cornelia Fermuller, and Yezhou Yang. Recent event camera innovations: A survey. InEuropean Conference on Computer Vision Workshops. Springer, 2024

2024
[8]

Ani Hsieh, Christopher Korpela, Vijay Ku- mar, Camillo J

Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M. Ani Hsieh, Christopher Korpela, Vijay Ku- mar, Camillo J. Taylor, and Kostas Daniilidis. M3ED: Multi-robot, multi-sensor, multi-environment event dataset. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4016–4023, 2023

2023
[9]

Multi-cue event information fu- sion for pedestrian detection with neuromorphic vision sen- sors.Frontiers in Neurorobotics, 13:10, 2019

Guang Chen, Hu Cao, Canbo Ye, Zhenyan Zhang, Xingbo Liu, Xuhui Mo, Zhongnan Qu, J ¨org Conradt, Florian R¨ohrbein, and Alois Knoll. Multi-cue event information fu- sion for pedestrian detection with neuromorphic vision sen- sors.Frontiers in Neurorobotics, 13:10, 2019

2019
[10]

Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion

Nicholas FY Chen. Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 644–653, 2018

2018
[11]

CLIP2Scene: Towards label-efficient 3D scene under- standing by CLIP

Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. CLIP2Scene: Towards label-efficient 3D scene under- standing by CLIP. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7020–7030, 2023

2023
[12]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

InternVL: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

2024
[15]

Impromptu VLA: Open weights and open data for driving vision-language-action models

Haohan Chi, Huan ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, and Hao Zhao. Impromptu VLA: Open weights and open data for driving vision-language-action models. arXiv preprint arXiv:2505.23757, 2025

work page arXiv 2025
[16]

Label-free event-based object recognition via joint learning with image reconstruction from events

Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, and Kuk- Jin Yoon. Label-free event-based object recognition via joint learning with image reconstruction from events. In IEEE/CVF International Conference on Computer Vision, pages 19866–19877, 2023

2023
[17]

Ob- ject detection with spiking neural networks on automotive event data

Lo ¨ıc Cordone, Benoˆıt Miramond, and Philippe Thierion. Ob- ject detection with spiking neural networks on automotive event data. InInternational Joint Conference on Neural Net- works, pages 1–8, 2022

2022
[18]

Cottereau, Fran- cisco Barranco, and Timoth ´ee Masquelier

Javier Cuadrado, Ulysse Ranc ¸on, Benoit R. Cottereau, Fran- cisco Barranco, and Timoth ´ee Masquelier. Optical flow es- timation from event-based cameras and spiking neural net- works.Frontiers in Neuroscience, 17:1160034, 2023

2023
[19]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. FlashAttention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022

Guillermo Gallego, Tobi Delbr ¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022

2022
[21]

arXiv preprint arXiv:2410.16261 (2024)

Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-InternVL: A flexible-transfer pocket multimodal model with 5% parameters and 90% perfor- mance.arXiv preprint arXiv:2410.16261, 2024

work page arXiv 2024
[22]

Pushing the limits of asynchronous graph-based object detection with event cam- eras.arXiv preprint arXiv:2211.12324, 2022

Daniel Gehrig and Davide Scaramuzza. Pushing the limits of asynchronous graph-based object detection with event cam- eras.arXiv preprint arXiv:2211.12324, 2022

work page arXiv 2022
[23]

Low-latency auto- motive vision with event cameras.Nature, 629(8014):1034– 1040, 2024

Daniel Gehrig and Davide Scaramuzza. Low-latency auto- motive vision with event cameras.Nature, 629(8014):1034– 1040, 2024

2024
[24]

Derpa- nis, and Davide Scaramuzza

Daniel Gehrig, Antonio Loquercio, Konstantinos G. Derpa- nis, and Davide Scaramuzza. End-to-end learning of repre- sentations for asynchronous event-based data. InIEEE/CVF International Conference on Computer Vision, pages 5633– 5643, 2019

2019
[25]

EKLT: asynchronous photometric feature tracking using events and frames.International Journal of Computer Vision, 128(3):601–618, 2020

Daniel Gehrig, Henri Rebecq, Guillermo Gallego, and Da- vide Scaramuzza. EKLT: asynchronous photometric feature tracking using events and frames.International Journal of Computer Vision, 128(3):601–618, 2020

2020
[26]

Combining events and frames using recurrent asynchronous multimodal net- works for monocular depth prediction.IEEE Robotics and Automation Letters, 6(2):2822–2829, 2021

Daniel Gehrig, Michelle R ¨uegg, Mathias Gehrig, Javier Hidalgo-Carri´o, and Davide Scaramuzza. Combining events and frames using recurrent asynchronous multimodal net- works for monocular depth prediction.IEEE Robotics and Automation Letters, 6(2):2822–2829, 2021

2021
[27]

Recurrent vi- sion transformers for object detection with event cameras

Mathias Gehrig and Davide Scaramuzza. Recurrent vi- sion transformers for object detection with event cameras. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13884–13893, 2023

2023
[28]

DSEC: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021

Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. DSEC: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021

2021
[29]

Hierarchical neural memory network for low latency event processing

Ryuhei Hamaguchi, Yasutaka Furukawa, Masaki Onishi, and Ken Sakurada. Hierarchical neural memory network for low latency event processing. InIEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 22867–22876, 2023

2023
[30]

Vision-language-action models for autonomous driving: Past, present, and future

Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, and Junwei Liang. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.1...

work page arXiv 2025
[31]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wen- hai Wang, et al. Planning-oriented autonomous driving. InIEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

2023
[32]

Towards event-driven object detection with off-the-shelf deep learning

Massimiliano Iacono, Stefan Weber, Arren Glover, and Chiara Bartolozzi. Towards event-driven object detection with off-the-shelf deep learning. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1–9, 2018

2018
[33]

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, et al. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Mixed frame- /event-driven fast pedestrian detection

Zhuangyi Jiang, Pengfei Xia, Kai Huang, Walter Stechele, Guang Chen, Zhenshan Bing, and Alois Knoll. Mixed frame- /event-driven fast pedestrian detection. InIEEE Interna- tional Conference on Robotics and Automation, pages 8332– 8338, 2019

2019
[35]

HPL-ESS: Hybrid pseudo-labeling for unsupervised event-based semantic segmentation

Linglin Jing, Yiming Ding, Yunpeng Gao, Zhigang Wang, Xu Yan, Dong Wang, Gerald Schaefer, Hui Fang, Bin Zhao, and Xuelong Li. HPL-ESS: Hybrid pseudo-labeling for unsupervised event-based semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23128–23137, 2024

2024
[36]

N-ImageNet: Towards robust, fine-grained object recognition with event cameras

Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and Young Min Kim. N-ImageNet: Towards robust, fine-grained object recognition with event cameras. InIEEE/CVF Inter- national Conference on Computer Vision, pages 2146–2156, 2021

2021
[37]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInternational Conference for Learning Representations, 2015

2015
[38]

Cot- tereau, and Wei Tsang Ooi

Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R. Cot- tereau, and Wei Tsang Ooi. OpenESS: Event-based semantic scene understanding with open vocabularies. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15686–15698, 2024

2024
[39]

Talk2Event: Grounded understanding of dynamic scenes from event cameras

Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, and Benoit R Cottereau. Talk2Event: Grounded understanding of dynamic scenes from event cameras. InAdvances in Neu- ral Information Processing Systems, 2025

2025
[40]

Cottereau

Lingdong Kong, Dongyue Lu, Xiang Xu, Lai Xing Ng, Wei Tsang Ooi, and Benoit R. Cottereau. EventFly: Event camera perception from ground to the sky. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1472–1484, 2025

2025
[41]

Multi- modal data-efficient 3D scene understanding for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748–3765, 2025

Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, and Ziwei Liu. Multi- modal data-efficient 3D scene understanding for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748–3765, 2025

2025
[42]

SODFormer: Streaming object detection with transformer using events and frames.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):14020–14037, 2023

Dianze Li, Yonghong Tian, and Jianing Li. SODFormer: Streaming object detection with transformer using events and frames.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):14020–14037, 2023

2023
[43]

Event-based vision enhanced: A joint de- tection framework in autonomous driving

Jianing Li, Siwei Dong, Zhaofei Yu, Yonghong Tian, and Tiejun Huang. Event-based vision enhanced: A joint de- tection framework in autonomous driving. InIEEE Inter- national Conference on Multimedia and Expo, pages 1396– 1401, 2019

2019
[44]

Asynchronous spatio-temporal memory net- work for continuous event-based object detection.IEEE Transactions on Image Processing, 31:2975–2987, 2022

Jianing Li, Jia Li, Lin Zhu, Xijie Xiang, Tiejun Huang, and Yonghong Tian. Asynchronous spatio-temporal memory net- work for continuous event-based object detection.IEEE Transactions on Image Processing, 31:2975–2987, 2022

2022
[45]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational Conference on Machine Learning, pages 19730– 19742. PMLR, 2023

2023
[46]

EventVL: Understand event streams via multimodal large language model.arXiv preprint arXiv:2501.13707, 2025

Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, and Hui Xiong. EventVL: Understand event streams via multimodal large language model.arXiv preprint arXiv:2501.13707, 2025

work page arXiv 2025
[47]

SeeGround: See and ground for zero-shot open- vocabulary 3D visual grounding

Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, and Jun- wei Liang. SeeGround: See and ground for zero-shot open- vocabulary 3D visual grounding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3707– 3717, 2025

2025
[48]

3EED: Ground everything everywhere in 3D

Rong Li et al. 3EED: Ground everything everywhere in 3D. arXiv preprint arXiv:2511.01755, 2025

work page arXiv 2025
[49]

Perspective- invariant 3D object detection

Ao Liang, Lingdong Kong, Dongyue Lu, Youquan Liu, Jian Fang, Huaici Zhao, and Wei Tsang Ooi. Perspective- invariant 3D object detection. InIEEE/CVF International Conference on Computer Vision, pages 27725–27738, 2025

2025
[50]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916, 2023

2023
[52]

Event- GPT: Event stream understanding with multimodal large lan- guage models.arXiv preprint arXiv:2412.00832, 2024

Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, and Ming Li. Event- GPT: Event stream understanding with multimodal large lan- guage models.arXiv preprint arXiv:2412.00832, 2024

work page arXiv 2024
[53]

FlexEvent: Towards flexible event-frame object detection at varying operational frequen- cies

Dongyue Lu, Lingdong Kong, Gim Hee Lee, Camille Simon Chane, and Wei Tsang Ooi. FlexEvent: Towards flexible event-frame object detection at varying operational frequen- cies. InAdvances in Neural Information Processing Systems, 2025

2025
[54]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. DeepSeek-VL: Towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Event-based asynchronous sparse con- volutional networks

Nico Messikommer, Daniel Gehrig, Antonio Loquercio, and Davide Scaramuzza. Event-based asynchronous sparse con- volutional networks. InEuropean Conference on Computer Vision, pages 415–431. Springer, 2020

2020
[56]

Scene adaptive sparse transformer for event-based object detection

Yansong Peng, Hebei Li, Yueyi Zhang, Xiaoyan Sun, and Feng Wu. Scene adaptive sparse transformer for event-based object detection. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16794–16804, 2024

2024
[57]

Learning to detect objects with a 1 megapixel event camera.Advances in Neural Information Processing Systems, 33:16639–16652, 2020

Etienne Perot, Pierre De Tournemire, Davide Nitti, Jonathan Masci, and Amos Sironi. Learning to detect objects with a 1 megapixel event camera.Advances in Neural Information Processing Systems, 33:16639–16652, 2020

2020
[58]

Qwen2.5 Technical Report

Qwen, An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

AEGNN: Asynchronous event-based graph neural networks

Simon Schaefer, Daniel Gehrig, and Davide Scaramuzza. AEGNN: Asynchronous event-based graph neural networks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12371–12381, 2022

2022
[60]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outra- geously large neural networks: The sparsely-gated mixture- of-experts layer.arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[61]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[62]

Neuromor- phic stereo vision: A survey of bio-inspired sensors and al- gorithms.Frontiers in Neuroscience, 13:28, 2019

Lea Steffen, Daniel Reichard, Jakob Weinland, Jacques Kaiser, Arne Roennau, and R ¨udiger Dillmann. Neuromor- phic stereo vision: A survey of bio-inspired sensors and al- gorithms.Frontiers in Neuroscience, 13:28, 2019

2019
[63]

Event-based object detection using graph neural networks

Daobo Sun and Haibo Ji. Event-based object detection using graph neural networks. InIEEE Conference on Data Driven Control and Learning Systems, pages 1895–1900, 2023

1900
[64]

Event-based fusion for motion deblurring with cross-modal attention

Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Event-based fusion for motion deblurring with cross-modal attention. InEuropean Conference on Computer Vision, pages 412–428. Springer, 2022

2022
[65]

ESS: Learning event-based semantic seg- mentation from still images

Zhaoning Sun, Nico Messikommer, Daniel Gehrig, and Da- vide Scaramuzza. ESS: Learning event-based semantic seg- mentation from still images. InEuropean Conference on Computer Vision, pages 341–357. Springer, 2022

2022
[66]

Fusing event- based and RGB camera for robust object detection in adverse conditions

Abhishek Tomy, Anshul Paigwar, Khushdeep S Mann, Alessandro Renzaglia, and Christian Laugier. Fusing event- based and RGB camera for robust object detection in adverse conditions. InIEEE International Conference on Robotics and Automation, pages 933–939, 2022

2022
[67]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[68]

PARA-Drive: Parallelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. PARA-Drive: Parallelized architecture for real-time autonomous driving. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449– 15458, 2024

2024
[69]

EventCLIP: Adapting CLIP for event-based object recognition.arXiv preprint arXiv:2306.06354, 2023

Ziyi Wu, Xudong Liu, and Igor Gilitschenski. EventCLIP: Adapting CLIP for event-based object recognition.arXiv preprint arXiv:2306.06354, 2023

work page arXiv 2023
[70]

LEOD: Label-efficient object detection for event cameras

Ziyi Wu, Mathias Gehrig, Qing Lyu, Xudong Liu, and Igor Gilitschenski. LEOD: Label-efficient object detection for event cameras. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16933–16943, 2024

2024
[71]

Spiking transformers for event-based single object tracking

Jiqing Zhang, Bo Dong, Haiwei Zhang, Jianchuan Ding, Fe- lix Heide, Baocai Yin, and Xin Yang. Spiking transformers for event-based single object tracking. InIEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 8801–8810, 2022

2022
[72]

LLaFEA: Frame-event complementary fusion for fine-grained spatiotemporal un- derstanding in LMMs

Hanyu Zhou and Gim Hee Lee. LLaFEA: Frame-event complementary fusion for fine-grained spatiotemporal un- derstanding in LMMs. InIEEE/CVF International Confer- ence on Computer Vision, pages 22294–22304, 2025

2025
[73]

RGB-event fusion for moving object detection in autonomous driving

Zhuyun Zhou, Zongwei Wu, R ´emi Boutteau, Fan Yang, C´edric Demonceaux, and Dominique Ginhac. RGB-event fusion for moving object detection in autonomous driving. InIEEE International Conference on Robotics and Automa- tion, pages 7808–7815, 2023

2023
[74]

EV-FlowNet: Self- supervised optical flow estimation for event-based cameras

Alex Zihao Zhu and Liangzhe Yuan. EV-FlowNet: Self- supervised optical flow estimation for event-based cameras. InRobotics: Science and Systems, 2018

2018
[75]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. LLaV A-OneVision- 1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

InternLM2 Technical Report

Zheng Cai, Maosong Cao, Haojiong Chen, et al. InternLM2 technical report.arXiv preprint arXiv:2403.17297, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Hu Cao, Guang Chen, Zhijun Li, Yingbai Hu, and Alois Knoll. NeuroGrasp: multimodal neural network with euler region regression for neuromorphic vision-based grasp pose estimation.IEEE Transactions on Instrumentation and Mea- surement, 71:1–11, 2022

2022

[6] [6]

Embracing events and frames with hierarchical feature refinement network for object detec- tion

Hu Cao, Zehua Zhang, Yan Xia, Xinyi Li, Jiahao Xia, Guang Chen, and Alois Knoll. Embracing events and frames with hierarchical feature refinement network for object detec- tion. InEuropean Conference on Computer Vision. Springer, 2024

2024

[7] [7]

Recent event camera innovations: A survey

Bharatesh Chakravarthi, Aayush Atul Verma, Kostas Dani- ilidis, Cornelia Fermuller, and Yezhou Yang. Recent event camera innovations: A survey. InEuropean Conference on Computer Vision Workshops. Springer, 2024

2024

[8] [8]

Ani Hsieh, Christopher Korpela, Vijay Ku- mar, Camillo J

Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M. Ani Hsieh, Christopher Korpela, Vijay Ku- mar, Camillo J. Taylor, and Kostas Daniilidis. M3ED: Multi-robot, multi-sensor, multi-environment event dataset. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4016–4023, 2023

2023

[9] [9]

Multi-cue event information fu- sion for pedestrian detection with neuromorphic vision sen- sors.Frontiers in Neurorobotics, 13:10, 2019

Guang Chen, Hu Cao, Canbo Ye, Zhenyan Zhang, Xingbo Liu, Xuhui Mo, Zhongnan Qu, J ¨org Conradt, Florian R¨ohrbein, and Alois Knoll. Multi-cue event information fu- sion for pedestrian detection with neuromorphic vision sen- sors.Frontiers in Neurorobotics, 13:10, 2019

2019

[10] [10]

Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion

Nicholas FY Chen. Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 644–653, 2018

2018

[11] [11]

CLIP2Scene: Towards label-efficient 3D scene under- standing by CLIP

Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wenping Wang. CLIP2Scene: Towards label-efficient 3D scene under- standing by CLIP. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7020–7030, 2023

2023

[12] [12]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

InternVL: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

2024

[15] [15]

Impromptu VLA: Open weights and open data for driving vision-language-action models

Haohan Chi, Huan ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, and Hao Zhao. Impromptu VLA: Open weights and open data for driving vision-language-action models. arXiv preprint arXiv:2505.23757, 2025

work page arXiv 2025

[16] [16]

Label-free event-based object recognition via joint learning with image reconstruction from events

Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, and Kuk- Jin Yoon. Label-free event-based object recognition via joint learning with image reconstruction from events. In IEEE/CVF International Conference on Computer Vision, pages 19866–19877, 2023

2023

[17] [17]

Ob- ject detection with spiking neural networks on automotive event data

Lo ¨ıc Cordone, Benoˆıt Miramond, and Philippe Thierion. Ob- ject detection with spiking neural networks on automotive event data. InInternational Joint Conference on Neural Net- works, pages 1–8, 2022

2022

[18] [18]

Cottereau, Fran- cisco Barranco, and Timoth ´ee Masquelier

Javier Cuadrado, Ulysse Ranc ¸on, Benoit R. Cottereau, Fran- cisco Barranco, and Timoth ´ee Masquelier. Optical flow es- timation from event-based cameras and spiking neural net- works.Frontiers in Neuroscience, 17:1160034, 2023

2023

[19] [19]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. FlashAttention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022

Guillermo Gallego, Tobi Delbr ¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022

2022

[21] [21]

arXiv preprint arXiv:2410.16261 (2024)

Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-InternVL: A flexible-transfer pocket multimodal model with 5% parameters and 90% perfor- mance.arXiv preprint arXiv:2410.16261, 2024

work page arXiv 2024

[22] [22]

Pushing the limits of asynchronous graph-based object detection with event cam- eras.arXiv preprint arXiv:2211.12324, 2022

Daniel Gehrig and Davide Scaramuzza. Pushing the limits of asynchronous graph-based object detection with event cam- eras.arXiv preprint arXiv:2211.12324, 2022

work page arXiv 2022

[23] [23]

Low-latency auto- motive vision with event cameras.Nature, 629(8014):1034– 1040, 2024

Daniel Gehrig and Davide Scaramuzza. Low-latency auto- motive vision with event cameras.Nature, 629(8014):1034– 1040, 2024

2024

[24] [24]

Derpa- nis, and Davide Scaramuzza

Daniel Gehrig, Antonio Loquercio, Konstantinos G. Derpa- nis, and Davide Scaramuzza. End-to-end learning of repre- sentations for asynchronous event-based data. InIEEE/CVF International Conference on Computer Vision, pages 5633– 5643, 2019

2019

[25] [25]

EKLT: asynchronous photometric feature tracking using events and frames.International Journal of Computer Vision, 128(3):601–618, 2020

Daniel Gehrig, Henri Rebecq, Guillermo Gallego, and Da- vide Scaramuzza. EKLT: asynchronous photometric feature tracking using events and frames.International Journal of Computer Vision, 128(3):601–618, 2020

2020

[26] [26]

Combining events and frames using recurrent asynchronous multimodal net- works for monocular depth prediction.IEEE Robotics and Automation Letters, 6(2):2822–2829, 2021

Daniel Gehrig, Michelle R ¨uegg, Mathias Gehrig, Javier Hidalgo-Carri´o, and Davide Scaramuzza. Combining events and frames using recurrent asynchronous multimodal net- works for monocular depth prediction.IEEE Robotics and Automation Letters, 6(2):2822–2829, 2021

2021

[27] [27]

Recurrent vi- sion transformers for object detection with event cameras

Mathias Gehrig and Davide Scaramuzza. Recurrent vi- sion transformers for object detection with event cameras. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13884–13893, 2023

2023

[28] [28]

DSEC: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021

Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. DSEC: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021

2021

[29] [29]

Hierarchical neural memory network for low latency event processing

Ryuhei Hamaguchi, Yasutaka Furukawa, Masaki Onishi, and Ken Sakurada. Hierarchical neural memory network for low latency event processing. InIEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 22867–22876, 2023

2023

[30] [30]

Vision-language-action models for autonomous driving: Past, present, and future

Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, and Junwei Liang. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.1...

work page arXiv 2025

[31] [31]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wen- hai Wang, et al. Planning-oriented autonomous driving. InIEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

2023

[32] [32]

Towards event-driven object detection with off-the-shelf deep learning

Massimiliano Iacono, Stefan Weber, Arren Glover, and Chiara Bartolozzi. Towards event-driven object detection with off-the-shelf deep learning. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1–9, 2018

2018

[33] [33]

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men- sch, Chris Bamford, et al. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Mixed frame- /event-driven fast pedestrian detection

Zhuangyi Jiang, Pengfei Xia, Kai Huang, Walter Stechele, Guang Chen, Zhenshan Bing, and Alois Knoll. Mixed frame- /event-driven fast pedestrian detection. InIEEE Interna- tional Conference on Robotics and Automation, pages 8332– 8338, 2019

2019

[35] [35]

HPL-ESS: Hybrid pseudo-labeling for unsupervised event-based semantic segmentation

Linglin Jing, Yiming Ding, Yunpeng Gao, Zhigang Wang, Xu Yan, Dong Wang, Gerald Schaefer, Hui Fang, Bin Zhao, and Xuelong Li. HPL-ESS: Hybrid pseudo-labeling for unsupervised event-based semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23128–23137, 2024

2024

[36] [36]

N-ImageNet: Towards robust, fine-grained object recognition with event cameras

Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and Young Min Kim. N-ImageNet: Towards robust, fine-grained object recognition with event cameras. InIEEE/CVF Inter- national Conference on Computer Vision, pages 2146–2156, 2021

2021

[37] [37]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInternational Conference for Learning Representations, 2015

2015

[38] [38]

Cot- tereau, and Wei Tsang Ooi

Lingdong Kong, Youquan Liu, Lai Xing Ng, Benoit R. Cot- tereau, and Wei Tsang Ooi. OpenESS: Event-based semantic scene understanding with open vocabularies. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15686–15698, 2024

2024

[39] [39]

Talk2Event: Grounded understanding of dynamic scenes from event cameras

Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, and Benoit R Cottereau. Talk2Event: Grounded understanding of dynamic scenes from event cameras. InAdvances in Neu- ral Information Processing Systems, 2025

2025

[40] [40]

Cottereau

Lingdong Kong, Dongyue Lu, Xiang Xu, Lai Xing Ng, Wei Tsang Ooi, and Benoit R. Cottereau. EventFly: Event camera perception from ground to the sky. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1472–1484, 2025

2025

[41] [41]

Multi- modal data-efficient 3D scene understanding for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748–3765, 2025

Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, and Ziwei Liu. Multi- modal data-efficient 3D scene understanding for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3748–3765, 2025

2025

[42] [42]

SODFormer: Streaming object detection with transformer using events and frames.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):14020–14037, 2023

Dianze Li, Yonghong Tian, and Jianing Li. SODFormer: Streaming object detection with transformer using events and frames.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):14020–14037, 2023

2023

[43] [43]

Event-based vision enhanced: A joint de- tection framework in autonomous driving

Jianing Li, Siwei Dong, Zhaofei Yu, Yonghong Tian, and Tiejun Huang. Event-based vision enhanced: A joint de- tection framework in autonomous driving. InIEEE Inter- national Conference on Multimedia and Expo, pages 1396– 1401, 2019

2019

[44] [44]

Asynchronous spatio-temporal memory net- work for continuous event-based object detection.IEEE Transactions on Image Processing, 31:2975–2987, 2022

Jianing Li, Jia Li, Lin Zhu, Xijie Xiang, Tiejun Huang, and Yonghong Tian. Asynchronous spatio-temporal memory net- work for continuous event-based object detection.IEEE Transactions on Image Processing, 31:2975–2987, 2022

2022

[45] [45]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational Conference on Machine Learning, pages 19730– 19742. PMLR, 2023

2023

[46] [46]

EventVL: Understand event streams via multimodal large language model.arXiv preprint arXiv:2501.13707, 2025

Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, and Hui Xiong. EventVL: Understand event streams via multimodal large language model.arXiv preprint arXiv:2501.13707, 2025

work page arXiv 2025

[47] [47]

SeeGround: See and ground for zero-shot open- vocabulary 3D visual grounding

Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, and Jun- wei Liang. SeeGround: See and ground for zero-shot open- vocabulary 3D visual grounding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3707– 3717, 2025

2025

[48] [48]

3EED: Ground everything everywhere in 3D

Rong Li et al. 3EED: Ground everything everywhere in 3D. arXiv preprint arXiv:2511.01755, 2025

work page arXiv 2025

[49] [49]

Perspective- invariant 3D object detection

Ao Liang, Lingdong Kong, Dongyue Lu, Youquan Liu, Jian Fang, Huaici Zhao, and Wei Tsang Ooi. Perspective- invariant 3D object detection. InIEEE/CVF International Conference on Computer Vision, pages 27725–27738, 2025

2025

[50] [50]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916, 2023

2023

[52] [52]

Event- GPT: Event stream understanding with multimodal large lan- guage models.arXiv preprint arXiv:2412.00832, 2024

Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, and Ming Li. Event- GPT: Event stream understanding with multimodal large lan- guage models.arXiv preprint arXiv:2412.00832, 2024

work page arXiv 2024

[53] [53]

FlexEvent: Towards flexible event-frame object detection at varying operational frequen- cies

Dongyue Lu, Lingdong Kong, Gim Hee Lee, Camille Simon Chane, and Wei Tsang Ooi. FlexEvent: Towards flexible event-frame object detection at varying operational frequen- cies. InAdvances in Neural Information Processing Systems, 2025

2025

[54] [54]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. DeepSeek-VL: Towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [55]

Event-based asynchronous sparse con- volutional networks

Nico Messikommer, Daniel Gehrig, Antonio Loquercio, and Davide Scaramuzza. Event-based asynchronous sparse con- volutional networks. InEuropean Conference on Computer Vision, pages 415–431. Springer, 2020

2020

[56] [56]

Scene adaptive sparse transformer for event-based object detection

Yansong Peng, Hebei Li, Yueyi Zhang, Xiaoyan Sun, and Feng Wu. Scene adaptive sparse transformer for event-based object detection. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16794–16804, 2024

2024

[57] [57]

Learning to detect objects with a 1 megapixel event camera.Advances in Neural Information Processing Systems, 33:16639–16652, 2020

Etienne Perot, Pierre De Tournemire, Davide Nitti, Jonathan Masci, and Amos Sironi. Learning to detect objects with a 1 megapixel event camera.Advances in Neural Information Processing Systems, 33:16639–16652, 2020

2020

[58] [58]

Qwen2.5 Technical Report

Qwen, An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[59] [59]

AEGNN: Asynchronous event-based graph neural networks

Simon Schaefer, Daniel Gehrig, and Davide Scaramuzza. AEGNN: Asynchronous event-based graph neural networks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12371–12381, 2022

2022

[60] [60]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outra- geously large neural networks: The sparsely-gated mixture- of-experts layer.arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[61] [61]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[62] [62]

Neuromor- phic stereo vision: A survey of bio-inspired sensors and al- gorithms.Frontiers in Neuroscience, 13:28, 2019

Lea Steffen, Daniel Reichard, Jakob Weinland, Jacques Kaiser, Arne Roennau, and R ¨udiger Dillmann. Neuromor- phic stereo vision: A survey of bio-inspired sensors and al- gorithms.Frontiers in Neuroscience, 13:28, 2019

2019

[63] [63]

Event-based object detection using graph neural networks

Daobo Sun and Haibo Ji. Event-based object detection using graph neural networks. InIEEE Conference on Data Driven Control and Learning Systems, pages 1895–1900, 2023

1900

[64] [64]

Event-based fusion for motion deblurring with cross-modal attention

Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Event-based fusion for motion deblurring with cross-modal attention. InEuropean Conference on Computer Vision, pages 412–428. Springer, 2022

2022

[65] [65]

ESS: Learning event-based semantic seg- mentation from still images

Zhaoning Sun, Nico Messikommer, Daniel Gehrig, and Da- vide Scaramuzza. ESS: Learning event-based semantic seg- mentation from still images. InEuropean Conference on Computer Vision, pages 341–357. Springer, 2022

2022

[66] [66]

Fusing event- based and RGB camera for robust object detection in adverse conditions

Abhishek Tomy, Anshul Paigwar, Khushdeep S Mann, Alessandro Renzaglia, and Christian Laugier. Fusing event- based and RGB camera for robust object detection in adverse conditions. InIEEE International Conference on Robotics and Automation, pages 933–939, 2022

2022

[67] [67]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [68]

PARA-Drive: Parallelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. PARA-Drive: Parallelized architecture for real-time autonomous driving. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449– 15458, 2024

2024

[69] [69]

EventCLIP: Adapting CLIP for event-based object recognition.arXiv preprint arXiv:2306.06354, 2023

Ziyi Wu, Xudong Liu, and Igor Gilitschenski. EventCLIP: Adapting CLIP for event-based object recognition.arXiv preprint arXiv:2306.06354, 2023

work page arXiv 2023

[70] [70]

LEOD: Label-efficient object detection for event cameras

Ziyi Wu, Mathias Gehrig, Qing Lyu, Xudong Liu, and Igor Gilitschenski. LEOD: Label-efficient object detection for event cameras. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 16933–16943, 2024

2024

[71] [71]

Spiking transformers for event-based single object tracking

Jiqing Zhang, Bo Dong, Haiwei Zhang, Jianchuan Ding, Fe- lix Heide, Baocai Yin, and Xin Yang. Spiking transformers for event-based single object tracking. InIEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 8801–8810, 2022

2022

[72] [72]

LLaFEA: Frame-event complementary fusion for fine-grained spatiotemporal un- derstanding in LMMs

Hanyu Zhou and Gim Hee Lee. LLaFEA: Frame-event complementary fusion for fine-grained spatiotemporal un- derstanding in LMMs. InIEEE/CVF International Confer- ence on Computer Vision, pages 22294–22304, 2025

2025

[73] [73]

RGB-event fusion for moving object detection in autonomous driving

Zhuyun Zhou, Zongwei Wu, R ´emi Boutteau, Fan Yang, C´edric Demonceaux, and Dominique Ginhac. RGB-event fusion for moving object detection in autonomous driving. InIEEE International Conference on Robotics and Automa- tion, pages 7808–7815, 2023

2023

[74] [74]

EV-FlowNet: Self- supervised optical flow estimation for event-based cameras

Alex Zihao Zhu and Liangzhe Yuan. EV-FlowNet: Self- supervised optical flow estimation for event-based cameras. InRobotics: Science and Systems, 2018

2018

[75] [75]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025